Speech technology has come a long way since Thomas A. Edison invented the first phonograph in 1877. Machines can now understand speech with an accuracy close to human performance (at an error rate of under 5%). Speech technology has become part of our daily routines.
Voice technology has a promising outlook, with strong backing from major players like Google, Amazon, and Microsoft, who have put their weight behind democratizing the technology, continuous research, and ongoing improvement.
Let's start with the current business trends and needs in the consumer domain. Some time back, we saw voice assistants such as Siri and Cortana being built into smartphones. Now the situation is changing with the fast adoption of smart speakers. By 2022, 55% of all US households are expected to have at least one smart speaker. Google and Amazon are specifically focusing on the proliferation of their voice devices and platforms. These devices are driving the expansion of voice-enabled smart devices, including smart fridges, headphones, and smoke alarms, and the list continues to grow. Google reports that 62% of users plan to make a purchase through their voice assistants over the coming month, while 58% use smart speakers to create a weekly shopping list. This essentially indicates that speech technology will turn conversations into purchases.
So, the key takeaway is that consumer products are expected to be voice-enabled at large scale, driven by business benefits. A product can be voice-enabled by joining the Google or Amazon smart speaker ecosystem. At the same time, some OEMs will look beyond those ecosystems and integrate a voice assistant into the product itself.
End consumers expect a personalized experience that improves with time. This requires state-of-the-art NLP technologies beyond speech that continuously strive to understand user intent and context and to deliver relevant responses. Furthermore, end users are expected to have several different voice assistants, for example an Echo Dot at home and Google Assistant integrated into the car, and will expect a seamless conversation as they shift from one to the other during the day.
Content providers have a critical role to play in delivering content to end users. They currently depend on existing voice search apps (Google, Amazon) for content delivery, but they will be looking for a native or cross-platform solution that delivers content consistently while preserving their brand value and exclusive rights to the content.
In other domains such as automobiles and medicine, where voice interfaces are a necessity, voice search apps are slowly catching up, moving from supporting just a few domain-specific keywords and sentences to general-purpose natural voice conversation. The challenge here is to deploy speech technology on the edge for better responsiveness. Speech solutions built on a general-purpose language model are computationally expensive and are best deployed on a high-end server cluster or a cloud platform. However, domain-specific solutions can potentially be optimized for deployment on an edge platform.
While speech and the accompanying NLP technologies are becoming pervasive and useful, we are still some distance away from realizing the true potential of speech recognition technology. Current digital assistants can interpret speech very well but are still far from providing a natural conversational interface. Conversational interfaces need to comprehend the context of a conversation, taking into account aspects such as past conversations, the user’s current mood, and factors related to regional, gender, or political bias.
Speech recognition also faces challenges in scaling to support variations in spoken language based on regional dialect, speed, emphasis, social class, and gender.
Quick Facts:
- The first speech machine was based on acoustic-mechanical voice technology and was created in Vienna.
- Google launched a voice search app on mobile devices way back in 2008.
- A human being can speak about 150 words per minute, compared to about 40 words that can be written in the same time.
- The world’s first “continuous speech recognizer” was released by Dragon in 1997 and is still in use in the medical industry.