10 Apr 2018 |
World innovation news |
Software Systems, Multimedia and Cybersecurity
Speech Recognition Technology
Speech recognition technology and voice user interfaces (VUIs) are becoming more accurate; their error rate is now only about 5.5 percent. This is comparable to the human error rate: depending on who we are, we miss one to two words out of every 20 we hear, or 5 to 10 percent. For most of us this isn’t a problem, but imagine how difficult it can be for a computer!
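To make the idea of an error rate concrete, here is a minimal sketch of how a word error rate could be computed. The article does not say how its 5.5 percent figure was measured; the sketch assumes the common definition (substituted, deleted and inserted words divided by the number of words actually spoken), and the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: this is the usual word-error-rate definition, not
    necessarily the exact metric behind the figure quoted in the article.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word was dropped
                curr[j - 1] + 1,     # extra word was inserted
                prev[j - 1] + cost,  # words match or one was substituted
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical 20-word sentence with one misheard word ("cat" heard as "cap").
spoken = ("the quick brown fox jumps over the lazy dog while "
          "the small cat sleeps on a warm mat near home")
heard = spoken.replace("cat", "cap")
print(word_error_rate(spoken, heard))
```

Under these assumptions, one wrong word in 20 gives a rate of 5 percent, and two wrong words give 10 percent, which is the range quoted above for human listeners.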
The earliest advances in speech recognition focused mainly on producing vowel sounds and interpreting phonemes. A phoneme is a sound, or a group of closely related sounds, that helps distinguish one word from another. An example is the English phoneme /k/, which occurs in words such as cat, kit, scat and skit. Although the /k/ in these words is perceived as the same sound, it is pronounced slightly differently in each.
The first speech recognition system, called the “Audrey” system, was developed by Bell Laboratories in 1952. It could recognize only spoken numbers, with an accuracy of 90%, and only when they were spoken by one person: its creator. In the 1970s, Carnegie Mellon came up with the “Harpy” speech-understanding system, which was able to recognize over 1000 words and some sentences. It could also recognize different pronunciations of the same word. In 1986, IBM developed Tangora, which brought hidden Markov models to speech recognition and to the prediction of phonemes, paving the way for today’s innovations.
Until the 1990s, even the most successful systems were based on pattern matching: sound waves were translated into a set of numbers and stored in the computer, and the system then compared each new sound spoken into the machine against these stored patterns to find a match. For the system to recognize the sounds, the speaker had to speak very clearly, slowly, and in an environment with no background noise. It was only in 1997 that the world’s first “continuous speech recognizer” appeared; it could understand 100 words per minute and was largely used by doctors.
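The pattern-matching approach described above can be illustrated with a very small sketch. It assumes each word has already been reduced to a short, fixed-length list of numbers; the template values, the function names and the use of straight-line (Euclidean) distance are illustrative choices, not a reconstruction of any historical system. Real systems also had to cope with differences in speaking speed, which this sketch ignores.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length lists of numbers."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(sample, templates):
    """Return the stored word whose pattern is closest to the sample."""
    return min(templates, key=lambda word: distance(sample, templates[word]))


# Hypothetical stored patterns: one made-up list of numbers per word.
templates = {
    "one":   [0.9, 0.1, 0.3],
    "two":   [0.2, 0.8, 0.5],
    "three": [0.4, 0.4, 0.9],
}

# A new utterance that is close to, but not identical to, the "one" pattern.
print(recognize([0.85, 0.15, 0.35], templates))  # prints "one"
```

The sketch also hints at why such systems were fragile: background noise or a faster speaking pace shifts the numbers, and the closest stored pattern may then be the wrong word.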
In 2008, the Google Voice search app for iPhone was launched. It was made possible by powerful cloud-based, data-sharing computing, combined with breakthroughs in the accuracy of machine learning algorithms.
Siri, Apple’s voice user interface, was the first virtual agent to enter the voice recognition market. Since then, voice user experiences have reached critical mass, with Amazon’s Alexa, Microsoft’s Cortana and Google Assistant joining Siri. These companies have also developed smart speakers that bring their assistants into our homes, where they are triggered by voice using wake-up words. The value of the virtual-assistant market, and of the speech recognition it requires, is expected to exceed $3 billion by 2020.
Machines started out understanding speech at the level of phonemes and gradually progressed to individual words, phrases and, finally, full sentences. They can now understand speech almost as accurately as humans. Thanks to smart speakers, they have entered millions of homes, where they can be controlled by voice and even offer conversational responses to a wide range of enquiries.
Marie-Anne Valiquette obtained a Bachelor's degree in Mechanical Engineering at the École de technologie supérieure (ÉTS) in Montreal. She lives in Silicon Valley, California where she studies artificial intelligence through online platforms like Udacity and deeplearning.ai.
Program: Mechanical Engineering