Microsoft (NASDAQ:MSFT) has been fine-tuning its speech recognition software for years. The goal of speech recognition research is to build the most accurate mechanism for dictating letters, reports, and email, and to increase productivity by reducing mouse navigation. Now, researchers say they have built a speech recognition technology that deciphers human conversation just as well as humans do.
Microsoft had already boasted a 6.3 percent word error rate last month. Just this Tuesday, a paper published on the Cornell University Library's arXiv repository reported that a team of researchers and engineers from Microsoft's Artificial Intelligence and Research unit had recorded a 5.9 percent word error rate with its new conversational speech recognition system, a figure roughly on par with human performance. “We’ve reached human parity,” said Microsoft’s chief speech scientist Xuedong Huang. “This is an historic achievement.” It is the lowest error rate ever recorded on the industry-standard Switchboard speech recognition task.
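For readers curious how figures like 5.9 percent are calculated: word error rate is conventionally defined as the word-level edit distance between the recognizer's output and a reference transcript (substitutions, insertions, and deletions), divided by the number of reference words. Here is a minimal, self-contained sketch; the function name and example sentences are illustrative, not from Microsoft's system.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (classic dynamic programming)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("a" for "the") over six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))
```

A 5.9 percent rate means roughly one error of this kind for every seventeen words spoken.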
Speech recognition research dates back to the early 1970s with DARPA, the Defense Advanced Research Projects Agency, a U.S. agency responsible for developing technological advances for the military as a way to aid national security. After decades of research, Microsoft has finally developed artificial intelligence that can decipher the words in a conversation as well as people do. Harry Shum, the software giant's executive vice president of the Artificial Intelligence and Research group, said in the release that even five years ago, he wouldn't have imagined achieving this.
The breakthrough could have a great impact on business and consumer products such as Apple's (NASDAQ:AAPL) smartphone assistant Siri and IBM's Watson cognitive computing system, whose categories of technology stand to be enhanced by advances like this. It might also pave the way for improvements in speech-to-text transcription devices, Windows' digital assistant Cortana, and entertainment gadgets like the Xbox.
It should be noted that Microsoft's Computational Network Toolkit, a deep learning system that runs across multiple computers equipped with graphics processing unit (GPU) chips, was key to Microsoft's success in reaching human parity. It is a system that the Redmond, WA-based tech giant invented and has made available on GitHub under an open-source license.
Even with the huge breakthroughs over the years in speech recognition, Microsoft still has a lot of work to do. Researchers are now working on how the technology can be used in real life. According to Geoffrey Zweig of Microsoft's Speech and Dialog research group, they are focusing on making it work well across different accents, ages, and voice qualities. Difficult conditions also have to be handled, such as background noise, multiple people talking at the same time, and conversations while driving on a highway.
While this research breakthrough is remarkable, the computer, just like humans, is still unable to decipher every word perfectly. However, its word error rate is on par with what you'd expect from a real person listening to the same conversation.
The next phase researchers are preparing for is teaching computers not just to transcribe audio signals but also to understand what humans are saying, so the technology can answer questions or act on what is being said. Amazing, isn't it?