In 2004, economists Frank Levy and Richard Murnane published a book detailing which jobs would one day be rendered obsolete by automation.
According to their research, truck drivers could rest easy: their jobs would most likely survive the computer takeover, since they require a different type of information processing – expert thinking and complex decision-making.
Fast forward to today, and companies like German automaker Daimler (Tesla, Uber, and search giant Google are also trying to revolutionize the trucking industry) are running a series of tests to fine-tune the first autonomous truck in the US. Their results show that by eliminating “driver variability,” these trucks – which are already approved for use on public roads – are more fuel-efficient and leave a smaller carbon footprint.
Now, with Microsoft’s speech recognition system reaching human-level accuracy, it looks like transcribers will soon be added to the robots-stole-my-job list.
According to Microsoft (NASDAQ:MSFT), its latest update yielded a 5.1 percent word error rate on the Switchboard speech recognition task, a roughly 12 percent reduction from last year’s 5.9 percent – the average error rate measured for human transcribers. In other words, the system is now performing at the same level as professional human transcribers.
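The figure being quoted is a word error rate (WER): the number of substitutions, insertions, and deletions needed to turn the system’s transcript into the reference, divided by the number of reference words. Here is a minimal sketch of the metric in Python – illustrative only, not Microsoft’s evaluation code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a standard Levenshtein (edit) distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one substitution ("it" -> "uh") out of five reference words -> 20% WER
print(word_error_rate("well i think it is", "well i think uh is"))
```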
What’s even more interesting is that, according to the technical report on the system, the transcriptions made by human transcribers and those produced by the automatic speech recognition (ASR) system were strikingly similar.
Both made the most mistakes on the same short function words, both found the same speakers easy or difficult to transcribe, and human judges could not tell whether a given transcript had been produced by a person or by the computer.
Microsoft achieved this partly by increasing the vocabulary of its language models from 30,500 words to 164,000, which lowered the out-of-vocabulary (OOV) rate.
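As a rough illustration of why vocabulary size matters (a toy sketch, not Microsoft’s tooling): the OOV rate is simply the share of spoken words missing from the recognizer’s word list, so a larger vocabulary directly shrinks it.

```python
def oov_rate(transcript_words, vocabulary) -> float:
    """Fraction of spoken words that are out-of-vocabulary (OOV) for the recognizer."""
    vocab = set(vocabulary)
    oov = sum(1 for w in transcript_words if w not in vocab)
    return oov / len(transcript_words)

words = "so um i was gonna binge-watch that new show tonight".split()
small_vocab = {"so", "um", "i", "was", "gonna", "that", "new", "show", "tonight"}
large_vocab = small_vocab | {"binge-watch"}   # a bigger vocabulary covers rarer words

print(oov_rate(words, small_vocab))  # 0.1 -> one word in ten can never be recognized
print(oov_rate(words, large_vocab))  # 0.0
```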
Accuracy was further improved by adding a CNN-BLSTM (Convolutional Neural Network – Bidirectional Long Short-Term Memory) model to the existing set of acoustic models.
The language models were also strengthened with a combination of two recurrent networks: an utterance-level LSTM that predicts words character by character, and a dialogue session-based LSTM that conditions on the entire conversation so far. This two-pronged approach lets the ASR system handle unusual words and draw on a much wider context than a single utterance.
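Here is a rough sketch of the CNN-BLSTM acoustic-model idea in PyTorch (the layer sizes, feature dimensions, and target count are illustrative assumptions, not Microsoft’s configuration): convolutions pick up local patterns in the acoustic features, and a bidirectional LSTM adds context from both directions before each frame is scored.

```python
import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    """Toy CNN-BLSTM acoustic model: convolutions capture local spectral patterns,
    a bidirectional LSTM adds left and right context, and a linear layer scores
    each frame against the output targets. All sizes here are illustrative."""
    def __init__(self, n_features=40, n_targets=9000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(input_size=32 * n_features, hidden_size=512,
                             num_layers=2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 512, n_targets)   # 2x for forward + backward states

    def forward(self, feats):                      # feats: (batch, time, n_features)
        x = self.conv(feats.unsqueeze(1))          # -> (batch, 32, time, n_features)
        x = x.permute(0, 2, 1, 3).flatten(2)       # -> (batch, time, 32 * n_features)
        x, _ = self.blstm(x)                       # -> (batch, time, 1024)
        return self.out(x)                         # per-frame scores over targets

scores = CNNBLSTM()(torch.randn(2, 100, 40))       # 2 utterances, 100 frames, 40-dim features
print(scores.shape)                                # torch.Size([2, 100, 9000])
```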
The team also leaned on Microsoft’s own products, using the Cognitive Toolkit 2.1 (CNTK) to train the system’s neural networks and the Azure cloud platform to build, manage, and deploy its models.
While there is still much to improve – such as accuracy in noisy environments or when the speaker has a heavy accent – the researchers remain motivated to keep refining the ASR system.
And while it took Microsoft 25 years to create a highly functional ASR system for conversational speech – long considered one of the hardest problems in speech recognition because it is affected by so many variables – more advanced technology could allow such systems to outperform humans within 5-10 years.
So, unless you’re part of Microsoft’s ASR development team, you had better start polishing your resume and checking out the job market.