Microsoft is claiming a victory on a major battleground in the AI arms race—speech recognition. The company's research arm says it has created a speech recognition system that performs on par with professional human transcriptionists. Here are the key details from Microsoft's announcement:
The researchers reported a word error rate (WER) of 5.9 percent, down from the 6.3 percent WER the team reported just last month.
The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the industry-standard Switchboard speech recognition task.
“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is an historic achievement.”
The milestone means that, for the first time, a computer can recognize the words in a conversation as well as a person would. In doing so, the team has beaten a goal it set less than a year ago, and greatly exceeded everyone else’s expectations as well.
Microsoft researchers used the company's Computational Network Toolkit deep learning system to reach the milestone, using GPUs to run the computations in parallel.
There are some caveats to Microsoft's achievement. First, it remains in the labs and won't show up in the company's consumer and business products immediately. Second, the testing took place under lab conditions, not in a crowded, noisy room or in a car, where conditions can make it harder to understand what someone is saying.
Microsoft is also planning to work on improving voice recognition for group conversations, as well as a broad variety of voice types and accents, the company said.
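For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. A 5.9 percent WER therefore works out to roughly 59 word errors per 1,000 words spoken. The short Python sketch below illustrates that standard definition only. It is not Microsoft's evaluation code, the function name and the sample sentences are made up for illustration, and it skips the text normalization that real benchmark scoring applies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative sketch of the standard WER definition; the name and the
    whitespace tokenization are assumptions, not taken from the announcement.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a nine-word reference gives 2 / 9, about 22 percent.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because the count of errors is divided by the length of the reference, a system that inserts many spurious words can score above 100 percent, which is one reason the figure is best read alongside the human baseline on the same recordings.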
Analysis: From Recognition to Actions
"The reason voice accuracy is becoming more and more important is not just because we are using it for voice dictation, but NLP is becoming a way to perform actions," says Constellation Research VP and principal analyst Alan Lepofsky. That's where Microsoft and other vendors are headed with their voice recognition technologies.
Imagine interacting with an ERP or CRM system via voice command, placing orders and punching information into customer records while walking around the office performing other tasks, and doing so accurately. That last piece is the Holy Grail for Microsoft and rival vendors, as accuracy and speed will be core to user adoption of voice interfaces in applications, Lepofsky says.