Babies born on Earth overwhelmingly tend to learn to speak before they can read or write (we try to avoid absolutes, hence the hedge; there may be exceptions out there). Taking a cue from that, researchers at the Massachusetts Institute of Technology have come up with a speech recognition system that learns language in much the same way humans do: through a combination of audio and visual inputs. MIT's news service has the details:
Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words.
At the Neural Information Processing Systems conference this week, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new approach to training speech-recognition systems that doesn’t depend on transcription. Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings. The system then learns which acoustic features of the recordings correlate with which image characteristics.
“The goal of this work is to try to get the machine to learn language more like the way humans do,” says Jim Glass, a senior research scientist at CSAIL and a co-author on the paper describing the new system. “The current methods that people use to train up speech recognizers are very supervised. You get an utterance, and you’re told what’s said. And you do this for a large body of data.”
Text terms associated with similar clusters of images, such as “storm” and “clouds,” could be inferred to have related meanings. Because the system in some sense learns words’ meanings — the images associated with them — and not just their sounds, it has a wider range of potential applications than a standard speech recognition system.
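The excerpt describes the mechanism only at a high level. The sketch below illustrates the general technique it points at: encode spoken captions and images into a shared vector space, then train the two encoders so that matched audio/image pairs score higher than mismatched ones. The layer sizes, architecture, and loss here are illustrative assumptions for the sake of a runnable example, not the ones used in the CSAIL paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed size of the shared embedding space

class Encoder(nn.Module):
    """Small convolutional net mapping a 2-D input (spectrogram or
    image) to a unit-length vector in the shared embedding space."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, EMBED_DIM)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

audio_encoder = Encoder(in_channels=1)   # e.g. log-mel spectrograms
image_encoder = Encoder(in_channels=3)   # RGB images

def ranking_loss(audio_emb, image_emb, margin=1.0):
    """Hinge loss: each recording's true image (the diagonal of the
    similarity matrix) should outscore every mismatched image by `margin`."""
    scores = audio_emb @ image_emb.t()        # (batch, batch) similarities
    positives = scores.diag().unsqueeze(1)    # matched-pair scores
    off_diag = 1.0 - torch.eye(scores.size(0), device=scores.device)
    violations = F.relu(margin + scores - positives) * off_diag
    return violations.sum() / off_diag.sum()
```

In a setup like this, an utterance containing “storm” and a photo of storm clouds get pulled toward nearby points in the shared space, which is what lets related words and images cluster together without any transcriptions.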
Traditional methods of training speech recognition systems require many costly hours of human transcription work, so only the world's major languages have gotten much attention. Roughly 7,000 languages are spoken around the world, the MIT report notes.
However, the CSAIL researchers' work is in its early stages:
To test their system, the researchers used a database of 1,000 images, each of which had a recording of a free-form verbal description associated with it. They would feed their system one of the recordings and ask it to retrieve the 10 images that best matched it. That set of 10 images would contain the correct one 31 percent of the time.
“I always emphasize that we’re just taking baby steps here and have a long way to go,” Glass says. “But it’s an encouraging start.”
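The retrieval test described above is straightforward to express as a recall-at-10 computation. The sketch below, using NumPy and hypothetical variable names, shows how a figure like the reported 31 percent would be measured from the embeddings a trained model produces.

```python
import numpy as np

def recall_at_k(audio_embs, image_embs, k=10):
    """Fraction of recordings whose true image appears among the top-k
    retrieved images. audio_embs[i] and image_embs[i] are a true pair;
    both arrays are (n, d) and assumed L2-normalized."""
    scores = audio_embs @ image_embs.T             # (n, n) similarity matrix
    # Indices of the k highest-scoring images for each recording.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.arange(len(scores))[:, None]).any(axis=1)
    return hits.mean()

# With 1,000 image/recording pairs, the paper's result corresponds to
# recall_at_k(audio_embs, image_embs, k=10) evaluating to about 0.31.
```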
It's not clear when, or even if, this approach to speech recognition will become commercially viable, but such technologies are poised to have a major impact on enterprise applications, particularly productivity and collaboration.
“The standard keyboard and screen are being challenged by new input and display technologies and devices,” says Constellation Research VP and principal analyst Alan Lepofsky. “Employees are always looking for ways to improve productivity, and the future of work is clearly going to involve voice recognition and augmented reality. If employees can enter information more efficiently, accurately and safely (hands-free, for example), it can lead to significant improvements in the way we get work done.”