What do you do when you're reading an article or paper that's very important for you to understand, and you get stumped by a particular passage? More often than not, you'll head over to Google—or whatever your favorite search engine is—and surf the Web until you find a satisfactory answer to the puzzle.
Researchers at MIT have developed a machine learning system that behaves much the same way in the course of performing information extraction, the process of creating structured data from unstructured formats such as plain text. Here are the key details from MIT's newsroom:
Most machine-learning systems work by combing through training examples and looking for patterns that correspond to classifications provided by human annotators. For instance, humans might label parts of speech in a set of texts, and the machine-learning system will try to identify patterns that resolve ambiguities — for instance, when “her” is a direct object and when it’s an adjective.
Typically, computer scientists will try to feed their machine-learning systems as much training data as possible. That generally increases the chances that a system will be able to handle difficult problems.
A machine-learning system will generally assign each of its classifications a confidence score, which is a measure of the statistical likelihood that the classification is correct, given the patterns discerned in the training data. With the researchers’ new system, if the confidence score is too low, the system automatically generates a web search query designed to pull up texts likely to contain the data it’s trying to extract.
It then attempts to extract the relevant data from one of the new texts and reconciles the results with those of its initial extraction. If the confidence score remains too low, it moves on to the next text pulled up by the search string, and so on.
The researchers compared their system’s performance to that of several extractors trained using more conventional machine-learning techniques. For every data item extracted in both tasks, the new system outperformed its predecessors, usually by about 10 percent.
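The loop the newsroom describes—extract, check confidence, search, extract again, reconcile—can be sketched in a few lines. The following is purely illustrative, not the researchers' code: the extractor, search function, threshold, and reconciliation rule (keep whichever extraction scores higher) are all hypothetical stand-ins.

```python
def extract_with_search(extract, search, text, threshold=0.8, max_texts=5):
    """Extract a value from text; if the model's confidence is too low,
    fall back to additional texts returned by a search function."""
    value, confidence = extract(text)  # initial extraction attempt
    if confidence >= threshold:
        return value, confidence

    # Consult texts likely to contain the data being extracted.
    for extra_text in search(text)[:max_texts]:
        new_value, new_confidence = extract(extra_text)
        # Reconcile: keep whichever extraction the model is more sure of.
        if new_confidence > confidence:
            value, confidence = new_value, new_confidence
        if confidence >= threshold:
            break  # confident enough; stop searching
    return value, confidence


# Toy stand-ins to exercise the loop: a "model" that is only confident
# when the text contains a four-digit year, and a canned "search".
def toy_extract(text):
    for token in text.split():
        if token.isdigit() and len(token) == 4:
            return token, 0.9
    return None, 0.2

def toy_search(text):
    return ["no answer here", "the event took place in 1998"]

value, conf = extract_with_search(
    toy_extract, toy_search, "the event took place sometime"
)
# The initial extraction fails (low confidence), so the loop falls back
# to the searched texts and recovers "1998" with confidence 0.9.
```

The reconciliation step here is deliberately simplistic; the actual system presumably weighs evidence across sources rather than just taking the single highest-confidence answer.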
MIT's new system won a best-paper award at the recent Conference on Empirical Methods in Natural Language Processing (EMNLP). Its success embodies a long-standing lesson, says Constellation Research VP and principal analyst Doug Henschen, who leads Constellation's research into data-driven decision-making.
"I've seen this human-classification versus computer-based classification problem before, and in both cases the machine won," Henschen says. "This sort of problem came up long ago in the knowledge management era, but companies were trying to build ontologies to improve search results.
"You might think human curators would have it all over computers where language is concerned," Henschen adds. "But the one thing computers can do better than any human is perform consistently. There are no good or bad days and there's no fatigue-induced error. Ultimately you do have to rely on humans to determine what is and isn't in a training set, but I'm not surprised to see higher performance from a system that relies on machine learning to spot correlations in textual information."