Semantic coding

The search engine revolution of the mid-90s gave us the power to sift through pages and pages of information at unheard-of speeds, but there is a growing need for better ways of going through that information. Simple queries now yield thousands of results, so many that the results themselves require a search engine.

Many people have tried to solve this problem, including multiple search engine startups promising new technology, but no one has delivered yet. The solution may lie in several projects that aim to sift through writing and grasp its context, much the way a person would.

IBM’s Watson, famous for its appearance on the game show Jeopardy!, is the best example of this new technology. “The computer system is based on IBM’s DeepQA project, which aims to answer ‘natural-language’ questions in standard English,” writes ScientificAmerican.com. These questions, though easy for humans, are often impossible for computers to parse.

Scrippsnews.com writes: “The IBM computer Watson has been described as comparable to several thousand desktop devices, with capacity to review and evaluate the equivalent of two hundred million pages of information content to find answers to the complex, subtle Jeopardy! questions.”

The real advances behind Watson were “how to interpret cunning clues, create a database that is the equivalent of about a million books and find some way to get the system’s performance to shoot up — it comes up with answers in seconds.” Though IBM is not forthcoming with the details of the Watson system, the researchers envision the system being used to help medical researchers sift through pages of data.

Oren Etzioni of the Turing Center at the University of Washington argues that a search engine that can deal with natural-language queries is necessary for scientists trying to keep up with the mass of knowledge now being generated in their field. With the new language technology being developed, scientists will be able to ask questions such as “What are the top ten genes currently being studied in cancer research?” rather than having to trawl through the literature to find out.

There are a few Silicon Valley startups using similar methods that aim to revolutionize the legal industry by allowing a single lawyer to accomplish the work of hundreds. A New York Times article reports that in 1978 litigation between rival television studios involved “six million documents at a cost of more than $2.2 million, much of it to pay for a platoon of lawyers and paralegals who worked for months at high hourly rates.”

However, recent software advances, notably by the company Blackstone Discovery, “helped analyze 1.5 million documents for less than $100,000.” The new language analyzing software can “extract relevant concepts — like documents relevant to social protest in the Middle East — even in the absence of specific terms, and deduce patterns of behaviour that would have eluded lawyers examining millions of documents.”

The language-analyzing software used in this process, called “E-discovery,” falls into two basic categories, “sociological” and “linguistic,” writes the Times.

The sociological approach “adds an inferential layer of analysis, mimicking the deductive powers of a human Sherlock Holmes. Engineers and linguists at Cataphora, an information-sifting company based in Silicon Valley, have their software mine documents for the activities and interactions of people — who did what when, and who talks to whom. The software seeks to visualize chains of events.” The program is supposed to find the discrepancies created by white-collar criminals. According to the creators, it “doesn’t use key words at all” and can “recognize sentiment in an email message.”
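The “who talks to whom” idea can be illustrated with a very small sketch. This is not Cataphora’s method, which the article does not describe; the message records, names, and both helper functions below are invented for illustration, and a real system would work from actual documents rather than a hand-written list.

```python
from collections import Counter

# Hypothetical message records: (sender, recipient, date). Invented for illustration.
messages = [
    ("alice", "bob", "2011-01-03"),
    ("alice", "bob", "2011-01-04"),
    ("bob", "carol", "2011-01-04"),
    ("alice", "bob", "2011-01-05"),
    ("carol", "alice", "2011-01-06"),
]


def interaction_counts(records):
    """Count how often each sender writes to each recipient."""
    return Counter((sender, recipient) for sender, recipient, _ in records)


def timeline(records, person):
    """List the messages a person sent, in date order, as (date, recipient) pairs."""
    return sorted(
        (date, recipient)
        for sender, recipient, date in records
        if sender == person
    )


counts = interaction_counts(messages)
# The most frequent sender/recipient pair and how many messages it covers.
most_common_pair, n_messages = counts.most_common(1)[0]
alice_timeline = timeline(messages, "alice")
```

Counting pairs shows who talks to whom most often, and the sorted timeline is a crude stand-in for the “chains of events” the quoted passage mentions.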

The other approach, linguistic, is spearheaded by a company called Clearwell, which has created “software that analyzes documents to find concepts rather than specific keywords, shortening the time required to locate relevant material in litigation.”

“Clearwell’s software uses language analysis and a visual way of representing general concepts found in documents to make it possible for a single lawyer to do work that might have once required hundreds.”
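To show the difference between searching for a keyword and searching for a concept, here is a toy sketch. It is not how Clearwell’s product works, which the article does not detail; the `CONCEPTS` vocabulary, the function names, and the sample sentence are all assumptions made up for this example, and a real system would presumably learn related terms from data rather than from a hand-written list.

```python
import re

# Hypothetical concept vocabulary: each concept maps to related terms.
# Hand-written here purely for illustration.
CONCEPTS = {
    "social protest": {"demonstration", "rally", "march", "uprising", "strike"},
}


def tokens(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches_keyword(doc, phrase):
    """Plain keyword search: the exact phrase must appear in the document."""
    return phrase.lower() in doc.lower()


def matches_concept(doc, concept):
    """Concept search: any term related to the concept counts as a hit."""
    return bool(tokens(doc) & CONCEPTS[concept])


doc = "Thousands joined a rally in the capital on Friday."
keyword_hit = matches_keyword(doc, "social protest")
concept_hit = matches_concept(doc, "social protest")
```

The sample sentence never contains the words “social protest,” so the keyword search misses it, while the concept search finds it through the related term “rally.”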

The software has already been tested successfully in legal proceedings. According to the Times: “Clearwell software was used by the law firm DLA Piper to search through a half-million documents under a court-imposed deadline of one week. Clearwell’s software analyzed and sorted 570,000 documents (each document can be many pages) in two days. The law firm used just one more day to identify 3,070 documents that were relevant to the court-ordered discovery motion.”

It may be years before this technology is available to the general public, but once implemented it may be the beginning of a new information revolution. What the technology can’t yet do, however, is “mimic the human ability to use common sense, make leaps of logic or imagine the future,” says Patrick Winston of the Massachusetts Institute of Technology in Cambridge, in an interview with Nature.

But the way things are going, given enough time and processor cycles, it’s almost certain that computers will soon be able to do that too.