By a fortuitous coincidence, I had the opportunity to see two consecutive presentations from search engine companies banking on natural language processing (NLP) to power the next generation of search. The first was from Ron Kaplan, Chief Technology and Science Officer of Powerset, who presented at Columbia University. The second was from Christian Hempelmann, Chief Scientific Officer of hakia, who presented at the New York Semantic Web Meetup.
The Powerset talk was entitled “Deep natural language processing for web-scale indexing and retrieval.” Jon Elsas, who attended the same talk earlier this week at CMU, did an excellent job summarizing it on his blog. I’ll simply express my reaction: I don’t get it. I have no reason to doubt that their NLP pipeline is best-in-class, and the team has impressive credentials. But I see no evidence that they have produced better results than keyword search. After participating in their private beta for several months, I’d hoped that the presentation would help me see what I’d missed. I specifically asked Ron what measures they used to evaluate their system, and he was mum. So now I am more unconvinced than ever; to steal a line from a colleague, I cannot reconcile their enthusiasm with their results.
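For readers who haven’t followed the information retrieval literature, the kind of measures I’m referring to are standard relevance metrics. Below is a minimal sketch of one of them, normalized discounted cumulative gain (NDCG), computed over graded human relevance judgments for a ranked result list. This is a generic textbook measure and my own illustration, not anything Powerset disclosed.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: sum graded relevance scores,
    # discounting each result by its position in the ranking.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal ordering, so that a
    # perfect ranking scores 1.0.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance judgments (0-3) for the top five results of a query:
print(round(ndcg([3, 2, 3, 0, 1]), 2))  # 0.97 -- a near-ideal ranking
```

If a search engine’s ranking really is better than keyword search, a measure like this is exactly the sort of evidence one would expect its builders to cite.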
The hakia talk was entitled “Search for Meaning.” Christian started by making the case for a semantic, rather than statistical, approach to NLP. He then presented hakia’s technology in a fair amount of detail, including walking through examples of word sense disambiguation using context. I’m not convinced that semantics trump statistics, but I thoroughly enjoyed the presentation, and was intrigued enough to want to learn more. I find the company refreshingly open about its technology (not to mention that their beta is public), and I hope it works well enough to be practical.
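To give a concrete sense of what disambiguation from context looks like, here is a minimal sketch using the classic Lesk algorithm as implemented in NLTK. This is a textbook baseline of my choosing, not hakia’s approach.

```python
# Word sense disambiguation from context via the classic Lesk algorithm.
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.wsd import lesk

# The same ambiguous word, "bank", in two different contexts.
contexts = [
    "I went to the bank to deposit my paycheck".split(),
    "The fisherman sat on the bank of the river".split(),
]

for context in contexts:
    # lesk() picks the WordNet sense whose dictionary gloss overlaps
    # most with the words surrounding the ambiguous term.
    sense = lesk(context, "bank")
    print(" ".join(context))
    print("  ->", sense.name(), "-", sense.definition())
```

Even this toy baseline shows why the problem is hard: gloss overlap frequently picks a plausible but wrong sense, which is exactly where richer semantic resources or statistical context models are supposed to help.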
Still, I’m not convinced that NLP is either the right answer or the right question. I’m no expert on the history of language, but it’s clear that natural languages are hardly an optimal means of communication, even among human beings. Rather, they are artifacts of our satisficing and resisting change. Since we are lucky enough not to have developed expectations that people can communicate with computers using natural language (HAL and Star Trek notwithstanding), why take a step backwards now? Rather than advocating for inefficient, unreliable communication mechanisms like natural language, we should be figuring out ways to make communication more efficient.
To use an analogy, there’s a reason that programming languages have strict rules, and that compilers output errors rather than just trying to guess what you mean. The mild inconvenience upstream is a small cost compared to the downstream benefits of unambiguous communication. I’m not suggesting that people start speaking in formal languages. But I do feel we should strive for a dialog-oriented approach where both the human and the computer have confidence in their mutual understanding. I can’t resist a plug for HCIR.
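To make the compiler analogy concrete: faced with an ambiguous request, a strict language raises an error instead of guessing. Even Python, hardly the strictest language around, follows this principle.

```python
# "Error out rather than guess": given an ambiguous operation, Python
# raises an exception instead of silently choosing between string
# concatenation and arithmetic.
try:
    result = "2" + 2
except TypeError as err:
    print(f"TypeError: {err}")  # can only concatenate str (not "int") to str
```

The upstream inconvenience of an explicit error buys the downstream benefit of never acting on a misinterpretation, which is precisely the trade-off I’d like search interfaces to take more seriously.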