Otis just wrote a post in which he cited the Open Relevance Project, an embryonic effort by the Lucene project to build a free, public information retrieval evaluation framework analogous to the TREC conference. Not surprisingly, he sees this as an opportunity for Lucene to prove that it is just as good as the commercial enterprise search engines.
On one hand, I’m delighted to see an attempt to make a TREC-like infrastructure more widely accessible. While the Linguistic Data Consortium and the University of Glasgow may only be charging enough to cover their costs, perhaps there are more efficient ways to manage corpora today. Indeed, other alternatives include publishing torrents and Amazon’s public data sets. If the bottleneck is licensing costs, then perhaps there should be a call to donate data–or to assemble collections from public domain sources.
On the other hand, if the goal of this project is to help companies evaluate competing search offerings, then I think its proponents are chasing the wrong problem. Lest you think I’m biased because of my affiliation with one of those commercial search vendors Otis taunts in his post, I encourage you to check out a post that Jeff Dalton (who is certainly pro-Lucene) wrote a year ago, entitled “Open Source Search Engine Evaluation: Test Collections”. In it, he raises a number of issues that go beyond data availability. One of them is the evaluation of interactive retrieval, an area where even TREC has struggled.
I understand the desire of Lucene advocates to prove that Lucene is just as good as or better than the commercial search engines–it’s not that different from the desire every vendor has to make competitive claims about his or her own technology. To Otis’s credit, he recognizes that relevance isn’t the only criterion worthy of assessment–he also suggests extending the Open Relevance Project to include the non-functional metrics of efficiency and scalability. But he still seems to accept an evaluation framework that would treat search engines as out-of-the-box relevance ranking engines.
I dare say I have a little bit of experience with how companies make decisions about search technology, so let me offer my perspective. Companies build search applications to support specific tasks and information needs. For example, ecommerce sites want to help users find what they are looking for, as well as to target those users with their marketing strategies. Manufacturing companies want to optimize their own part reuse, as well as to make sense of their supply chains. Staffing agencies want to optimize utilization of their consultants and minimize their own costs. Etc.
All of the above rely on search applications to meet their needs. But I don’t think they’d be swayed by a TREC-style relevance bake-off. That’s why companies (and vendors) trumpet success in the form of metrics that reflect task performance (and there are often standard key performance indicators for the various application areas) rather than information retrieval performance. Yes, non-functional requirements like efficiency and scalability matter too–but they presuppose that the functional requirements are met. If an application can’t meet the functional needs, it really doesn’t matter how quickly it processes queries, or how many documents it can index. Moreover, many companies ask for a proof of concept as part of the sales process. Why? Because they recognize that their needs are idiosyncratic, and they are skeptical even of vendors who have built similar solutions in their space. They see success stories and satisfied customers as positive–but not definitive–evidence.
To summarize: the quest to open up TREC may be of great interest to information retrieval researchers, but I’m highly skeptical that it will create a practically useful framework for comparing search technologies. I think it would be more useful to set up public frameworks where applications (both vendor-sponsored and open-source) can compete on how effectively they help users complete information seeking tasks that are representative of practical applications. I’d love to see a framework like Luis von Ahn’s “games with a purpose” used for such an endeavor. I would happily participate in such an effort myself, and I’m pretty sure I could drag my employer into it.