Thinking about IR Evaluation

I just read the recent Information Processing & Management special issue on Evaluation of Interactive Information Retrieval Systems. The articles were a worthwhile read, and yet they weren’t exactly what I was looking for. Let me explain.

In fact, let’s start by going back to Cranfield. The Cranfield paradigm offers us a quantitative, repeatable means to evaluate information retrieval systems. Its proponents make a strong case that it is effective and cost-effective. Its critics object that it measures the wrong thing because it neglects the user.

But let’s look a bit harder at the proponents’ case. The primary measure in use today is average precision; indeed, most authors of SIGIR papers validate their proposed approaches by demonstrating increased mean average precision (MAP) over a standard test collection of queries. The dominance of average precision as a measure is no accident: it has been shown to be the best single predictor of the precision-recall graph.
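For concreteness, here is a minimal sketch of how average precision and MAP are typically computed over binary relevance judgments. The function names are illustrative, not from any particular toolkit:

```python
def average_precision(ranked, relevant):
    """ranked: list of doc ids in rank order; relevant: set of relevant ids.
    Averages precision-at-rank over the ranks where relevant docs appear,
    normalized by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, a ranking ["a", "b", "c"] against relevant set {"a", "c"} scores (1/1 + 2/3) / 2, or about 0.83.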

So why are folks like me complaining? There are various user studies asserting that MAP does not predict user performance on search tasks. Those have me at hello, but the studies are controversial in the information retrieval community and, in any case, not constructive.

Instead, consider a paper by Harr Chen and David Karger (both at MIT) entitled "Less is more." Here is a snippet from the abstract:

Traditionally, information retrieval systems aim to maximize the number of relevant documents returned to a user within some window of the top. For that goal, the probability ranking principle, which ranks documents in decreasing order of probability of relevance, is provably optimal. However, there are many scenarios in which that ranking does not optimize for the user’s information need.

Let me rephrase that: the precision-recall graph, which indicates how well a ranked retrieval algorithm does at ranking relevant documents ahead of irrelevant ones, does not necessarily characterize how well a system meets a user’s information need.

One of Chen and Karger’s examples is the case where the user is only interested in retrieving one relevant document. In this case, a system does well to return a diverse set of results that hedges against different possible query interpretations or query processing strategies. The authors also discuss more general scenarios, along with heuristics to address them.
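The intuition behind hedging across query interpretations can be sketched with a toy greedy procedure. This is not Chen and Karger’s actual probabilistic model; it simply assumes we have a small set of hypothetical query interpretations, each with a prior and per-document relevance probabilities, and greedily picks the document that most increases the chance that at least one result is relevant:

```python
def greedy_one_call(interps, docs, n):
    """interps: dict mapping interpretation -> (prior, {doc: p_relevant}).
    Greedily selects n docs to maximize P(at least one result is relevant),
    which naturally diversifies across interpretations."""
    chosen = []
    # Per interpretation: probability that none of the chosen docs is relevant.
    none_rel = {i: 1.0 for i in interps}
    for _ in range(n):
        best_doc, best_gain = None, -1.0
        for d in docs:
            if d in chosen:
                continue
            # Marginal gain in P(at least one relevant) from adding d.
            gain = sum(prior * none_rel[i] * rels.get(d, 0.0)
                       for i, (prior, rels) in interps.items())
            if gain > best_gain:
                best_doc, best_gain = d, gain
        chosen.append(best_doc)
        for i, (prior, rels) in interps.items():
            none_rel[i] *= 1.0 - rels.get(best_doc, 0.0)
    return chosen
```

With an ambiguous query like “jaguar” (car vs. animal interpretations), once a strong car document is chosen, further car documents yield diminishing gains, so the greedy step switches to the animal interpretation.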

But the main contribution of this paper, at least in my eyes, is a philosophical one. The authors consider the diversity of user needs and offer a quantitative, repeatable way to evaluate information retrieval systems with respect to different needs. Granted, they do not even consider the challenge of evaluating interactive information retrieval. But they do set a good example.

Stay tuned for more musings on this theme…

By Daniel Tunkelang

High-Class Consultant.

5 replies on “Thinking about IR Evaluation”

Diversity seems to be an increasingly important metric for many techniques that return a ranked list of results: given that we cannot have perfect personalization and that we cannot figure out exactly what the user wants to see, let’s offer a variety of different results and let the user pick. We may not get all the results right, but we will have something for everyone. It is very close to the idea of faceted search (or the guided summarization that you mentioned in an earlier post). But instead of exposing the diversity of the results using multiple orthogonal browsing components, you try to “embed” diversity in the ranked list. (Or, even better, you expose both the facets and generate diverse results.)

I also like to connect this idea (conceptually) to the idea of minimizing risk in finance: we know that stocks have high expected performance, but they tend to go all up or all down together (high correlation). Therefore you want to mix them with some uncorrelated or anti-correlated investments (commodities, bonds), so that you have slightly lower expected performance but a much lower risk of complete failure.

Perhaps we should start having risk-sensitive evaluations in IR, in the same way that people in finance measure the “value at risk” of investment portfolios?


This concept of diversity in search results, I believe, is especially important in faceted search, as every result item is usually equally weighted with respect to a facet value. So when you make a selection in a facet, every result is equally relevant. So, how do we organise the results?

There was a paper at SIGIR ’07 by one of Gary Marchionini’s students that showed how they had created a matrix of how similar/different each result was to every other result in the list. Sadly, it didn’t show WHY it was different.

But let’s take that WHY for a second. If the results list showed what made each result novel relative to the rest, then could we more accurately choose what was important to show in the per-result summaries? “If you choose this one, you find this information that is not in the rest of the results.” It could affect not only which text snippet to show, but also which other facet values (in other facets) it belongs to that make it unique.

I’d love to see a decent representation of novelty in a results set; I’d be doing it if I had the time/resources. There are some people working on novelty: David Losada’s article in this IP&M issue appears to address this sort of topic, for example. I haven’t read the paper yet, though.


Indeed, I think diversity of results is important and underemphasized by today’s search engines, at least based on what I infer from personal experience. Interfaces that promote query refinement (e.g., faceted search, clustering) may offer a more diverse or risk-mitigating experience. And formal models like MMR or Zhai/Lafferty are certainly aiming for the benefits of diversity.

But Chen and Karger, who restrict themselves to returning a list of results rather than suggesting query refinements, aren’t just talking about diversity. The measure they propose, k-call at n, is binary: it returns 1 if at least k of the top n results are relevant, and 0 otherwise. Hence, at k=1, an algorithm does well to return a diverse set of results, in the hopes that at least one will be relevant. But at k=n, the algorithm does better to return a homogeneous set of results, at least if van Rijsbergen’s cluster hypothesis holds.
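The measure is simple enough to state in a few lines of code. A toy sketch, assuming binary relevance judgments (the name is mine, not from the paper):

```python
def k_call_at_n(ranked, relevant, k, n):
    """Return 1 if at least k of the top n ranked results are relevant,
    0 otherwise. ranked: doc ids in rank order; relevant: set of ids."""
    hits = sum(1 for doc in ranked[:n] if doc in relevant)
    return int(hits >= k)
```

For instance, with relevant set {"b", "d"} and ranking ["a", "b", "c", "d"], 1-call at 3 is 1 (document "b" is in the top 3), while 2-call at 3 is 0.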


Comments are closed.