At SIGIR 2004, Ellen Voorhees presented a paper entitled “Measuring Ineffectiveness” in which she asserted:
Using average values of traditional evaluation measures [for information retrieval systems] is not an appropriate methodology because it emphasizes effective topics: poorly performing topics’ scores are by definition small, and they are therefore difficult to distinguish from the noise inherent in retrieval evaluation.
Ellen is one of the world’s top researchers in the field of information retrieval evaluation. And for those not familiar with TREC terminology, “topics” are the queries used to evaluate information retrieval systems. So what she’s saying above is that, in order to evaluate systems effectively, we need to focus more on failures than on successes.
Specifically, she proposed that we judge information retrieval system performance by measuring the percentage of topics (i.e., queries) with no relevant results in the top 10 retrieved (%no), a measure that was then adopted by the TREC robust retrieval track.
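To make the definition concrete, here is a minimal sketch of how %no could be computed from ranked results and relevance judgments. The function name `percent_no`, the dictionary layout, and the toy data are assumptions made for illustration; they are not taken from the paper or from TREC tooling.

```python
from typing import Mapping, Sequence, Set


def percent_no(rankings: Mapping[str, Sequence[str]],
               relevant: Mapping[str, Set[str]],
               k: int = 10) -> float:
    """Percentage of topics with no relevant document in the top k results."""
    if not rankings:
        raise ValueError("need at least one topic")
    failures = 0
    for topic, ranked_docs in rankings.items():
        judged_relevant = relevant.get(topic, set())
        # A topic counts as a failure if none of its top-k documents is relevant.
        if not any(doc in judged_relevant for doc in ranked_docs[:k]):
            failures += 1
    return 100.0 * failures / len(rankings)


# Hypothetical toy data: four topics, two of which have nothing relevant near the top.
rankings = {
    "t1": ["d1", "d2", "d3"],
    "t2": ["d4", "d5"],
    "t3": ["d6"],
    "t4": ["d7", "d8"],
}
relevant = {"t1": {"d2"}, "t2": {"d9"}, "t3": {"d6"}, "t4": set()}

print(percent_no(rankings, relevant))  # 50.0
```

Unlike an average of per-topic scores, this number moves only when a topic crosses from total failure to at least one relevant result, which is the behavior the measure is meant to capture.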
Information Retrieval in the Wild
Information retrieval (aka search) in the wild is a bit different from information retrieval in the lab. We don’t have a gold standard of human relevance judgments against which we can compare search engine results. And even if we can assemble a representative collection of test queries, it isn’t economically feasible to assemble this gold standard for a large document corpus where each query can have thousands, even millions, of relevant results.
Moreover, the massive growth of the internet and the advent of social networks have changed the landscape of information retrieval. The idea that the relationship between a document and a search query would be sufficient to determine relevance was always a crude approximation, but now the diversity of a global user base makes this approximation even cruder.
For example, consider this query on Google for [nlp]:
Hopefully Google’s hundreds of ranking factors — and all of you — know me well enough to know that, when I say NLP, I’m probably referring to natural language processing rather than neuro-linguistic programming. Still, it’s an understandable mistake — the latter NLP sells a lot more books.
And search in the context of a social network makes the user’s identity and task context key factors for determining relevance, factors that are uniquely available to each user. For example, if I search on LinkedIn for [peter kim], the search engine cannot know for certain whether I’m looking for my former co-worker, a celebrity I’m connected to, a current co-worker who is a 2nd-degree connection, or someone else entirely.
In short, we cannot rely on human relevance judgments to determine if we are delivering users the most relevant results.
From %no to WTF! @ k
But human judgments can still provide enormous value for evaluating search engine and recommender system performance. Even if we can’t use them to distinguish the most relevant results, we can identify situations where we are delivering glaringly irrelevant results. Situations where the user’s natural reaction is “WTF!”.
People understand that search engines and recommender systems aren’t mind readers. We humans recognize that computers make mistakes, much as other people do. To err, after all, is human.
What we don’t forgive — especially from computers — are seemingly inexplicable mistakes that any reasonable person would be able to recognize.
I’m not going to single out any sites to provide examples. I’m sure you are familiar with the experience of a search engine or recommender system returning a result that makes you want to scream “WTF!”. I may even bear some responsibility, in which case I apologize. Besides, everyone is entitled to the occasional mistake.
But I’m hard-pressed to come up with a better measure to optimize (i.e., minimize) than WTF! @ k — that is, the number of top-k results that elicit a WTF! reaction. The value of k depends on the application. For a search engine, k = 10 could correspond to the first page of results. For a recommender system, k is probably smaller, e.g., 3.
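As a rough sketch of the definition above, WTF! @ k for a single result list is just a count over the top k positions. The function name `wtf_at_k` and the idea of representing the flagged results as a set are assumptions made for illustration.

```python
from typing import Sequence, Set


def wtf_at_k(ranked_results: Sequence[str], wtf_results: Set[str], k: int) -> int:
    """Number of the top-k results that were flagged as WTF! results."""
    return sum(1 for result in ranked_results[:k] if result in wtf_results)


# Hypothetical example: five results, two of them flagged as glaringly irrelevant.
results = ["r1", "r2", "r3", "r4", "r5"]
flagged = {"r2", "r5"}

print(wtf_at_k(results, flagged, k=3))   # 1 (only r2 is in the top 3)
print(wtf_at_k(results, flagged, k=10))  # 2 (k larger than the list counts everything)
```

To compare systems, you would presumably aggregate this count over a sample of queries, for example by averaging it; that aggregation step is my assumption rather than part of the definition.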
Also, the system can substantially mitigate the risk of WTF! results by providing explanations for results and making the information-seeking process more of a conversation with the user.
Measuring WTF! @ k
Hopefully you agree that we should strive to minimize WTF! @ k. But, as Lord Kelvin tells us, if you can’t measure it, then you can’t improve it. How do we measure WTF! @ k?
On one hand, we cannot rely on click behavior to measure it implicitly. All non-clicks look the same, and we can’t tell which ones were WTF! results. In fact, egregiously irrelevant results may inspire clicks out of sheer curiosity. One of the phenomena that search engines watch out for is an unusually high click-through rate — those clicks often signal something other than relevance, like a racy or offensive result.
On the other hand, we can measure WTF! @ k with human judgments. A rater does not need to have the personal and task context of a user to evaluate whether a result is at least plausibly relevant. WTF! @ k is thus a measure that is amenable to crowdsourcing, a technique that both Google and Bing use to improve search quality. As does LinkedIn, and we are hiring a program manager for crowdsourcing.
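Here is one way crowdsourced judgments might be turned into a WTF! @ k score. This is only a sketch: the majority-vote rule, the 0.5 threshold, and the names `is_wtf` and `wtf_at_k_from_votes` are assumptions made for illustration, not a description of how any particular search engine does it.

```python
from typing import Mapping, Sequence


def is_wtf(votes: Sequence[bool], threshold: float = 0.5) -> bool:
    """True if more than `threshold` of the raters flagged the result as WTF!."""
    if not votes:
        return False  # assumption: unjudged results are not counted as WTF!
    return sum(votes) / len(votes) > threshold


def wtf_at_k_from_votes(ranked_results: Sequence[str],
                        votes_by_result: Mapping[str, Sequence[bool]],
                        k: int,
                        threshold: float = 0.5) -> int:
    """Count the top-k results that the raters collectively flagged as WTF!."""
    return sum(
        1 for result in ranked_results[:k]
        if is_wtf(votes_by_result.get(result, ()), threshold)
    )


# Hypothetical rater votes: True means "this result is glaringly irrelevant".
ranked = ["a", "b", "c", "d"]
votes = {
    "a": [False, False, True],
    "b": [True, True, False],
    "c": [True, True, True],
    "d": [True, False],
}

print(wtf_at_k_from_votes(ranked, votes, k=3))  # 2 (b and c are flagged)
```

Requiring agreement among several raters is meant to keep one rater's idiosyncratic reaction from counting as a WTF!, since the measure targets mistakes that any reasonable person would recognize.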
As information retrieval systems become increasingly personalized and task-centric, I hope we will see more people using measures like WTF! @ k to evaluate their performance, as well as working to make results more explainable. After all, no one likes hurting their computer’s feelings by screaming WTF! at it.