WTF! @ k: Measuring Ineffectiveness

At SIGIR 2004, Ellen Voorhees presented a paper entitled “Measuring Ineffectiveness” in which she asserted:

Using average values of traditional evaluation measures [for information retrieval systems] is not an appropriate methodology because it emphasizes effective topics: poorly performing topics’ scores are by definition small, and they are therefore difficult to distinguish from the noise inherent in retrieval evaluation.

Ellen is one of the world’s top researchers in the field of information retrieval evaluation. And for those not familiar with TREC terminology, “topics” are the queries used to evaluate information retrieval systems. So what she’s saying above is that, in order to evaluate systems effectively, we need to focus more on failures than on successes.

Specifically, she proposed that we judge information retrieval system performance by measuring the percentage of topics (i.e., queries) with no relevant results in the top 10 retrieved (%no), a measure that was then adopted by the TREC robust retrieval track.
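As a minimal sketch of %no (the function and variable names here are mine, not TREC’s), the measure is just the fraction of topics that come up empty in the top 10:

```python
def pct_no(relevant_in_top10):
    """Percentage of topics with zero relevant results in the top 10 retrieved."""
    misses = sum(1 for n in relevant_in_top10 if n == 0)
    return 100.0 * misses / len(relevant_in_top10)

# Four topics; the 2nd and 4th retrieved nothing relevant in the top 10.
print(pct_no([3, 0, 1, 0]))  # 50.0
```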

Information Retrieval in the Wild

Information retrieval (aka search) in the wild is a bit different from information retrieval in the lab. We don’t have a gold standard of human relevance judgments against which we can compare search engine results. And even if we can assemble a representative collection of test queries, it isn’t economically feasible to assemble this gold standard for a large document corpus where each query can have thousands — even millions — of relevant results.

Moreover, the massive growth of the internet and the advent of social networks have changed the landscape of information retrieval. The idea that the relationship between a document and a search query would be sufficient to determine relevance was always a crude approximation, but now the diversity of a global user base makes this approximation even cruder.

For example, consider this query on Google for [nlp].

Hopefully Google’s hundreds of ranking factors — and all of you — know me well enough to know that, when I say NLP, I’m probably referring to natural language processing rather than neuro-linguistic programming. Still, it’s an understandable mistake — the latter NLP sells a lot more books.

And search in the context of a social network makes the user’s identity and task context key factors for determining relevance — factors that are uniquely available to each user. For example, if I search on LinkedIn for [peter kim], the search engine cannot know for certain whether I’m looking for my former co-worker, a celebrity I’m connected to, a current co-worker who is a 2nd-degree connection, or someone else entirely.

In short, we cannot rely on human relevance judgments to determine if we are delivering users the most relevant results.

From %no to WTF! @ k

But human judgments can still provide enormous value for evaluating search engine and recommender system performance. Even if we can’t use them to distinguish the most relevant results, we can identify situations where we are delivering glaringly irrelevant results. Situations where the user’s natural reaction is “WTF!”.

People understand that search engines and recommender systems aren’t mind readers. We humans recognize that computers make mistakes, much as other people do. To err, after all, is human.

What we don’t forgive — especially from computers — are seemingly inexplicable mistakes that any reasonable person would be able to recognize.

I’m not going to single out any sites to provide examples. I’m sure you are familiar with the experience of a search engine or recommender system returning a result that makes you want to scream “WTF!”. I may even bear some responsibility, in which case I apologize. Besides, everyone is entitled to the occasional mistake.

But I’m hard-pressed to come up with a better measure to optimize (i.e., minimize) than WTF! @ k — that is, the number of top-k results that elicit a WTF! reaction. The value of k depends on the application. For a search engine, k = 10 could correspond to the first page of results. For a recommender system, k is probably smaller, e.g., 3.
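As a minimal sketch (the labels and names below are illustrative assumptions, not a standard implementation), WTF! @ k is just a count over the top k results:

```python
def wtf_at_k(wtf_labels, k):
    """Count of top-k results labeled as glaring, WTF!-inducing misses."""
    return sum(wtf_labels[:k])

# A results page where the 2nd and 7th results are glaring misses:
labels = [False, True, False, False, False, False, True, False, False, False]
print(wtf_at_k(labels, 10))  # 2  (search engine, k = 10)
print(wtf_at_k(labels, 3))   # 1  (recommender, k = 3)
```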

Also, a system can substantially mitigate the risk of WTF! results by providing explanations for results and by making the information-seeking process more of a conversation with the user.

Measuring WTF! @ k

Hopefully you agree that we should strive to minimize WTF! @ k. But, as Lord Kelvin tells us, if you can’t measure it, then you can’t improve it. How do we measure WTF! @ k?

On one hand, we cannot rely on click behavior to measure it implicitly. All non-clicks look the same, and we can’t tell which ones were WTF! results. In fact, egregiously irrelevant results may inspire clicks out of sheer curiosity. One of the phenomena that search engines watch out for is an unusually high click-through rate — those clicks often signal something other than relevance, like a racy or offensive result.

On the other hand, we can measure WTF! @ k with human judgments. A rater does not need to have the personal and task context of a user to evaluate whether a result is at least plausibly relevant. WTF! @ k is thus a measure that is amenable to crowdsourcing, a technique that both Google and Bing use to improve search quality. As does LinkedIn, and we are hiring a program manager for crowdsourcing.
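One hypothetical way to turn such crowdsourced judgments into a score (the names and the majority-vote rule below are my assumptions, not any search engine’s actual pipeline) is to have several raters flag each of the top k results and count a result as WTF! when a majority flags it:

```python
def is_wtf(rater_flags):
    """Majority vote over independent raters' WTF! flags for one result."""
    return sum(rater_flags) > len(rater_flags) / 2

def wtf_at_k_from_votes(votes_per_result, k):
    """votes_per_result: one list of rater flags per ranked result."""
    return sum(is_wtf(flags) for flags in votes_per_result[:k])

votes = [
    [False, False, True],   # result 1: only 1 of 3 raters flags it
    [True, True, False],    # result 2: majority flags it as WTF!
    [False, False, False],  # result 3: clean
]
print(wtf_at_k_from_votes(votes, 3))  # 1
```

Majority voting is just one simple aggregation choice; weighting raters by historical agreement is a common refinement.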


As information retrieval systems become increasingly personalized and task-centric, I hope we will see more people using measures like WTF! @ k to evaluate their performance, as well as working to make results more explainable. After all, no one likes hurting their computer’s feelings by screaming WTF! at it.


By Daniel Tunkelang

High-Class Consultant.

19 replies on “WTF! @ k: Measuring Ineffectiveness”

I like the measure! Not sure if it is so easy to crowdsource that task though in terms of avoiding some sort of context.
For example, if I use “SIGIR” as an image search query on one engine I get SIGIR pics (as I would expect), but on another the top 20 results are cows and sausages. To me this is a WTF, but to someone else who knows that sausage company it is the expected outcome.
Unless the crowdsourcing task includes some sort of explanation of how the search system came up with the top ranked results or a user profile of the worker, it will still be hard to determine if these results are valid or totally insane from a human’s point of view.



Glad you like it! And you’re right that there are limits to crowdsourcing, particularly when it comes to language and geographical context. Still, the context necessary to make a WTF! judgment is a lot more transferable than the context necessary to assess relevance.

In any case, explanations are a good idea — not just for evaluators but for all users.


Great article.

But please do consider that Information Retrieval is *not* just search. There is also browsing/navigation and faceted methods (as some kind of hybrid).

I’ve read many IR studies showing that browsing/navigation is actually the most important IR method in many respects.


Karl, thanks! As for distinguishing between information retrieval and “traditional” search, you’re preaching to the converted. I co-organize the HCIR Symposium and wrote a book on faceted search.

The idea of WTF! @ k applies generally. For example, the choice of facets or facet values to offer as refinement options is an important one in establishing an effective conversation with the user. In fact, one of the challenges in these interfaces is that the faceted refinement options only consider the set of documents that match the query, even when most of those documents are completely irrelevant to the user’s information need.

I discuss some of these issues in my book — you can read a free chapter online.


Hi Daniel!

Don’t worry, I know you, have implemented TunkRank myself, and your Faceted Search book is within arm’s reach 🙂

But the article above contains “Information retrieval (aka search)”, which is not quite true (IMHO), and you have linked the Wikipedia article, which also lacks navigation/browsing/facets.

Readers of your article are left to assume that IR is search and nothing else.

I totally agree with the rest of the article, and I also see many evaluation issues in IR and PIM myself! These days, I am writing my PhD on PIM, and there are awfully many ways to screw up evaluation :-O


Point taken, and thanks for raising the distinction.

Both terms, “information retrieval” and “search”, have pretty fuzzy scope. Depending on whom you ask, you get a different answer as to whether one is subsumed by the other — and even which is the more general term!

My preferred term is “information-seeking support system”. But that’s not exactly mellifluous. 🙂


Hi Daniel,

I wonder if the reason I find myself so interested in this posting is the level of frustration I’ve encountered in recent, unsuccessful searches ;-)

I have not (yet) studied IR, so my comments are coming more from the ‘common sense’ aspect of my thinking vs. domain expertise.

Just a couple of quick comments:

1) I’ve noticed that for your second example [peter kim] using the LinkedIn People Search system, if I enter values for two different attributes/features of the search target, say, ‘’ and ‘user.employer’ (for example, [peter kim linkedin]), the LinkedIn search engine frequently does a good job of sifting through all of the name collisions and provides the correct person.

I’ve also noticed that other social networking providers don’t seem to do this as well, possibly because many of their users do not fill in the employer field of their profile.

This makes me wonder if special purpose search engines (such as LI People Search) ever intentionally add in human SME-motivated rules into their search algorithm logic or if they solely apply machine algorithms.

One example of a hybrid (human SME + algorithmic) approach: In the LinkedIn People Search system, if any of the tokens in the search box match a company name, then some sort of query equivalent to the below could be performed:

SELECT user
WHERE LIKE ${non-company tokens} AND
( user.currentEmployer LIKE ${company name token} OR
  user.formerEmployers LIKE ${company name token} )

2) You may have addressed this question in your discussion above, but I wasn’t sure: Do search engines ever monitor and react to the user’s UX actions after the first page of query hits/results have been displayed? (i.e., a UX event feedback loop into the search engine results sifting logic)

It seems to me that if the user’s very first click upon viewing the first page of search results is to immediately jump to page 2 of the search results, this may be a big clue to the search engine that it did a poor job (bad assumptions). For instance, it may have returned results from the wrong domain.

It seems plausible to me that, under the hood, search engines likely have multiple ranking metrics and could, upon learning that the first results were irrelevant, switch to the next most likely ranking metric applicable given the input search words. For example, they might increase the weighting of synonyms, phonetic equivalence, or even geographic proximity, etc. vs. mainstream word-match-based approaches.
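A toy sketch of this adaptive idea (all names and strategies here are illustrative assumptions, not how any real engine works): keep an ordered list of ranking strategies and fall back to an alternate one when the user skips the first page without clicking.

```python
def rank_by_word_match(query, docs):
    """Rank by count of words shared with the query (mainstream word match)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))

def rank_by_recency(query, docs):
    """Stand-in alternate metric: assume docs already arrive newest-first."""
    return list(docs)

def rerank(query, docs, skipped_first_page):
    # Switch to the next most likely ranking metric after a skipped page.
    strategy = rank_by_recency if skipped_first_page else rank_by_word_match
    return strategy(query, docs)
```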

Has something like this already been attempted? Might this ‘adaptive’ sort of approach be helpful for certain categories of searches?

Thank you for sharing an interesting topic.


I have been trying to do basically what you suggest by going through and rerunning user queries on this new search service (similar to Google Scholar) for the academic library we just stealth-launched.

I do what you call WTF!@5, because we have data showing that the majority of people don’t click anything beyond the top 5.

Identifying what you call WTF! is a bit difficult. Sometimes queries are clearly about looking for a known book or article, and for those it’s easy to detect passes and fails; I don’t really need a WTF! criterion.

Topic searches on this new search service, equivalent to your NLP search, are where I apply your WTF! idea.

Sometimes it’s easy; other times the term is so technical that I have to do a Google search to figure out what the term was supposed to be about.

I am a bit surprised that well-tuned commercial systems like Google can benefit from WTF!-type criteria.


Add on:

According to Cutrell 2006 [1], whether or not users are willing to scroll further down in result sets might depend on the type of data:

“In Web search, the top 3 positions account for more than 70% of all items opened [10]. As we expected, this is less true for personal search using Phlat. In Phlat, only 30% of all invoked items are in the top 3 positions, and there is an extremely long tail (see Figure 5a). So for example, if a user is looking for a very old item, he has a good idea as to relative position of the target in the current result set. Rather than re-query to bring the item closer to the top (and run the risk of over-specifying his query), he can quickly scroll to the end of the result set to find the information.”

I found this notion very interesting.

[1] “Fast, Flexible Filtering with Phlat” (CHI 2006)


Richard, people search at LinkedIn does take advantage of some domain knowledge — for example, we know what kinds of queries people make, and we are usually able to parse queries to determine whether a token corresponds to a first/last name, company name, etc. But we do rely heavily on machine learning. Check out a recent presentation on “Related Searches at LinkedIn” by Mitul Tiwari, Azarias Reda, Yubin Park, Christian Posse, and Sam Shah. As for the more adaptive approach you suggest, it sounds a bit like dynamic ranked retrieval.

Aaron, the reason that even Google benefits from a WTF! approach is that, for many queries — especially non-navigational queries — no answer is perfect but some answers are flat-out wrong. As Matt Cutts describes in the linked post, Google uses human raters to supplement what it learns from click behavior.

Karl, thanks for the link. I agree that PIM has unique characteristics that distinguish it from most other information retrieval scenarios. But I do find myself saying WTF! on occasion when I use Spotlight and Outlook search, so perhaps the concept still applies.

Chris, I prefer the definition of cuil as a verb: “to look for something, but to never find it”.


Thank you for your feedback on my comments, Daniel.

I looked at the two links you provided. The Metaphor work is impressive.

Dynamic ranked retrieval is close to what I had in mind.

I suppose what I really had in mind was that a search engine could maintain some sort of backing taxonomy of concepts and predicates, especially search engines working in constrained domains where construction of a taxonomy is more practical.

Then, as a user’s UI event stream is fed back into a search engine possessing a backing taxonomy, the search engine could iteratively refine its estimates (and rankings) of which nodes in the taxonomy the user is targeting, using some sort of back-tracking algorithm.

Oh, and on another related subject (you mentioned Machine Learning above), I had some fun a couple of days ago answering a LinkedIn-posed question (LinkedIn Connections: When to Accept or Reject). Instead of using words, I decided to respond to the question with a graphical decision tree.

This, in turn, made me wonder if LinkedIn is using ML decision tree inference techniques to answer questions like the one posed.


How about extending WTF!@k from ‘what the f*’ to also include ‘where the f*’? Some queries have such an obvious item to retrieve that having it not appear in the top k is a glaring error. These may be easier to identify using click behaviour.


Not trying to call you out on scooping. Just pointing to something you might not have seen. I’ve been quoting Lamere’s points on this blog for years now… good to know he’s found a nice home at RecSys 🙂

I think there is something about music information retrieval that gets people to think in more HCIR, set-oriented, non-McDonalds-ified, non-NDCG@3 ways. I think I think the way I do, and Paul thinks the way he does, because the field as a whole, the domain as a domain, only works if you think about things that way. It’s very non-web-ian.



That’s ok, you can call me out. 🙂

Anyway, I agree that music IR seems to inspire different values than web IR. Maybe because the utility is more experiential and less transactional. Or maybe it’s just a hipper community.


Comments are closed.