Precision and Recall

This month’s issue of IEEE Computer is a special issue featuring information seeking support systems, edited by Gary Marchionini and Ryen White. You can read their introduction for free here; unfortunately, the articles, while available online, are only free for IEEE Xplore subscribers.

What I can share is a 500-word sidebar I wrote that appears on p. 39, in an article by Peter Pirolli entitled “Powers of 10: Modeling Complex Information-Seeking Systems at Multiple Scales”.

Precision and Recall

Information retrieval (IR) research today emphasizes precision at the expense of recall. Precision is the number of relevant documents retrieved divided by the total number of documents retrieved, while recall is the number of relevant documents retrieved divided by the total number of relevant documents in the collection.
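The set-based definitions above translate directly into code. Here is a minimal sketch; the document IDs and relevance judgments are made up purely for illustration:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}  # hypothetical search results
relevant = {"d2", "d4", "d5"}         # hypothetical ground truth

print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant were retrieved -> ~0.667
```

Note that a system can trivially achieve perfect recall by retrieving everything, which is exactly why the two measures only make sense together.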

These measures were originally intended for set retrieval, but most current research assumes a ranked retrieval model, in which the search returns results in order of their estimated likelihood of relevance to a search query. Popular measures like mean average precision (MAP) and normalized discounted cumulative gain (NDCG) [1] mostly reflect precision for the highest-ranked results.
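To make the ranked measures concrete, here is a rough sketch of average precision (the per-query quantity that MAP averages) and of NDCG following Järvelin and Kekäläinen's cumulated-gain formulation [1]. This is an illustrative simplification, not a reference implementation, and the example inputs are invented:

```python
import math

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks where a relevant document appears.
    Rewards systems that place relevant documents near the top."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def ndcg(gains):
    """Discounted cumulative gain of a ranked list of graded relevance
    scores, normalized by the DCG of the ideal (sorted) ordering."""
    def dcg(gs):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gs, start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Hypothetical ranking against the same invented ground truth as above.
print(average_precision(["d2", "d1", "d4"], {"d2", "d4", "d5"}))
print(ndcg([3, 0, 1]))  # graded relevance scores in rank order
```

The logarithmic discount in NDCG makes contributions from low-ranked documents vanishingly small, which is the formal sense in which these measures "mostly reflect precision for the highest-ranked results."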

For the most difficult and valuable information-seeking problems, however, recall is at least as important as precision. In particular, for tasks that involve exploration or progressive elaboration of the user’s needs, a user’s progress depends on understanding the breadth and organization of available content related to those needs. Techniques designed for interactive retrieval, particularly those that support iterative query refinement, rely on communicating to the user the properties of large sets of documents and thus benefit from a retrieval approach with a high degree of recall [2].

The extreme case for the importance of recall is the problem of information availability, where the seeker faces uncertainty as to whether the information of interest is available at all. Instances of this problem include some of the highest-value information tasks, such as those facing national security and legal/patent professionals, who might spend hours or days searching to determine whether the desired information exists.

The IR community would do well to develop benchmarks for systems that consider recall at least as important as precision. Perhaps researchers should revive the set retrieval models and measures such as the F1 score, which is the harmonic mean of precision and recall.
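The F1 score mentioned above is simple to state in code. A small sketch, with invented input values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.5))  # equal inputs: harmonic mean equals them -> 0.5
print(f1(0.9, 0.1))  # harmonic mean punishes imbalance -> 0.18
```

Because the harmonic mean is dragged down by its smaller argument, a system cannot score well on F1 by excelling at precision while neglecting recall, which is precisely the property a recall-sensitive benchmark would want.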

Meanwhile, information scientists could use information availability problems as realistic tests for user studies of exploratory search systems, or interactive retrieval approaches in general. The effectiveness of such systems would be measured in terms of the correctness of the outcome (does the user correctly conclude whether the information of interest is available?); user confidence in the outcome, which admittedly may be hard to quantify; and efficiency—the user’s time or labor expenditure.

Precision will always be an important performance measure, particularly for tasks like known-item search and navigational search. For more challenging information-seeking tasks, however, recall is at least as important, and it is critical that the evaluation of information-seeking support systems take recall into account.


  1. K. Järvelin and J. Kekäläinen, “Cumulated Gain-Based Evaluation of IR Techniques,” ACM Trans. Information Systems, Oct. 2002, pp. 422-446.
  2. R. Rao et al., “Rich Interaction in the Digital Library,” Comm. ACM, Apr. 1995, pp. 29-39.

By Daniel Tunkelang

High-Class Consultant.

7 replies on “Precision and Recall”

I find it refreshing to see a reminder that precision is not the only factor that matters in search from someone other than me. Indeed, most of the “discussions” I now have regarding retrieval evaluation involve the claim that MAP is too *recall*-oriented to be a useful measure.

One of the themes of the TREC legal track, which focuses on supporting legal discovery, has been fair evaluation of recall-oriented searches in massive collections. It is a thorny problem that remains mostly unsolved for the general case where you would like reusable test collections as a result.



Ellen, I’ve been beating that drum in the online world for at least 5 or 6 years now, and longer in the offline world: starting about 11 years ago, I was doing music information retrieval, which is a perfect example of a vertical in which recall is the more important metric.

Maybe it’s just that our voices are too small, too far out in the wilderness of the web landscape? But we do exist 🙂


Daniel, I use F1 often. I am not sure people ignore the trade-offs; they just don’t report them in strict IR papers, because they are often selling an algorithm that optimizes one side. Look at ML research, where those trade-offs are considered and the balance between the competing goals is articulated much more clearly. Keep up the good posts…


Abdur, that’s a point I hadn’t considered. I suppose I shouldn’t make a blanket assumption that everyone is teaching to the test, so to speak. Still, my experience in the non-academic world suggests a need to invest more effort on the recall side, particularly for applications that offer any kind of summarization of result sets.

In any case, thanks for the encouragement. I’m looking forward to seeing where you and your team go with Twitter search.

