
In Defense Of Recall

It’s not every day that you see an essay in SIGIR Forum entitled “Against Recall”. Well, to be fair, the full title is “Against Recall: Is it Persistence, Cardinality, Density, Coverage, or Totality?” In it, Justin Zobel, Alistair Moffat, and Laurence Park, all researchers at the University of Melbourne, conclude that “the use of recall as a measure of the effectiveness of ranked querying is indefensible.”

It’s a well-written and well-argued essay, and I think the authors at least have it half-right. I agree with their claim that, while precision dominates quantitative analysis of search effectiveness in the research literature, the expressed concerns about recall tend to be more qualitative. Part of the problem, as they note, is that recall is much harder to evaluate than precision (assuming the Cranfield perspective that the relevance of a document to a query is objective).

The authors propose a variety of alternative measures that, in their view, are more useful than recall and capture what authors really mean when they allude to recall. The most interesting of these, in my view, is what they call “totality”. Indeed, I thought the authors were addressing me personally when they wrote:

It is usual for certain “high recall applications” to be cited to rebut suggestions that recall is of little importance. Examples that are routinely given include searching for precedents in legal cases; searching for medical research papers with results that relate to a particular question arising in clinical practice; and searching to recover a set of previously observed documents.

Yup, I’m listening. They continue:

While we agree that these are plausible search tasks, we dispute that they are ones in which recall provides an appropriate measurement scale. We argue that what distinguishes these scenarios is that the retrieval requirement is binary: the user seeks total recall, and will be (quite probably) equally dissatisfied by any approach that leaves any documents unfound. In such a situation, obtaining (say) 90% recall rather than a mere 80% recall is of no comfort to the searcher, since it is the unretrieved documents that are of greatest concern to them, rather than the retrieved ones.

Whoa there, that’s quite a leap! Like total precision, total recall is certainly an aspiration (and a great Arnie flick), but not a requirement. There are lots of information retrieval applications where false negatives matter a lot more to us than false positives–notably in medicine, intelligence, and law. But often what is binary for us is not whether we find all of the “relevant” documents for each individual query–and here I use the scare quotes to assert the subjectivity and malleability of relevance–but rather whether or not we ultimately resolve our overall information need.

Let me use a concrete example from my own personal experience. When my wife was pregnant, she had gestational diabetes. She treated it through diet, and up through week 36 or so things were fine (modulo the trauma of a Halloween without candy). And then one of her doctors made an off-hand allusion to the risk of shoulder dystocia. She came home and told me this, and of course we spent the next several hours online trying to learn more. We had a very specific question: should we opt for a Cesarean section?

I can tell you that no search engine I used was particularly helpful in making this decision. I was hoping there might be analysis out there comparing the risks of shoulder dystocia with the risks associated with a Cesarean, particularly for women who have gestational diabetes. I couldn’t find any. But worse, I had no idea if there was helpful information out there, and I had no idea when to stop looking. Ultimately we took our chances, and everything turned out great–no shoulder dystocia, no Cesarean, and a beautiful, healthy baby and mother. But it would have been nice to feel that our decision was informed, rather than a nerve-wracking coin toss.

Let’s abstract from this concrete example and consider what I characterize as the information availability problem, where the information seeker faces uncertainty as to whether the information of interest is available at all. The natural evaluation measures associated with information availability are the correctness of the outcome (does the user correctly conclude whether the information of interest is available?); efficiency, i.e., the user’s time or labor expenditure; and the user’s confidence in the outcome.

It’s worth noting that recall is not on the list. But neither is precision. We’re trying to measure the effectiveness of information seeking at a task level, not a query level. But it’s pretty easy to see how precision and recall fit into this scenario. Precision at the query level helps most with improving efficiency at the task level, while recall helps improve correctness of the outcome. Finally, perceived recall should help inspire user confidence in the outcome.
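To make the query-level measures concrete, here is a minimal sketch (in Python, with made-up document IDs) of set-based precision and recall for a single query; everything in it is a hypothetical illustration rather than anything from the essay.

```python
# Minimal sketch: set-based precision and recall for a single query.
# The document IDs below are purely hypothetical.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant (precision 0.75),
# but only 3 of the 6 relevant documents were found (recall 0.5).
print(precision_recall({"d1", "d2", "d3", "d4"},
                       {"d1", "d2", "d3", "d5", "d6", "d7"}))
```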

To circle back to the essay, I said that the authors were at least half right. They criticize the usefulness of recall for measuring ranked retrieval, and I think they have a point there–ranked retrieval is inherently more about precision than recall. Recall is much more useful as a set retrieval measure. The authors also note that “the idea that a single search will be used to find all relevant documents is simplistic.”

Indeed, I’d go beyond the authors and assert, straight from the HCIR gospel, that the idea that a single search will be used to fully address an information seeking problem is simplistic. But that assumption is the rut where most information retrieval research is stuck. The authors make legitimate points about the problems of recall as a measure, but I think they are missing the big picture. They do cite Tefko Saracevic; perhaps they should look more at his communication-based framework for thinking about relevance in the information seeking process.

By Daniel Tunkelang

High-Class Consultant.

19 replies on “In Defense Of Recall”

Thanks. The library and information scientists get that too, and indeed they take a very cognitive / user-centric approach to information seeking. I think most information retrieval researchers view the batch retrieval model as simply a necessary evil that makes their work measurable. But assumptions have consequences, and I think it’s a real problem that the IR community hasn’t established a consensus on how–or even whether–to measure retrieval effectiveness when multiple queries are involved.


I think the single-shot/batch issue is a bit of a red herring.

You can still do multi-shot query evaluation and measure it in terms of precision and recall. You might have to ask users to trawl through real results until they’ve completed their task, but at the end of the day the entire set of information that the user has examined can be thought of as a sequence or a list. And with that list, you can perform precision and recall calculations.

Well, you might have to change the names of the metrics to “viewed precision” and “viewed recall”. But the overall concept still applies.
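To illustrate what that session-level measurement might look like, here is a hedged sketch (in Python) that pools every document a user viewed across a multi-query session and scores the pool against the full relevant set. The session structure and document IDs are hypothetical, not taken from any actual evaluation.

```python
# Hedged sketch of "viewed precision" / "viewed recall": pool every document
# the user examined across a multi-query session, then score the pool against
# the full set of relevant documents. The data below is purely hypothetical.

def viewed_precision_recall(session_results, relevant):
    """session_results: per-query lists of documents the user actually viewed."""
    viewed = set()
    for query_results in session_results:
        viewed.update(query_results)
    relevant = set(relevant)
    hits = len(viewed & relevant)
    viewed_precision = hits / len(viewed) if viewed else 0.0
    viewed_recall = hits / len(relevant) if relevant else 0.0
    return viewed_precision, viewed_recall

session = [["d1", "d2", "d3"],   # query 1: documents the user viewed
           ["d3", "d7"],         # query 2: d3 seen again, counted once
           ["d9"]]               # query 3
print(viewed_precision_recall(session, {"d1", "d3", "d7", "d8"}))
# -> (0.6, 0.75): 3 of the 5 distinct viewed documents are relevant,
#    and 3 of the 4 relevant documents were eventually seen.
```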

However, even if you do measure it this way, there is still the issue of how well recall and/or precision approximate task completion (aka “information need satisfaction”), which we all agree is what we’re really after.

Yes, there is a danger of reifying the evaluation metric, whether that metric is precision, recall, nDCG, or whatever. But that temptation to reify is going to be true of any and every evaluation metric, no matter what it is. All metrics are only approximations. And as long as we don’t forget that, the approach that I favor is to let metrics proliferate. The more metrics you use, the more approximations and the more sample points you have, and that can only be a good thing.

That is, instead of being “against recall” because it is unclear whether one means cardinality, coverage, density, etc…why not be pro-recall, pro-cardinality, pro-coverage, pro-density, pro-precision, pro-etc.?


You can still do multi-shot query evaluation and measure it in terms of precision and recall.

I think it might be a bit more difficult to execute with rigor. I’d be interested in any reference.

All metrics are only approximations. And as long as we don’t forget that, the approach that I favor is to let metrics proliferate.

It would be interesting to do a survey and see whether metrics “proliferate”. If they do, then I agree we do not have a problem.


It would be interesting to do a survey and see whether metrics “proliferate”. If they do, then I agree we do not have a problem.

I don’t quite understand what you mean by doing a survey. Mine was a prescriptive suggestion, not a descriptive one.

But the suggestion does have precedent. In the early days of TREC (the early-to-mid 90s), there wasn’t just one way of presenting precision or recall. You had everything from MAP, to Prec@n, to R-prec, to recall@1000, to interpolated precision/recall (e.g. precision at 0.0 interpolated recall, at 0.1, etc.), and then later GMAP, and nDCG, and so on.
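For anyone who hasn’t worked with these, here is a rough sketch (in Python) of a few of the metrics named above (Prec@n, R-precision, and average precision), computed from a single ranked list with hypothetical relevance judgments; it is an illustration, not code from any TREC toolkit.

```python
# Sketch of a few classic ranked-retrieval metrics, computed from one ranked
# list and a hypothetical set of relevant document IDs.

def precision_at(ranked, relevant, n):
    """Prec@n: fraction of the top n results that are relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n

def r_precision(ranked, relevant):
    """R-prec: precision at rank R, where R is the number of relevant documents."""
    return precision_at(ranked, relevant, len(relevant))

def average_precision(ranked, relevant):
    """AP: average of Prec@k over the ranks k at which relevant documents appear."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d9", "d1", "d4", "d7"]       # hypothetical ranking
relevant = {"d1", "d3", "d7"}
print(precision_at(ranked, relevant, 3))      # 2/3
print(r_precision(ranked, relevant))          # 2/3
print(average_precision(ranked, relevant))    # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```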

And a lot of the early studies didn’t just report one number, one metric. A lot of the early studies reported *lots* of metrics.

It is only relatively recently, really since the popularization of the web and web search engines, that the number of reported metrics per paper seems to have decreased.

I still say, go with all of ’em. Give the biggest, most robust picture available.



@jeremy

My point was precisely as stated. If we have a proliferation of metrics, then we are not likely to have a problem.

Thus, in any field, not just IR, it would be interesting to do surveys to see whether people use and develop a wide range of metrics.

Then, if they don’t, we should investigate and determine whether it is a problem.

I think it is an interesting, generic, research program.


Area under the ROC is my favorite Precision/Recall mix metric at the moment.
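For concreteness, here is a small sketch (in Python) of ROC AUC computed rank-wise, as the probability that a randomly chosen relevant document is scored above a randomly chosen non-relevant one; the scores and relevance labels are invented for illustration.

```python
# Sketch: ROC AUC as the probability that a randomly chosen relevant (positive)
# document outscores a randomly chosen non-relevant (negative) one, with ties
# counted as half. Scores and labels below are invented.

def roc_auc(scores, labels):
    """scores: retrieval scores; labels: 1 = relevant, 0 = not relevant."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

print(roc_auc([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))  # -> 0.8333...
```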

Beyond that I think we still are dealing with context. Even the results are contextual.


My point was precisely as stated. If we have a proliferation of metrics, then we are not likely to have a problem.

Your point (I thought) was that if in the historical literature there is a proliferation, then we do not have a problem. (It would be interesting to do a survey and see whether metrics “proliferate”.)

My counterpoint is that we don’t have to do a survey, because we can proliferate the metrics ourselves. By writing papers that propose, and utilize, new metrics.

Recall and precision aside, one of the things I learned very early on in my IR career is that there is a strong (at least theoretical) willingness in the general community to accept any paper that uses a non-traditional evaluation metric, as long as that metric is justified and objective (not just made up to fit the data).

Granted, not every single reviewer feels this way, but I’ve had enough private conversations over the years to know that a lot of people do. In fact, I’ll venture out on a limb and make the observation that the younger the researcher, the more dogmatic he or she tends to be about the evaluation metric used. I’ve had many more conversations with IR “old timers” in which they’ve expressed willingness to let metrics proliferate.

And hypothetically, suppose almost no one allowed metrics to proliferate. Even if that were the case, it wouldn’t mean we can’t do good science and use good metrics anyway. Yes, it might be much more difficult to get papers accepted, and to get funding, etc. But that’s a constant problem that we face either way: convincing others of the value of our work.

But luckily, that’s not the case. I do find a willingness in the IR community to accept papers with non-traditional metrics.


I’ve mentioned (to Dan) before that faceting on some sites I use (e.g. NewEgg or Amazon) induces low recall because of data entry errors (e.g. listing the wrong DDR type for a memory chip, so I can’t find it when I search under the DDR3 facet).

I’ve blogged about high recall (and here and here).

Our NIH grant is focused on high recall linkage of the research literature to gene databases (and other databases). In the gene case, for any given release of the databases, the queries are fixed — they’re just the genes in the Entrez-Gene database. And we just want to find articles that mention them.

A serious issue is results diversity when trying to formalize useful recall. Both micro- and macro-averages are misleading, as is the pooled evaluation strategy of TREC, which tends to overestimate true recall (by assuming each true positive was found by at least one team in the top 1000 results).

It’s very easy to find some genes and very hard to find others. We can find all gazillion mentions of “p53” with high precision, but have a harder time with the mouse gene “at”. And that applies across aliases within genes, too; Alpha-1-antichymotrypsin is easy to find when referred to as “SERPINA3”, but much harder when called “ACT” (there are lots of all-caps titles, etc. in MEDLINE).
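To make the micro- versus macro-averaging point above concrete, here is a toy sketch (in Python) using these genes; the document counts are entirely invented for illustration and bear no relation to the real databases.

```python
# Toy illustration (invented counts) of how micro- and macro-averaged recall
# can tell very different stories when one query dominates the relevant counts.

queries = {
    # gene: (relevant documents found, total relevant documents)
    "p53":      (900, 1000),  # easy, very commonly mentioned gene
    "at":       (1, 20),      # hard, ambiguous mouse gene name
    "SERPINA3": (8, 10),
}

micro = (sum(found for found, _ in queries.values())
         / sum(total for _, total in queries.values()))
macro = sum(found / total for found, total in queries.values()) / len(queries)

print(f"micro-averaged recall: {micro:.3f}")  # ~0.882, dominated by p53
print(f"macro-averaged recall: {macro:.3f}")  # ~0.583, dragged down by 'at'
```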


Bob, sorry not to respond to this sooner–was a bit occupied with SIGIR. Your posts reinforce my feeling that the IR community has under-emphasized recall relative to its value in real-world problems. Of course, I blame this state of affairs on the laziness of people to think beyond ranked retrieval as experienced in web search. But I’m glad to see more people questioning the status quo, e.g., two of the speakers at the SIGIR Industry Track (whom I didn’t pay off!).


At the risk of being off topic, I want to point out that for technical questions like the medical example in the article, I would switch to something like Yahoo! Answers, after failing to find useful information for some time.

In library science, the first question to ask is not how to formulate a query, but where to look for the information. Cheers.


Le, not off topic at all–in fact, I did consider switching modes from information search to expert search, and would have done so in this case had there been time. Earlier in the pregnancy, we did consult a doctor in our social network to help make sense of the literature about various risks and trade-offs.


Dan, glad to know you already considered that option. When Googling fails, it is indeed difficult to find accurate and affordable answers quickly. Glad to know mother and baby are fine!

