It’s not every day that you see an essay in SIGIR Forum entitled “Against Recall”. Well, to be fair, the full title is “Against Recall: Is it Persistence, Cardinality, Density, Coverage, or Totality?” In it, Justin Zobel, Alistair Moffat, and Laurence Park, all researchers at the University of Melbourne, conclude that “the use of recall as a measure of the effectiveness of ranked querying is indefensible.”
It’s a well-written and well-argued essay, and I think the authors at least have it half-right. I agree with their claim that, while precision dominates quantitative analysis of search effectiveness in the research literature, the expressed concerns about recall tend to be more qualitative. Part of the problem, as they note, is that recall is much harder to evaluate than precision (assuming the Cranfield perspective that the relevance of a document to a query is objective).
The authors propose a variety of alternate measures that, in their view, are more useful than recall and are actually what authors really mean when they allude to recall. The most interesting of these, in my view, is what they call “totality”. Indeed, I thought the authors were addressing me personally when they wrote:
It is usual for certain “high recall applications” to be cited to rebut suggestions that recall is of little importance. Examples that are routinely given include searching for precedents in legal cases; searching for medical research papers with results that relate to a particular question arising in clinical practice; and searching to recover a set of previously observed documents.
Yup, I’m listening. They continue:
While we agree that these are plausible search tasks, we dispute that they are ones in which recall provides an appropriate measurement scale. We argue that what distinguishes these scenarios is that the retrieval requirement is binary: the user seeks total recall, and will be (quite probably) equally dissatisfied by any approach that leaves any documents unfound. In such a situation, obtaining (say) 90% recall rather than a mere 80% recall is of no comfort to the searcher, since it is the unretrieved documents that are of greatest concern to them, rather than the retrieved ones.
Whoa there, that’s quite a leap! Like total precision, total recall is certainly an aspiration (and a great Arnie flick), but not a requirement. There are lots of information retrieval applications where false negatives matter to us a lot more than false positives–notably in medicine, intelligence, and law. But often what is binary for us is not whether we find all of the “relevant” documents for each individual query–and here I use the scare quotes to assert the subjectivity and malleability of relevance–but rather whether or not we ultimately resolve our overall information need.
Let me use a concrete example from my own personal experience. When my wife was pregnant, she had gestational diabetes. She treated it through diet, and up through week 36 or so things were fine (modulo the trauma of a Halloween without candy). And then one of her doctors made an off-hand allusion to the risk of shoulder dystocia. She came home and told me this, and of course we spent the next several hours online trying to learn more. We had a very specific question: should we opt for a Cesarean section?
I can tell you that no search engine I used was particularly helpful in making this decision. I was hoping there might be analysis out there comparing the risks of shoulder dystocia with the risks associated with a Cesarean, particularly for women who have gestational diabetes. I couldn’t find any. But worse, I had no idea if there was helpful information out there, and I had no idea when to stop looking. Ultimately we took our chances, and everything turned out great–no shoulder dystocia, no Cesarean, and a beautiful, healthy baby and mother. But it would have been nice to feel that our decision was informed, rather than a nerve-wracking coin toss.
Let’s abstract from this concrete example and consider what I characterize as the information availability problem, where the information seeker faces uncertainty as to whether the information of interest is available at all. The natural evaluation measures associated with information availability are the correctness of the outcome (does the user correctly conclude whether the information of interest is available?); efficiency, i.e., the user’s time or labor expenditure; and the user’s confidence in the outcome.
It’s worth noting that recall is not on the list. But neither is precision. We’re trying to measure the effectiveness of information seeking at a task level, not a query level. But it’s pretty easy to see how precision and recall fit into this scenario. Precision at a query level helps most with improving efficiency at a task level, while recall helps improve correctness of outcome. Finally, perceived recall should help inspire user confidence in the outcome.
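For readers who want the measures pinned down, here is a minimal sketch of precision and recall computed over retrieved and relevant document sets–the set-retrieval view, as opposed to ranked lists. The document IDs are made up for illustration:

```python
# Set-based precision and recall (illustrative sketch; document IDs are invented).

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

relevant = {"d1", "d2", "d3", "d4", "d5"}   # what the assessors judged relevant
retrieved = {"d1", "d2", "d3", "d6"}        # what one query returned

print(precision(retrieved, relevant))  # 0.75 -> 1 in 4 results wasted the user's time
print(recall(retrieved, relevant))     # 0.6  -> 2 relevant documents never surfaced
```

The asymmetry is the point: precision is about the time the user spends on what came back, while recall is about what never came back at all–which is exactly why it bears on correctness of outcome rather than efficiency.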
To circle back to the essay, I said that the authors were at least half right. They criticize the usefulness of recall for measuring ranked retrieval, and I think they have a point there–ranked retrieval is inherently more about precision than recall. Recall is much more useful as a set retrieval measure. The authors also note that “the idea that a single search will be used to find all relevant documents is simplistic.”
Indeed, I’d go beyond the authors and assert, straight from the HCIR gospel, that the idea that a single search will be used to fully address an information seeking problem is simplistic. But that assumption is the rut where most information retrieval research is stuck. The authors make legitimate points about the problems of recall as a measure, but I think they are missing the big picture. They do cite Tefko Saracevic; perhaps they should look more at his communication-based framework for thinking about relevance in the information seeking process.