Categories
General

CIKM 2011 Industry Event: Stephen Robertson on Why Recall Matters

On October 27th, I had the pleasure to chair the CIKM 2011 Industry Event with former Endeca colleague Tony Russell-Rose. It is my pleasure to report that the program, held in parallel with the main conference sessions, was a resounding success. Since not everyone was able to make it to Glasgow for this event, I’ll use this and subsequent posts to summarize the presentations and offer commentary. I’ll also share any slides that presenters made available to me.

Microsoft researcher Stephen Robertson, who may well be the world’s preeminent living researcher in the area of information retrieval, opened the program with a talk on “Why Recall Matters”. For the record, I didn’t put him up to this, despite my strong opinions on the subject.

Stephen started by reminding us of ancient times (i.e., before the web), when at least some IR researchers thought in terms of set retrieval rather than ranked retrieval. He reminded us of the precision and recall “devices” that he’d described in his Salton Award Lecture — an idea he attributed to the late Cranfield pioneer Cyril Cleverdon. He noted that, while set retrieval uses distinct precision and recall devices, ranking conflates both into the decision of where to truncate a ranked result list. He also pointed out an interesting asymmetry in the conventional notion of the precision-recall tradeoff: while returning more results can only increase recall, there is no certainty that the additional results will decrease precision. Rather, this decrease is a hypothesis that we associate with systems designed to implement the probability ranking principle, returning results in decreasing order of probability of relevance.
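To make the truncation decision concrete, here is a minimal sketch (the relevance judgments are invented for illustration) that computes precision and recall at every cutoff of a ranked result list. It shows the asymmetry Stephen described: recall can only rise as the cutoff grows, while precision can move in either direction.

```python
# Precision and recall at each cutoff k of a ranked result list.
# rels[i] = 1 if the i-th ranked document is relevant (toy judgments).
rels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
total_relevant = sum(rels)

for k in range(1, len(rels) + 1):
    retrieved_relevant = sum(rels[:k])
    precision = retrieved_relevant / k               # fraction of retrieved that are relevant
    recall = retrieved_relevant / total_relevant     # fraction of relevant that are retrieved
    print(f"k={k:2d}  precision={precision:.2f}  recall={recall:.2f}")
```

Running this, recall is non-decreasing in k by construction, while precision fluctuates with each non-relevant document encountered; the "precision decreases as you go deeper" pattern only holds if the ranking really does put more probably relevant documents first.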

He went on to remind us that there is information retrieval beyond web search. He hauled out the usual examples of recall-oriented tasks: e-discovery, prior art search, and evidence-based medicine. But he then made the case that not only is the web not the only problem in information retrieval, but that “it’s the web that’s strange” relative to the rest of the information retrieval landscape in so strongly favoring precision over recall. He enumerated some of the peculiarities of the web, including its size (there’s only one web!), the extreme variation in authorship and quality, the lack of any content standardization (efforts like schema.org notwithstanding), and the advertising-based monetization model that creates an unusual and sometimes adversarial relationship between content owners and search engines. In particular, he cited enterprise search as an information retrieval domain that violates the assumptions of web search and calls for more emphasis on recall.

Stephen suggested that, rather than thinking in terms of the precision-recall curve, we consider the recall-fallout curve. Fallout is a relatively unknown measure that represents the probability that a non-relevant document is retrieved by the query. He noted that fallout has found little practical use in IR, given that the corpus is populated almost entirely by non-relevant documents. Still, he made the case that the recall-fallout tradeoff might be conceptually more appropriate than the precision-recall curve for understanding the value of recall.

In particular, we can generalize the traditional inverse precision-recall relationship to the hypothesis that the recall-fallout curve is convex (details in “On score distributions and relevance”). We can then calculate instantaneous precision at any point in the result list as the gradient of the recall-fallout curve. Going back to the notion of devices, we can now replace precision devices with fallout devices.

Stephen wrapped up his talk by emphasizing the user of information retrieval systems — an aspect of IR that is too often neglected outside HCIR circles. He advocated that systems provide users with evidence of recall, guidance on how far to go down the ranked results, and a prediction of the recall at any given stopping point.

It was an extraordinary privilege to have Stephen Robertson present at the CIKM Industry Event, and even better to have him make a full-throated argument in favor of recall. I can only hope that researchers and practitioners take him up on it.


Entities, Relationships, and Semantics: Strata NY Panel on the State of Structured Search

Earlier this year, I had the privilege to moderate a panel at Strata New York 2011 on Entities, Relationships, and Semantics: the State of Structured Search. The four panelists are people I’ve had the pleasure to work with over the years: Andrew Hogue (Google), Breck Baldwin (Alias-i), Evan Sandhaus (New York Times), and Wlodek Zadrozny (IBM Research). They work on some of the world’s largest structured search problems — from offering users structured search on Google’s web corpus to building a computing system that defeated Jeopardy! champions in an extreme test of natural language understanding.

O’Reilly has compiled the nearly 50 hours of video from the conference and made the collection available for purchase. I was lucky to attend all of the keynotes and many of the breakout sessions, and I highly recommend them. In the meantime, you can see a recording of the panel I moderated.


Interview in Forbes: What is a Data Scientist?

Dan Woods has been interviewing a variety of folks to answer the question: “What is a data scientist?”, and I had the honor to participate in his series.

Here is a teaser of my interview:

Above all, a data scientist needs to be able to derive robust conclusions from data. But a data scientist also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve that will create value for users and drive business decisions.

Read the rest on Forbes.com. And thanks to Drew Conway for the awesome data science Venn diagram above.