Set Retrieval vs. Ranked Retrieval

Post author By Daniel Tunkelang
Post date August 24, 2008
30 Comments on Set Retrieval vs. Ranked Retrieval

After last week’s post about a racially targeted web search engine, you’d think I’d avoid controversy for a while. To the contrary, I now feel bold enough to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I’ll start by introducing the terms.

In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.
In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
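
To make the two definitions concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy `term_overlap` scorer, the `threshold` parameter, and the three-document corpus are hypothetical stand-ins, not a description of how any real engine estimates relevance.

```python
def term_overlap(query: str, doc: str) -> float:
    """Toy relevance estimate (hypothetical): fraction of query terms found in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def ranked_retrieval(query: str, corpus: list[str]) -> list[str]:
    """Order every document in the corpus by estimated relevance, best first."""
    return sorted(corpus, key=lambda d: term_overlap(query, d), reverse=True)


def set_retrieval(query: str, corpus: list[str], threshold: float = 1.0):
    """Partition the corpus into (relevant, not relevant) without ordering either part."""
    relevant = [d for d in corpus if term_overlap(query, d) >= threshold]
    not_relevant = [d for d in corpus if term_overlap(query, d) < threshold]
    return relevant, not_relevant


corpus = [
    "ranked retrieval orders documents",
    "set retrieval partitions documents",
    "cooking with garlic",
]

ranking = ranked_retrieval("set retrieval", corpus)            # all 3 docs, ordered
relevant, not_relevant = set_retrieval("set retrieval", corpus)  # a 1 / 2 split
```

Note that the ranked version returns the whole corpus, including the cooking document, while the set version commits to a boundary.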

An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:

The number of documents reported to match my search should be meaningful, or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.
Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
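
The second expectation can be checked with a small experiment, sketched below in Python. The document IDs, the `judged_relevant` set, and the scores are all made up for illustration; in practice the judgments would have to come from human assessors or some other ground truth.

```python
import random


def top_k(matching: list[str], scores: dict[str, float], k: int) -> list[str]:
    """The k highest-scoring documents from the matching set."""
    return sorted(matching, key=lambda d: scores[d], reverse=True)[:k]


def random_k(matching: list[str], k: int, seed: int = 0) -> list[str]:
    """A random subset of the matching set (seeded so runs are repeatable)."""
    rng = random.Random(seed)
    return rng.sample(matching, min(k, len(matching)))


def precision(shown: list[str], judged_relevant: set[str]) -> float:
    """Fraction of the displayed documents that are judged relevant."""
    if not shown:
        return 0.0
    return sum(1 for d in shown if d in judged_relevant) / len(shown)


# Hypothetical judgments: four documents are actually relevant.
judged_relevant = {"d1", "d2", "d3", "d4"}

# A tight matching set (set retrieval in practice) versus a loose one
# where ranking has to do all the work.
tight_set = ["d1", "d2", "d3", "d4"]
loose_set = tight_set + [f"x{i}" for i in range(16)]
scores = {d: (1.0 if d in judged_relevant else 0.1) for d in loose_set}

tight_random_precision = precision(random_k(tight_set, 3), judged_relevant)
loose_top_precision = precision(top_k(loose_set, scores, 4), judged_relevant)
loose_set_precision = precision(loose_set, judged_relevant)
```

With the tight set, any random sample is all relevant, so ranking only refines the order. With the loose set, only a fifth of the matches are relevant, so a random sample is mostly noise and the ranking is what separates relevant from irrelevant.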

Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. Without that split, even basic analysis of the query results, such as counting the relevant documents, is out of reach, let alone more complicated analysis, such as assessing result quality. A set retrieval model, by contrast, commits to a boundary: each document is either in the retrieved set or out of it. The set need not be ranked at all, although ranking can still be applied within it. Because the boundary is explicit, set retrieval models enable rich analysis of query results, which can then be applied to improve the user experience.
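
As a rough illustration of the kind of analysis a well-defined result set allows, the Python sketch below counts the matches and breaks them down by a metadata field. The document records, the `year` field, and the substring match on `title` are hypothetical; the point is only that such summaries are meaningful when membership in the set is meaningful.

```python
from collections import Counter


def facet_counts(matching: list[dict], field: str) -> Counter:
    """Count how many matching documents carry each value of a metadata field."""
    return Counter(doc[field] for doc in matching if field in doc)


# Hypothetical corpus with a little metadata per document.
docs = [
    {"title": "Set retrieval basics", "year": 2007},
    {"title": "Ranked retrieval models", "year": 2008},
    {"title": "Retrieval evaluation", "year": 2008},
    {"title": "Garlic recipes", "year": 2008},
]

# Stand-in matching rule: the title mentions "retrieval".
matching = [d for d in docs if "retrieval" in d["title"].lower()]

summary = {
    "count": len(matching),
    "by_year": facet_counts(matching, "year"),
}
```

If the matching set were padded with marginal documents, both the count and the per-year breakdown would describe the padding rather than the answer to the query.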


30 replies on “Set Retrieval vs. Ranked Retrieval”

See, this is interesting. One of the things that set retrieval would enable is ranking by novelty. The idea, to be clear, is to put on the first page of results the 10 that, together, cover the most breadth of a topic. I think it would be really interesting, as part of this, to see novelty-oriented result information, so that the snippet of text you see covers what that result covers that the other results do not. If your dataset has keywords etc. too, then you could show the keywords that it has that the other results do not, perhaps. I've not done anything specific in this area myself, but I've seen some guys from Spain send out a few papers, including some IP&M papers. In particular, I spoke to David Losisa's student about the idea at SIGIR07. I suspect it might be more applicable to vertical search. I guess it has most applicability to domain learning tasks, as prioritizing novelty might exclude your answer that was in result 2 from the relevance ranking.
