I haven’t posted any ramblings about information retrieval theory in a while. Some of you might be grateful for this lull, but this post is for those of you who miss such thoughts. Everyone else: you’ve been warned!
Here’s what I’ve been thinking about. At one extreme, we have set retrieval, which, given a query, divides a corpus into two subsets corresponding to those documents the system believes to be relevant and those it does not–a binary split. At the other extreme, we have ranked retrieval, which orders documents according to their estimated likelihood of relevance. Given the poor reputation of extremism, I want to explore the space between these extremes.
In both extreme cases, the system returns an ordered sequence of subsets of the corpus, and I propose we consider this as a general framework, which we might call ranked set retrieval. In the first case, the system returns two sets; in the second case, it returns as many singleton sets as there are documents in the corpus. In practice, of course, even ranked retrieval systems tend to dismiss some subset of the corpus as irrelevant, which we can model in our ranked set retrieval framework by appending that subset to the end of the ranked sequence of singletons.
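To make the framework concrete, here is a minimal sketch in Python. It assumes each document already has a relevance score; the function names, the `threshold` parameter, and the `cutoff` parameter are hypothetical and only serve to show that both extremes produce the same kind of object, an ordered list of sets.

```python
from typing import Dict, List, Optional, Set

# A ranked set retrieval response: an ordered sequence of subsets of the corpus.
RankedSets = List[Set[str]]


def as_set_retrieval(scores: Dict[str, float], threshold: float) -> RankedSets:
    """Set retrieval: two subsets, believed relevant and believed not relevant."""
    relevant = {doc for doc, score in scores.items() if score >= threshold}
    return [relevant, set(scores) - relevant]


def as_ranked_retrieval(scores: Dict[str, float],
                        cutoff: Optional[int] = None) -> RankedSets:
    """Ranked retrieval: one singleton per document, best first.

    If `cutoff` is given (an assumption for illustration), only the top
    `cutoff` documents are ranked and the rest are appended as one final
    subset of dismissed documents.
    """
    # Sort by descending score; break ties by document id for determinism.
    ordered = sorted(scores, key=lambda doc: (-scores[doc], doc))
    kept = ordered if cutoff is None else ordered[:cutoff]
    dismissed = [] if cutoff is None else ordered[cutoff:]
    response: RankedSets = [{doc} for doc in kept]
    if dismissed:
        response.append(set(dismissed))
    return response
```

Both functions return a `RankedSets` value, so any evaluation measure defined over that type could, in principle, be applied to either kind of system.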
Now that we can consider set retrieval and ranked retrieval in the same framework, we can ask interesting questions and reason about how they should inform the evaluation criteria for information retrieval systems.
For example, when is set retrieval a more appropriate response to a query than ranked retrieval? An easy, though only partial, answer is evident from symmetry: set retrieval is more appropriate in cases where our estimates of relevance are themselves binary, and where we thus have no principled basis for a finer-grained partition. Hence, given such binary relevance assessments, our retrieval algorithm should recognize that the optimal response is to return two subsets. Conversely, the more fine-grained our estimates of relevance, the stronger our basis for returning more subsets and including those documents estimated to be more relevant in earlier subsets. At the extreme, the relevance estimates for all documents may be so well separated that the optimal response is, in fact, to return a sequence of singleton sets as per conventional ranked retrieval.
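One way to picture this is a partition driven by how well separated the relevance estimates are. The sketch below is only an illustration under my own assumptions: the `min_gap` parameter is hypothetical, and starting a new subset whenever adjacent scores differ by at least `min_gap` is just one simple rule among many possible ones.

```python
from typing import Dict, List, Set


def bucket_by_gap(scores: Dict[str, float], min_gap: float) -> List[Set[str]]:
    """Partition documents into ranked subsets by score separation.

    Documents are sorted by descending score. A new subset starts whenever
    the score drops by at least `min_gap` from the previous document;
    otherwise the document joins the current subset.
    """
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    buckets: List[Set[str]] = []
    previous = None
    for doc, score in ordered:
        if previous is None or previous - score >= min_gap:
            buckets.append({doc})    # well separated: start a new subset
        else:
            buckets[-1].add(doc)     # too close to call: same subset
        previous = score
    return buckets
```

With binary estimates this rule yields two subsets, with well-separated estimates it yields singletons, and with mixed estimates it yields something in between. Note that it compares each score only to its immediate neighbor, so a long run of closely spaced scores ends up in a single subset even if its endpoints are far apart.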
Of course, the interesting cases are in between, i.e., where the optimal response to a query is a collection of subsets corresponding to varying ranges of relevance assessment. Or perhaps we should go beyond bucketing by relevance estimates, and instead optimize for the probability that one of the offered subsets has a high utility reflecting a combination of precision and recall. We could then order the subsets by their utility. In fact, a utility measure for such an approach could be recursive, since each subset is really a subquery or query refinement that can then be partitioned into ranked subsets. Indeed, such a recursive approach closely models the behavior we see with information retrieval systems that support interaction.
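As a rough sketch of ordering subsets by utility, the code below uses the standard F-measure as one possible combination of precision and recall. This is a stand-in of my choosing, not a claim about the right utility function; it also assumes the relevant documents are known, as in an evaluation setting, and leaves out the probabilistic and recursive aspects. The function names are hypothetical.

```python
from typing import List, Set


def f_measure(subset: Set[str], relevant: Set[str], beta: float = 1.0) -> float:
    """F-measure of a subset against known relevant documents (0.0 if no hits)."""
    hits = len(subset & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(subset)
    recall = hits / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def order_by_utility(subsets: List[Set[str]], relevant: Set[str],
                     beta: float = 1.0) -> List[Set[str]]:
    """Return the candidate subsets sorted by descending F-measure.

    Python's sort is stable, so subsets with equal utility keep their
    original relative order.
    """
    return sorted(subsets, key=lambda s: f_measure(s, relevant, beta), reverse=True)
```

A recursive variant would score each subset by the best utility achievable after partitioning it again, but that requires a model of how users choose among the offered subsets, which this sketch does not attempt.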
Why does this subject concern me so much? It’s not just that I’d like to see robust evaluation measures for faceted search and clustering–I’d like to see measures that are able to compare them against ranked retrieval in a common framework, without having to depend on user studies.
Perhaps I’m naively rediscovering paths already explored by folks like Yi Zhang and Jonathan Koren. Their notion of “expected utility based evaluation” does strike a chord. But I don’t see them or anyone else taking the next step and using such an approach to compare the apples and oranges of set and ranked retrieval methods. It’s a missed opportunity, and maybe even a way to bring IR respectability to approaches designed for interactive and exploratory search. If IR can’t come to HCIR, perhaps HCIR can come to IR.