In the early 1990s, PARC researchers Peter Pirolli and Stuart Card developed the theory of information scent (more generally, information foraging) to evaluate user interfaces in terms of how well users can predict which paths will lead them to useful information. Like many HCIR researchers and practitioners, I’ve found this model to be a useful way to think about interactive information seeking systems.
Specifically, faceted search is an exemplary application of the theory of information scent. Faceted search allows users to express an information need as a keyword search, providing them with a series of opportunities to improve the precision of the initial result set by restricting it to results associated with particular facet values.
For example, if I’m looking for folks to hire for my team, I can start my search on LinkedIn with the keywords [information retrieval], restrict my results to Location: San Francisco Bay Area, and then further restrict to School: CMU.
Precision / Recall Asymmetry
Faceted search is a great tool for information seeking systems. But it offers a flow that is asymmetric with respect to precision and recall.
Let’s invert the flow of faceted search. Rather than starting from a large, imprecise result set and progressively narrowing it; let’s start from a small, precise result set and progressively expand it. Since faceted search is often called “guided navigation” (a term Fritz Knabe and I coined at Endeca), let’s call this approach “guided exploration” (which has a nicer ring than “guided expansion”).
Guided exploration exchanges the roles of precision and recall. Faceted search starts with high recall and helps users increase precision while preserving as much recall as possible. In contrast, guided exploration starts with high precision and helps users increase recall while preserving as much precision as possible.
That sounds great in theory, but how can we implement guided exploration in practice?
Let’s remind ourselves why faceted search works so well. Faceted search offers the user information scent: the facet values help the user identify regions of higher precision relative to his or her information need. By selecting a sequence of facet values, the user arrives at a non-empty set that consists entirely or mostly of relevant results.
How to Expand a Result Set
How do we invert this flow? Just as enlarging an image is more complicated than reducing one, increasing recall is more complicated than increasing precision.
If our initial set is the result of selecting multiple facet values, then we may be able to increase recall by de-selecting facet values (e.g., de-selecting San Francisco Bay Area and CMU in my previous example). If we are using hierarchical facets, then rather than de-selecting a facet value, we may be able to replace it with a parent value (e.g., replacing San Francisco Bay Area with California). We can also remove one or more search keywords to broaden the results (e.g., information or retrieval).
Those are straightforward query relaxations. But there are more interesting ways to expand our results:
- We can replace a facet value with the union or that value and similar values (e.g., replacing CMU with CMU OR MIT).
- We can replace the entire query (or any subquery) with a union of that query and the results for selecting a single facet value (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR Company: Google)
- We can replace the entire query (or any subquery) with a union of that query and the results for a keyword search a single facet value (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR [faceted search]).
As we can see, there are many ways to progressively refine a query in a way that expands the result set. The question is how we provide users with options that increase recall while preserving as much precision as possible.
Frequency : Recall :: Similarity : Precision
Developers of faceted search systems don’t necessarily invest much thought into deciding which faceted refinement options to present to users. Some systems simply avoid dead ends, offer user all refinement options that least to a non-zero result set. This approach breaks down when there are too many options, in which case most systems offer users the most frequent facet values. A chapter in my faceted search book discusses some other options.
Unfortunately, the number of options for guided exploration — at least if we go beyond the very limited basic options — is too vast to apply such a naive approach. Unions never lead to dead ends, and we don’t have a simple measure like frequency to rank our options.
Or perhaps we do. A good reason to favor frequent values as faceted refinement options is that they tend to preserve recall. What we need is a measure that tends to preserving precision when we expand a result set.
That measure is set similarity. More specifically, it is the asymmetric similarity between a set and a superset containing it, which we can think of as the former’s representativeness of the latter. If we are working with facets, we can measure this similarity in terms of differences between distributions of the facet values. If the current set has high precision, we should favor supersets that are similar to it in order to preserve precision.
I’ll spare readers the math, but I encourage you to read about Kullback-Leibler divergence and Jensen-Shannon divergence if you are not familiar with measures of similarity between probability distributions. I’m also glossing over key implementation details — such as how to model distributions of facet values as probability distributions, and how to handle smoothing and normalization for set size. I’ll try to cover these in future posts. But for now, let’s assume that we can measure the similarity between a set and a superset.
Guided Exploration: A General Framework
We now have the elements to put together a general framework for guided exploration:
- Generate a set of candidate expansion options from the current search query using operations such as the following:
- De-select a facet value.
- Replace a facet value with its parent.
- Replace a facet value with the union of it and other values from that facet.
- Remove a search keyword.
- Replace a search keyword with the union of it and related keywords.
- Replace the entire query with the union of it and a related facet value selection.
- Replace the entire query with the union of it and a related keyword search.
- Evaluate each expansion option based on the similarity of the resulting set to the current one.
- Present the most similar sets to the user as expansion options.
It’s one thing to tell a user that two sets are distributionally similar based on an information-theoretic measure, and another to communicate that similarity in a language the user can understand. Here is an example of visualizing the similarity between [information retrieval] AND School: CMU and [information retrieval] AND School: (CMU or MIT):
As we can see from even this basic visualization, replacing CMU with (CMU OR MIT) increases the number of results by 70% while keeping a similar distribution of current companies — the notable exception being people who work for their almae matres.
Faceted search offers some of the most convincing evidence in favor of Gary Marchionini‘s advocacy that we “empower people to explore large-scale information bases but demand that people also take responsibility for this control”. Guided exploration aims to generalize the value proposition of faceted search by inverting the roles of precision and recall. Given the importance of recall, I hope to see progress in this direction. If this is a topic that interests you, give me a shout. Especially if you’re a student looking for an internship this summer!