At the HCIR workshop last month, one of the posters was from Microsoft researchers Jaime Teevan, Susan Dumais, and Zachary Gutt, entitled “Challenges for Supporting Faceted Search in Large, Heterogeneous Corpora like the Web”.
From the abstract:
Those [challenges] that we have identified stem from the fact that such datasets are 1) very large, making it difficult to assign quality meta-data to every document and to retrieve the full set of results and associated metadata at query time, and 2) heterogeneous, making it difficult to apply the same metadata to every result or every query.
Drilling further into their position paper reveals three challenges:
- A lack of good, automatically generated metadata.
- Uncertainty as to which facets will be most valuable for particular information needs.
- The cost of dynamically computing facet distributions for result sets.
While these are all serious challenges, I feel that the position paper overstates them. Not all of the metadata needs to be generated automatically, and in any case there are lots of opportunities to crowd-source metadata creation. Facet selection is more challenging, but the work we’ve done at Endeca on query clarification suggests that this problem is tractable. Finally, the computational challenge strikes me as an artifact of today’s ranked retrieval systems for web search, which are a bad fit for what is essentially a set retrieval problem.
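To make the third challenge concrete: a facet distribution is a count of metadata values over the entire matching set, not just the top-ranked page. Here is a minimal sketch in Python. The documents, facet names, and the `facet_distributions` helper are all made up for illustration and are not taken from the position paper or from any particular search engine.

```python
from collections import Counter, defaultdict

# Hypothetical result set: each matching document carries its own facet metadata.
results = [
    {"id": 1, "facets": {"type": "paper", "year": "2008"}},
    {"id": 2, "facets": {"type": "blog", "year": "2008"}},
    {"id": 3, "facets": {"type": "paper"}},  # heterogeneous: no "year" facet
]

def facet_distributions(docs):
    """Count how often each value of each facet occurs in the matching set."""
    counts = defaultdict(Counter)
    for doc in docs:
        for facet, value in doc["facets"].items():
            counts[facet][value] += 1
    return counts

dist = facet_distributions(results)
for facet, values in dist.items():
    print(facet, dict(values))
```

The loop touches every matching document, so a naive implementation like this one costs time proportional to the size of the result set. A system built to return only the top few ranked hits never materializes that set.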
I’m not saying that this problem isn’t hard. In fact, I think that the authors neglect the biggest challenge, which is the adversarial nature of web search. Arguably this is an implicit aspect of the metadata problem: there will be too much of an incentive to game the metadata.
Nonetheless, I think the time is ripe to consider faceted search approaches for large, heterogeneous corpora like the web. And perhaps we can work around the adversarial model while we’re at it. But that’s the subject for another post.

