Faceted Search for the Web: A Grand Challenge?

At the HCIR workshop last month, one of the posters was from Microsoft researchers Jaime Teevan, Susan Dumais, and Zachary Gutt, entitled “Challenges for Supporting Faceted Search in Large, Heterogeneous Corpora like the Web”.

From the abstract:

Those [challenges] that we have identified stem from the fact that such datasets are 1) very large, making it difficult to assign quality meta-data to every document and to retrieve the full set of results and associated metadata at query time, and 2) heterogeneous, making it difficult to apply the same metadata to every result or every query.

Drilling further into their position paper reveals three challenges:

  • A lack of good, automatically generated metadata.
  • Uncertainty as to which facets will be most valuable for particular information needs.
  • The cost of dynamically computing facet distributions for result sets.

While these are all serious challenges, I feel that the position paper overstates them. Not all of the metadata needs to be generated automatically, and in any case there are lots of opportunities to crowd-source metadata creation. Facet selection is more challenging, but the work we’ve done at Endeca on query clarification suggests that this problem is tractable. Finally, the computational challenge strikes me as an artifact of today’s ranked retrieval systems for web search, which are a bad fit for what is essentially a set retrieval problem.
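To make the third challenge concrete, here is a minimal Python sketch of what computing facet distributions over a result set involves. It is an illustration only, not how Endeca or any web search engine does it: the `facet_counts` function, the toy `docs` collection, and the facet names (`type`, `domain`, `year`) are all hypothetical. The point is that the counts are defined over the whole set of matching documents, not just a ranked top ten, and that heterogeneous documents may simply lack a given facet.

```python
from collections import Counter, defaultdict


def facet_counts(results, facets):
    """Count how often each facet value occurs in a result set.

    `results` is an iterable of dicts (hypothetical document records).
    Documents that lack a facet are skipped for that facet, which is
    one simple way to cope with heterogeneous metadata.
    """
    counts = defaultdict(Counter)
    for doc in results:
        for facet in facets:
            value = doc.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return counts


# Toy collection with uneven metadata (all values are made up).
docs = [
    {"id": 1, "type": "pdf", "domain": "edu"},
    {"id": 2, "type": "html", "domain": "com"},
    {"id": 3, "type": "html", "domain": "edu", "year": 2009},
    {"id": 4, "type": "html"},
]

# Set retrieval: every matching document is part of the result,
# so the facet counts describe the full set.
result_set = [d for d in docs if d["type"] == "html"]
counts = facet_counts(result_set, ["domain", "year"])

for facet, values in counts.items():
    print(facet, dict(values))
```

The work here grows with the size of the full result set, which is why it looks expensive to a system built to return only the top few ranked hits.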

I’m not saying that this problem isn’t hard. In fact, I think that the authors neglect the biggest challenge, which is the adversarial nature of web search. Arguably this problem is an implicit (but unstated) aspect of the metadata problem–there will be too much of an incentive to game it.

Nonetheless, I think the time is ripe to consider faceted search approaches for large, heterogeneous corpora like the web. And perhaps we can work around the adversarial model while we’re at it. But that’s the subject for another post.

By Daniel Tunkelang

High-Class Consultant.

12 replies on “Faceted Search for the Web: A Grand Challenge?”

We’ve been thinking about this problem of heterogeneous documents for faceted search for a bit now, only with respect to filesystem search, rather than web. It’s pretty cool that they cited me, but they didn’t use this paper, which would have been more relevant.


Jonathan, thanks for posting the reference; I’m sorry I didn’t think to include it in the post myself. And I forgot that you are working with Ethan–please tell him I said hi!

And I agree that the problems of heterogeneity in file systems are comparable to those for the web. In fact, the file system case is harder in practice, as we see in the challenges of enterprise search vs. web search. Which makes me all the happier to see you and your colleagues offering constructive suggestions rather than just bailing on the problem.


I haven’t read that paper yet (I’ve been laying pretty low for a bit working on my thesis). Did it cite Bill Kules’s work on SERVICE? He had been working on faceted representations of web search. From what I’ve seen, though, including Exalead, it’s very hard to get past simple things like document type and domain suffix. The faceted approach that eBay put across their fairly broad set of content is to narrow down to a particular domain before showing domain-centric facets. It is an interesting topic, though.


We often take a similar approach at Endeca to what you describe at eBay. But I’ll be the first to admit that the web is more than a site with a “fairly broad set of content”.

But I agree that Exalead’s facets (at least from my experience with their public web search engine) seem a bit superficial. And I’ve seen clever use of resources like Wikipedia (e.g., by Duck Duck Go) to offer meaningful interaction to users based on aspects of the results.

So, while I concede that faceted search for the web is hard, I think it’s possible–and that we could do a lot better today.


[…] On one hand, consider two search engines whose interfaces are designed to support exploratory search: Cuil and Kosmix. Sometimes they’re great, e.g., [michael jackson] on Cuil and [iraq] on Kosmix. But look what can happen for queries that are further out in the tail, e.g., [faceted search] on Cuil and [real time search] on Kosmix. Yes, the kinds of queries I make. I don’t mean to knock these guys–they’re trying, and their efforts are admirable. Moreover, both generally return respectable search results on the first pages (in Kosmix’s case, through federation). But the search refinements can be way off, and that undermines the overall experience. I strongly suspect that the problem is one of data quality, along the lines of what others have argued. […]


[…] Overall it is a slick interface, and it’s nice seeing the various ideas Mohan and his colleagues put together. There’s certainly room for improvement–particularly in the quality of the categories, which sometimes feel like victims of polysemy. Open-domain information extraction is hard! Some would even call it a grand challenge. […]

