Search User Interfaces and Data Quality

By Daniel Tunkelang
December 3, 2009

One of the many things I’ve enjoyed in my first few weeks of working at Google is the opportunity to talk with many people who care about user interfaces and think about HCIR. Indeed, some of the folks working on “more and better search refinements” are just steps away from my desk. Very cool!

But working on the inside has also helped me appreciate what Bob Wyman tried to tell me months ago: that Google has no philosophical predilection towards black box approaches, but rather is limited only by what technology makes possible and what its engineers can implement. I’d qualify that slightly by saying that I perceive an additional constraint: Google does have a strong predilection towards data-driven decisions. Some folks have found that approach objectionable in the context of interface design.

Anyway, if you’re a regular here, then you’re probably predisposed towards HCIR and exploratory search. In that case, I’d like to take a moment to help you appreciate the challenge I face on a day-to-day basis.

Which of these two statements do you agree with more?

We need better data quality in order to support richer search user interfaces.
Richer search user interfaces allow us to overcome data quality limitations.

On one hand, consider two search engines whose interfaces are designed to support exploratory search: Cuil and Kosmix. Sometimes they’re great, e.g., [michael jackson] on Cuil and [iraq] on Kosmix. But look what can happen for queries that are further out in the tail, e.g., [faceted search] on Cuil and [real time search] on Kosmix. Yes, the kinds of queries I make. 🙂 I don’t mean to knock these guys–they’re trying, and their efforts are admirable. Moreover, both generally return respectable search results on the first pages (in Kosmix’s case, through federation). But the search refinements can be way off, and that undermines the overall experience. I strongly suspect that the problem is one of data quality, along the lines of what others have argued.

On the other hand, some of the work that I did with colleagues at Endeca (e.g., work presented at HCIR 2008 on “Supporting Exploratory Search for the ACM Digital Library”) at least dangles the possibility that the second statement holds–namely, a richer user interface could help overcome data quality limitations. Interaction draws more of the information need out of the user, and the process may be able to mask imperfection in the data. For example, it’s clear to users–and clear from the search refinements–that [michael jackson beer] and [michael jackson -beer] are about different people. If we can just get that incremental information from the user, we don’t have to achieve perfection in named entity recognition and disambiguation.
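To make the [michael jackson beer] example concrete, here is a minimal sketch in Python. It is not how Google, Endeca, or anyone else implements refinements; the toy corpus, the document ids, and the `parse_query` and `search` helpers are all made up for illustration. The point is only that one extra term from the user splits an ambiguous result set cleanly, with no entity disambiguation in the index.

```python
def parse_query(query):
    """Split a query into required terms and excluded terms (prefixed with '-')."""
    required, excluded = set(), set()
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            required.add(token)
    return required, excluded


def search(query, docs):
    """Return sorted ids of docs containing every required term and no excluded term."""
    required, excluded = parse_query(query)
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if required <= words and not (excluded & words):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical toy corpus: two different people share one name, and the
# "index" is just a bag of words with no notion of who is who.
DOCS = {
    "pop-1": "michael jackson thriller album pop singer",
    "pop-2": "michael jackson moonwalk dance pop",
    "beer-1": "michael jackson beer critic whisky writer",
    "beer-2": "michael jackson guide to belgian beer",
}

ambiguous = search("michael jackson", DOCS)           # mixes both people
with_beer = search("michael jackson beer", DOCS)      # the beer writer
without_beer = search("michael jackson -beer", DOCS)  # the pop singer

print(ambiguous)
print(with_beer)
print(without_beer)
```

The two refined queries return disjoint sets that together cover the ambiguous one. Of course, this only works because the toy data happens to contain a clean distinguishing term; with noisier data, the refinement the interface offers may itself be wrong, which is exactly the tension between the two statements above.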

I think there’s some truth in both arguments. Data quality is a major bottleneck for effectively delivering an exploratory search experience, and data quantity, much as it helps, is not a guarantee of quality. Richer interfaces offer the enticing possibility of leveraging human computation, but they also introduce the risk of disappointing and alienating users. Even for an HCIR zealot like me, the constraints of reality are sobering.

And yes, speed and computational cost matter too. But hey, it wouldn’t be a grand challenge if it were easy!


29 replies on “Search User Interfaces and Data Quality”

Daniel, another approach to richer search user interfaces is to turn a certain class of data-quality problems — search-user errors — to your advantage. Google clearly does this with “Did you mean:” suggestions, presumably derived by mining query logs. This business is described by Marti Hearst in her book Search User Interfaces, although rather than pointing you directly there, I’ll point you to a recent article of mine, Text Data Quality, because I look at a selection of larger issues.

But regarding “better data quality in order to support richer search user interfaces,” it seems to me that the limitations shown by your Cuil and Kosmix examples are *analytical* limitations related to semantic integration of information from disparate, distributed sources. That is, their data does not seem to be of low quality; rather, the assembly of search findings can be questionable.

I give a similar example, the ability of Nielsen Buzzmetrics’ Blogpulse application to handle search-term variants, in another article, Text Data Quality: Mistakes and More.

The other search data quality issue is in selecting source materials and processing them to create an index in order to avoid, taking a recent example, the situation where Jews are listed as a cause for AIDS, in that case because the engine didn’t distinguish “aids” and “AIDS”. I have an article coming out next week on this subtopic.
