It’s been hard to find time to write another post while keeping up with the comment stream on my previous post about set retrieval! I’m very happy to see this level of interest, and I hope to continue catalyzing such discussions.
Today, I’d like to discuss transparency in the context of information retrieval. Transparency is an increasingly popular term in the world of search–perhaps not surprising, since users are finally starting to question the idea of search as a black box.
The idea of transparency is simple: users should know why a search engine returns a particular response to their query. Note the emphasis on “why” rather than “how”. Most users don’t care what algorithms a search engine uses to compute a response. What they do care about is how the engine ultimately “understood” their query–in other words, what question the engine thinks it’s answering.
Some of you might find this description too anthropomorphic. But a recent study reported that most users expect search engines to read their minds–never mind that the general case goes beyond AI-complete (should we create a new class of ESP-complete problems?). What frustrates users most is when a search engine not only fails to read their minds, but gives no indication of where the communication broke down, let alone how to fix it. In short, a failure to provide transparency.
What does this have to do with set retrieval vs. ranked retrieval? Plenty!
Set retrieval predates the Internet by a few decades, and was the first approach used to implement search engines. These search engines allowed users to enter queries by stringing together search terms with Boolean operators (AND, OR, etc.). Today, Boolean retrieval seems arcane, and most people see set retrieval as suitable for querying databases, rather than for querying search engines.
The biggest problem with set retrieval is that users find it extremely difficult to compose effective Boolean queries. Nonetheless, there is no question that set retrieval offers transparency: what you ask is what you get. And, if you prefer a particular sort order for your results, you can specify it.
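To make the transparency of set retrieval concrete, here is a minimal sketch of Boolean retrieval over a toy inverted index. The documents and terms are invented for illustration; a real engine would add tokenization, normalization, and a full query parser, but the "what you ask is what you get" property is already visible:

```python
# Toy corpus; document IDs and text are purely illustrative.
docs = {
    1: "cheap flights to london",
    2: "cheap hotels in london",
    3: "flights to paris",
}

# Build an inverted index: term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def AND(a, b):
    # Set intersection: documents containing BOTH terms.
    return index.get(a, set()) & index.get(b, set())

def OR(a, b):
    # Set union: documents containing EITHER term.
    return index.get(a, set()) | index.get(b, set())

print(sorted(AND("cheap", "london")))   # [1, 2]
print(sorted(OR("flights", "hotels")))  # [1, 2, 3]
```

The result set is exactly the set the query describes–no hidden scoring decides what the user sees, which is precisely the transparency argument.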
In contrast, ranked retrieval makes it much easier for users to compose queries: users simply enter a few top-of-mind keywords. And for many use cases (in particular, known-item search), a state-of-the-art implementation of ranked retrieval yields results that are good enough.
But ranked retrieval approaches generally shed transparency. At best, they employ standard information retrieval models that, although published in all of their gory detail, are opaque to their users–who are unlikely to be SIGIR regulars. At worst, they employ secret, proprietary models, either to protect their competitive differentiation or to thwart spammers.
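To illustrate why even a published model is opaque in practice, here is a sketch of ranked retrieval using a simple TF-IDF-style score–one of the standard textbook models, and not necessarily what any particular engine uses. The corpus and smoothing choices are assumptions for the example:

```python
import math

# Same kind of toy corpus as before; text is purely illustrative.
docs = {
    1: "cheap flights to london",
    2: "cheap hotels in london cheap deals",
    3: "flights to paris",
}
N = len(docs)

def idf(term):
    # Inverse document frequency: rarer terms weigh more.
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log((N + 1) / (df + 1)) + 1  # smoothed to avoid log(0)

def score(query, text):
    # Sum of term frequency * idf over query terms; absent terms add 0.
    words = text.split()
    return sum(words.count(t) * idf(t) for t in query.split())

query = "cheap london"
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)  # [2, 1, 3]
```

Document 2 outranks document 1 only because "cheap" appears twice in it–a fact the result page never surfaces. The user sees an ordering but not the term weights and frequency counts that produced it, which is the opacity the paragraph above describes.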
Either way, the only clues that most ranked retrieval engines provide to users are text snippets from the returned documents. Those snippets may validate the relevance of the results that are shown, but the user does not learn what distinguishes the top-ranked results from other documents that contain some or all of the query terms.
If the user is satisfied with one of the top results, then transparency is unlikely to even come up. Even if the selected result isn’t optimal, users may do well to satisfice. But when the search engine fails to read the user’s mind, transparency offers the best hope of recovery.
But, as I mentioned earlier, users aren’t great at composing queries for set retrieval–which is how ranked retrieval became so popular in the first place, despite its lack of transparency. How do we resolve this dilemma?
To be continued…