Migrating Soon

Just another reminder that I expect to migrate this blog to a hosted WordPress platform in the next few days. If you have opinions about hosting platforms, please let me know by commenting here. Right now, I’m debating between DreamHost and GoDaddy, but I’m very open to suggestions.

I will do everything in my power to minimize disruption–I’m not sure how easy Blogger will make it to redirect users to the new site. I’ll probably keep posting here for a while after the move to try to direct traffic.

I do expect the new site to be under a domain name I’ve already reserved: http://thenoisychannel.com. It currently forwards to Blogger.

Back from the Endeca Government Summit

I spent Thursday at the Endeca Government Summit, where I had the privilege to chat face-to-face with some Noisy Channel readers. Mostly, I was there to learn more about the sorts of information seeking problems people are facing in the public sector in general, and in the intelligence agencies in particular.

While I can’t go into much detail, the key concern was exploration of information availability. This problem is the antithesis of known-item search: rather than trying to retrieve information you know exists (and know how to specify), you are trying to determine whether there is information available that would help you with a particular task.

Despite being lost in a sea of TLAs, I came away with a deepened appreciation of both the problems the intelligence agencies are trying to address and the relevance of exploratory search approaches to those problems.

Query Elaboration as a Dialogue

I ended my post on transparency in information retrieval with a teaser: if users aren’t great at composing queries for set retrieval, which I argue is more transparent than ranked retrieval, then how will we ever deliver an information retrieval system that offers both usefulness and transparency?
The answer is that the system needs to help the user elaborate the query. Specifically, the process of composing a query should be a dialogue between the user and the system that allows the user to progressively articulate and explore an information need.
Those of you who have been reading this blog for a while or who are familiar with what I do at Endeca shouldn’t be surprised to see dialogue as the punch line. But I want to emphasize that the dialogue I’m describing isn’t just a back-and-forth between the user and the system. After all, there are query suggestion mechanisms that operate in the context of ranked retrieval algorithms–algorithms which do not offer the user transparency. While such mechanisms sometimes work, they risk doing more harm than good. Any interactive approach requires the user to do more work; if this added work does not result in added effectiveness, users will be frustrated.
That is why the dialogue has to be based on a transparent retrieval model–one where the system responds to queries in a way that is intuitive to users. Then, as users navigate in query space, transparency ensures that they can make informed choices about query refinement and thus make progress. I’m partial to set retrieval models, though I’m open to probabilistic ones. 
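To make this concrete, here is a minimal sketch of such a dialogue under a set retrieval model (my own toy example, not Endeca’s implementation). After each step, the system reports the exact matching set and offers refinement terms drawn from it; the corpus, the suggestion heuristic, and the simulated user choice are all invented for illustration.

    from collections import Counter

    # Toy corpus: document id -> text. Invented for illustration.
    corpus = {
        1: "exploratory search over faceted metadata",
        2: "exploratory search in digital libraries",
        3: "faceted navigation for online retail",
    }

    def matching_set(terms):
        """Set retrieval: the documents containing every query term."""
        return {d for d, text in corpus.items() if all(t in text.split() for t in terms)}

    def refinements(terms, k=3):
        """Naive heuristic: suggest frequent terms from within the current matching set."""
        counts = Counter(t for d in matching_set(terms)
                         for t in corpus[d].split() if t not in terms)
        return [t for t, _ in counts.most_common(k)]

    query = ["exploratory"]
    while True:
        matches = matching_set(query)
        print(f"query={query} -> {len(matches)} matching documents")  # transparent feedback
        options = refinements(query)
        if len(matches) <= 1 or not options:
            break
        query.append(options[0])  # stand-in for the user's informed choice
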
But of course we’ve just shifted the problem. How do we decide what query refinements to offer to a user in order to support this progressive refinement process? Stay tuned…
Migrating to WordPress

Just a quick note to let folks know that I’ll be migrating to WordPress in the next few days. I’ll make every effort to make the move seamless. I have secured the domain name http://thenoisychannel.com, which currently forwards to Blogger but will shift to wherever the blog is hosted. I apologize in advance for any disruption.

E-Discovery and Transparency

One change I’m thinking of making to this blog is to introduce “quick bites” as a way of mentioning interesting sites or articles I’ve come across without going into deep analysis. Here’s a first one to give you a flavor of the concept. Let me know what you think.

I just read an article, by way of Curt Monash, on how courts will tolerate search inaccuracies in e-Discovery. It reminded me of our recent discussion of transparency in information retrieval. I agree that “explanations of [search] algorithms are of questionable value” for convincing a court of the relevance and accuracy of the results. But that’s because those algorithms aren’t sufficiently intuitive for those explanations to be meaningful, except in a theoretical sense, to an information retrieval researcher.

I realize that user-entered Boolean queries (the traditional approach to e-Discovery) aren’t effective because users aren’t great at composing queries for set retrieval. But that’s why machines need to help users with query elaboration–a topic for an upcoming post.

POLL: Blogging Platform

I’ve gotten a fair amount of feedback suggesting that I switch blogging platforms. Since I plan to make such changes infrequently, I’d like to get input from readers before doing so, especially since migration may have hiccups.

I’ve just posted a poll on the home page to ask if folks here have a preference as to which blogging platform I use. Please vote this week, and feel free to post comments here.

Improving The Noisy Channel: A Call for Ideas

Over the past five months, this blog has grown from a suggestion Jeff Dalton put in my ear to a community to which I’m proud to belong.

Some milestones:

  • Over 70 posts to date.
  • 94 subscribers, as reported by Google Reader.
  • 100 unique visitors on a typical day.

To be honest, I thought I’d struggle to keep up with posting weekly, and that I’d need to convince my mom to read this blog so that I wouldn’t be speaking to an empty room. The results so far have wildly exceeded the expectations I came in with.

But now that I’ve seen the potential of this blog, I’d like to “take it to the next level,” as the MBA types say.

My goals:

  • Increase the readership. My motive isn’t (only) to inflate my own ego. I’ve seen that this blog succeeds most when it stimulates conversation, and a conversation needs participants.
  • Increase participation. Given the quantity and quality of comments on recent posts, it’s clear that readers here contribute the most valuable content. I’d like to step that up a notch by having readers guest-blog, perhaps even going as far as turning The Noisy Channel into a group blog about information seeking that transcends my personal take on the subject. I’m very open to suggestions here.
  • Add some style. Various folks have offered suggestions for improving the blog, such as changing platforms to WordPress, modifying the layout to better use screen real estate, adding more images, etc. I’m the first to admit that I am not a designer, and I’d really appreciate ideas from you all on how to make this site more attractive and usable.

In short, I’m asking you to help me help you make The Noisy Channel a better and noisier place. Please post your comments here or email me if you’d prefer to make suggestions privately.

Transparency in Information Retrieval

It’s been hard to find time to write another post while keeping up with the comment stream on my previous post about set retrieval! I’m very happy to see this level of interest, and I hope to continue catalyzing such discussions.

Today, I’d like to discuss transparency in the context of information retrieval. Transparency is an increasingly popular term these days in the context of search–perhaps not surprising, since users are finally starting to question the idea of search as a black box.

The idea of transparency is simple: users should know why a search engine returns a particular response to their query. Note the emphasis on “why” rather than “how”. Most users don’t care what algorithms a search engine uses to compute a response. What they do care about is how the engine ultimately “understood” their query–in other words, what question the engine thinks it’s answering.

Some of you might find this description too anthropomorphic. But a recent study reported that most users expect search engines to read their minds–never mind that the general case goes beyond AI-complete (should we create a new class of ESP-complete problems?). What frustrates users most is when a search engine not only fails to read their minds, but also gives no indication of where the communication broke down, let alone how to fix it. In short, a failure to provide transparency.

What does this have to do with set retrieval vs. ranked retrieval? Plenty!

Set retrieval predates the Internet by a few decades, and it was the first approach used to implement search engines. These search engines allowed users to enter queries by stringing together search terms with Boolean operators (AND, OR, etc.). Today, Boolean retrieval seems arcane, and most people see set retrieval as suitable for querying databases rather than search engines.

The biggest problem with set retrieval is that users find it extremely difficult to compose effective Boolean queries. Nonetheless, there is no question that set retrieval offers transparency: what you ask is what you get. And, if you prefer a particular sort order for your results, you can specify it.
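
To make that transparency concrete, here is a minimal sketch of Boolean set retrieval in Python. The toy corpus and query are invented for illustration; a real engine would add tokenization, stemming, and scale, but the set semantics would be the same.

    # Toy corpus: document id -> text. Invented for illustration.
    corpus = {
        1: "transparency in information retrieval",
        2: "ranked retrieval with relevance scores",
        3: "boolean queries over an inverted index",
    }

    # Build an inverted index: term -> set of ids of documents containing it.
    index = {}
    for doc_id, text in corpus.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def docs(term):
        return index.get(term, set())

    # Boolean operators map directly onto set operations, which is why
    # set retrieval is transparent: what you ask is what you get.
    result = (docs("retrieval") & docs("information")) | docs("boolean")
    print(sorted(result))  # [1, 3]

Sorting the result set by an attribute of your choice would cover the sort-order point as well.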

In contrast, ranked retrieval makes it much easier for users to compose queries: users simply enter a few top-of-mind keywords. And for many use cases (in particular, known-item search), a state-of-the-art implementation of ranked retrieval yields results that are good enough.

But ranked retrieval approaches generally shed transparency. At best, they employ standard information retrieval models that, although published in all of their gory detail, are opaque to their users–who are unlikely to be SIGIR regulars. At worst, they employ secret, proprietary models, either to protect their competitive differentiation or to thwart spammers.

Either way, the only clues that most ranked retrieval engines provide to users are text snippets from the returned documents. Those snippets may validate the relevance of the results that are shown, but the user does not learn what distinguishes the top-ranked results from other documents that contain some or all of the query terms.

If the user is satisfied with one of the top results, then transparency is unlikely to even come up. Even if the selected result isn’t optimal, users may do well to satisfice. But when the search engine fails to read the user’s mind, transparency offers the best hope of recovery.

But, as I mentioned earlier, users aren’t great at composing queries for set retrieval, which was how ranked retrieval became so popular in the first place despite its lack of transparency. How do we resolve this dilemma?

To be continued…

Set Retrieval vs. Ranked Retrieval

After last week’s post about a racially targeted web search engine, you’d think I’d avoid controversy for a while. To the contrary, I now feel bold enough to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I’ll start by introducing the terms.

  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.
  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.

An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.
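
As a rough illustration (my own sketch, not any particular engine’s implementation), the combined approach looks something like this, with a crude term-frequency count standing in for a real relevance model:

    # Toy corpus: document id -> text. Invented for illustration.
    corpus = {
        1: "set retrieval partitions the corpus",
        2: "ranked retrieval orders the whole corpus",
        3: "retrieval retrieval everywhere",
    }

    def matching_set(query_terms):
        """Set retrieval step: keep only the documents containing every query term."""
        return {doc_id for doc_id, text in corpus.items()
                if all(term in text.split() for term in query_terms)}

    def ranked(doc_ids, query_terms):
        """Ranking step: order the matching set by raw term frequency (a stand-in score)."""
        return sorted(doc_ids,
                      key=lambda d: sum(corpus[d].split().count(t) for t in query_terms),
                      reverse=True)

    matches = matching_set(["retrieval"])
    print(len(matches))                    # 3: the set size is meaningful on its own
    print(ranked(matches, ["retrieval"]))  # e.g. [3, 1, 2]: ranking reorders within the set

The order of operations is the point: the set is computed first and has meaning independent of the ranking, which merely reorders within it.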

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:

  • The number of documents reported to match my search should be meaningful–or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.
  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
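
The second expectation lends itself to a simple sanity check. A hypothetical sketch, where the document ids are a stand-in for a real retrieved set:

    import random

    matching = {101, 102, 103, 104, 105}  # stand-in for a set of documents deemed relevant

    def display(doc_ids, k=3):
        """Show k members of the matching set; under set retrieval, any k are defensible."""
        return random.sample(sorted(doc_ids), min(k, len(doc_ids)))

    print(display(matching))  # e.g. [101, 103, 105]: plausible, even if not optimal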

Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. This weakness makes it impossible to obtain even basic analysis of the query results, such as the number of relevant documents, let alone more sophisticated analysis, such as measures of result quality. A set retrieval model, in contrast, does not rank the retrieved documents; it establishes a clear split between the documents that are in the retrieved set and those that are not. As a result, set retrieval enables rich analysis of query results, which can then be applied to improve the user experience.

Back from the Cone of Silence

Regular readers may have noticed the lack of posts this week. My apologies to anyone who was waiting by the RSS feed. Yesterday was the submission deadline for HCIR ’08, which means that today is a new day! So please stay tuned for your regularly scheduled programming.