Your Input Really is Relevant!

For those who haven’t been following the progress on the Wikipedia entry for “Relevance (Information Retrieval)”, I’d like to thank Jon Elsas, Bob Carpenter, and Fernando Diaz for helping turn lead into gold.

Check out:

I’m proud of The Noisy Channel community for fixing one of the top two hits on Google for “relevance”.

Is Search Broken?

Last night, I had the privilege of speaking to fellow CMU School of Computer Science alumni at Fidelity’s Center for Advanced Technology in Boston. Dean Randy Bryant, Associate Director of Corporate Relations Dan Jenkins, and Director of Alumni Relations Tina Carr organized the event, and they encouraged me to pick a provocative subject.

Thus encouraged, I decided to ask the question: Is Search Broken?

Slides are here as a PowerPoint show for anyone interested, or use the embedded SlideShare show below.

Another HCIR Game

I just received an announcement from the SIG-IRList about the flickling challenge, a “game” designed around known-item image retrieval from Flickr. The user is given an image (not annotated) and the goal is to find the image again from Flickr using the system.

I’m not sure how well it will catch on with casual gamers–but that is hardly its primary motivation. Rather, the challenge was designed to help provide a foundation for evaluating interactive information retrieval–in a cross-language setting, no less. Details available at the iCLEF 2008 site or in this paper.

I’m thrilled to see efforts like these emerging to evaluate interactive retrieval–indeed, this feels like a solitaire version of Phetch.

The Magic Shelf

I generally shy away from pimping Endeca‘s customers here at The Noisy Channel, but occasionally I have to make an exception. As some of you may remember, Borders made a deal several years ago to have Amazon operate their web site. Last year, they decided to reclaim their site. And today they are live, powered by Endeca! For more details, visit http://blog.endeca.com.

Now back to our commercial-free programming…

Your Input is Relevant!

The following is a public service announcement.

As some of you may know, I am the primary author of the Human Computer Information Retrieval entry on Wikipedia. I created this entry last November, shortly after the HCIR ’07 workshop. One of the ideas we’ve tossed around for HCIR ’08 is to collaboratively edit the page. But why wait? With apologies to Isaac Asimov, I/you/we are Wikipedia, so let’s improve the entry now!

And, while you’ve got Wikipedia on the brain, please take a look at the Relevance (Information Retrieval) entry. After an unsuccessful attempt to have this entry folded into the main Information Retrieval entry, I’ve tried to rewrite it to conform to what I perceive as Wikipedia’s standards of quality and non-partisanship. While I tried my best, I’m sure there’s still room for improving it, and I suspect that some of you reading this are among the best qualified folks to do so!

As Lawrence Lessig says, it’s a read-write society. So readers, please help out a bit with the writing.

Games With an HCIR Purpose?

A couple of weeks ago, my colleague Luis Von Ahn at CMU launched Games With a Purpose.

Here is a brief explanation from the site:

When you play a game at Gwap, you aren’t just having fun. You’re helping the world become a better place. By playing our games, you’re training computers to solve problems for humans all over the world.

Von Ahn has made a career (and earned a MacArthur Fellowship) from his work on such games, most notably the ESP Game and reCAPTCHA. His games emphasize tagging tasks that are difficult for machines but easy for human beings, such as labeling images with high-level descriptors.

I’ve been interested in Von Ahn’s work for several years, and most particularly in a game called Phetch, which never quite made it out of beta but strikes me as one of the most ambitious examples of “human computation”. Here is a description from the Phetch site:

Quick! Find an image of Michael Jackson wearing a sailor hat.
Phetch is like a treasure hunt — you must find or help find an image from the Web.

One of the players is the Describer and the others are Seekers. Only the Describer can see the hidden image, and has to help the Seekers find it by giving them descriptions.

If the image is found, the Describer wins 200 points. The first to find it wins 100 points and becomes the new Describer.

A few important details that this description leaves out:

  • The Seeker (but not the Describer) has access to a search engine that has indexed the images based on results from the ESP Game.
  • A Seeker loses points (I can’t recall how many) for wrong guesses.
  • The game has a time limit (hence the “Quick!”).

Now, let’s unpack the game description and analyze it in terms of the Human-Computer Information Retrieval (HCIR) paradigm. First, let us simplify the game, so that there is only one Seeker. In that case, we have a cooperative information retrieval game, where the Describer is trying to describe a target document (specifically, an image) as informatively as possible, while the Seeker is trying to execute clever algorithms in his or her wetware to retrieve it. If we think in terms of a traditional information retrieval setup, that makes the Describer the user and the Seeker the information retrieval system. Sort of.

A full analysis of this game is beyond the scope of a single blog post, but let’s look at the game from the Seeker’s perspective, holding our assumption that there is only one Seeker, and adding the additional assumption that the Describer’s input is static and supplied before the Seeker starts trying to find the image.

Assuming these simplifications, here is how a Seeker plays Phetch:

  • Read the description provided by the Describer and use it to compose a search.
  • Scan the results sequentially, interrupting either to make a guess or to reformulate the search.

The key observation is that Phetch is about interactive information retrieval. A good Seeker recognizes when it is better to try reformulating the search than to keep scanning.
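
To make that trade-off concrete, here is a minimal sketch of the simplified, single-Seeker game as a decision loop. It is not the actual Phetch implementation; search_images, is_promising, reformulate, and check_guess are hypothetical stand-ins for the ESP-Game-backed image index, the Seeker’s own judgment, and the game server.

    # Toy model of the simplified, single-Seeker Phetch game; not the real game code.
    # search_images, is_promising, reformulate, and check_guess are hypothetical
    # callbacks standing in for the image index, the Seeker's judgment, and the game.
    def play_as_seeker(description, search_images, is_promising, reformulate,
                       check_guess, scan_depth=10, max_wrong_guesses=3, max_rounds=20):
        """Return the found image, or None if the Seeker gives up."""
        query = description                        # the initial search comes straight from the Describer
        rejected = set()                           # images already guessed and rejected
        wrong_guesses = 0

        for _ in range(max_rounds):                # the real game has a time limit; we cap rounds instead
            results = search_images(query)         # ranked list of candidate images

            guessed = False
            for image in results[:scan_depth]:     # scan the ranking sequentially from the top
                if image not in rejected and is_promising(image, description):
                    guessed = True
                    if check_guess(image):         # the game confirms or rejects the guess
                        return image
                    rejected.add(image)
                    wrong_guesses += 1             # wrong guesses cost points, so they are limited
                    break

            if wrong_guesses >= max_wrong_guesses:
                return None                        # too many penalties; give up
            if not guessed:
                # Nothing promising near the top of this ranking: reformulating
                # the search beats scanning any deeper.
                query = reformulate(query, description, results)

        return None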

Returning to our theme of evaluation, we can envision modifying Phetch to create a system for evaluating interactive information retrieval. In fact, I persuaded my colleague Shiry Ginosar, who worked with Von Ahn on Phetch and is now a software engineer at Endeca, to elaborate such an approach at HCIR ’07. There are a lot of details to work out, but I find this vision very compelling and perhaps a route to addressing Nick Belkin’s grand challenge.

Back from Orlando

I’m back from Endeca Discover ’08: two and a half days of presentations, superheroic attractions, and, in the best tradition of The Noisy Channel, karaoke. A bunch of us tried our best to blog the presentations at http://blog.endeca.com/.

All in all, a fun, exhausting time, but it’s good to be back home. So, for those of you who have noticed the lack of posts in your RSS feeds, I promise I’ll start making it up to you in the next few days.

Attending Endeca Discover ’08

I’ll be attending Endeca Discover ’08, Endeca’s annual user conference, from Sunday, May 18th to Wednesday, May 21st, so you might see a bit of a lull in my verbiage here while I live blog at http://blog.endeca.com and hang out in sunny Orlando with Endeca customers and partners.

If you’re attending Discover, please give me a shout and come to my sessions:

Otherwise, I’ll do my best to sneak in a post or comment, and I’ll be back in full force later next week.

A Utilitarian View of IR Evaluation

In many information retrieval papers that propose new techniques, the authors validate those techniques by demonstrating improved mean average precision over a standard test collection. The value of such results–at least to a practitioner–hinges on whether mean average precision correlates to utility for users. Not only do user studies place this correlation in doubt, but I have yet to see an empirical argument defending the utility of average precision as an evaluation measure. Please send me any references if you are aware of them!
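
For readers who have not worked with the measure, here is a minimal sketch of how average precision and MAP fall out of a ranked result list and binary relevance judgments. It is the textbook definition rather than any particular toolkit’s implementation (real evaluations typically use tools like trec_eval), and the document ids in the example are made up.

    def average_precision(ranking, relevant):
        """Textbook average precision for one query: average the precision at each
        rank that holds a relevant document, dividing by the total number of
        relevant documents so that unretrieved relevant documents count against you."""
        relevant = set(relevant)
        if not relevant:
            return 0.0
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank       # precision at this rank
        return precision_sum / len(relevant)

    def mean_average_precision(runs):
        """MAP over a test collection: runs is a list of (ranking, relevant_set)
        pairs, one per topic."""
        return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

    # Made-up example: two relevant documents, retrieved at ranks 1 and 4.
    # AP = (1/1 + 2/4) / 2 = 0.75
    print(average_precision(["d1", "d7", "d3", "d2"], {"d1", "d2"}))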

Of course, user studies are fraught with complications, the most practical one being their expense. I’m not suggesting that we need to replace Cranfield studies with user studies wholesale. Rather, I see the purpose of user studies as establishing the utility of measures that can then be evaluated by Cranfield studies. As with any other science, we need to work with simplified, abstract models to achieve progress, but we also need to ground those models by validating them in the real world.

For example, consider the scenario where a collection contains no documents that match a user’s need. In this case, it is ideal for the user to reach this conclusion as accurately, quickly, and confidently as possible. Holding the interface constant, are there evaluation measures that correlate to how well users perform on these three criteria? Alternatively, can we demonstrate that some interfaces lead to better user performance than others? If so, can we establish measures suitable for those interfaces?

The “no documents” case is just one of many real-world scenarios, and I don’t mean to suggest we should study it at the expense of all others. That said, I think it’s a particularly valuable scenario that, as far as I can tell, has been neglected by the information retrieval community. I use it to drive home the argument that practical use cases should drive our process of defining evaluation measures.

Thinking about IR Evaluation

I just read the recent Information Processing & Management special issue on Evaluation of Interactive Information Retrieval Systems. The articles were a worthwhile read, and yet they weren’t exactly what I was looking for. Let me explain.

In fact, let’s start by going back to Cranfield. The Cranfield paradigm offers us a quantitative, repeatable means to evaluate information retrieval systems. Its proponents make a strong case that it is effective and cost-effective. Its critics object that it measures the wrong thing because it neglects the user.

But let’s look a bit harder at the proponents’ case. The primary measure in use today is average precision–indeed, most authors of SIGIR papers validate their proposed approaches by demonstrating increased mean average precision (MAP) over a standard test collection of queries. The dominance of average precision as a measure is no accident: it has been shown to be the best single predictor of the precision-recall graph.

So why are folks like me complaining? There are the various user studies asserting that MAP does not predict user performance on search tasks. Those have me at hello, but the studies are controversial in the information retrieval community, and in any case not constructive.

Instead, consider a paper by Harr Chen and David Karger (both at MIT) entitled "Less is more." Here is a snippet from the abstract:

Traditionally, information retrieval systems aim to maximize the number of relevant documents returned to a user within some window of the top. For that goal, the probability ranking principle, which ranks documents in decreasing order of probability of relevance, is provably optimal. However, there are many scenarios in which that ranking does not optimize for the user’s information need.

Let me rephrase that: the precision-recall graph, which indicates how well a ranked retrieval algorithm does at ranking relevant documents ahead of irrelevant ones, does not necessarily characterize how well a system meets a user’s information need.

One of Chen and Karger’s examples is the case where the user is only interested in retrieving one relevant document. In this case, a system does well to return a diverse set of results that hedges against different possible query interpretations or query processing strategies. The authors also discuss more general scenarios, along with heuristics to address them.
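
To see the “only one relevant document” point concretely, here is a toy calculation with made-up numbers (mine, not the paper’s): a query has two plausible interpretations, each result serves exactly one of them, and we compare a ranking that bets everything on the likely interpretation against one that hedges.

    # Toy illustration, with made-up numbers, of why ranking purely by probability
    # of relevance can be suboptimal when the user needs only one relevant document.
    INTENT_PROBS = {"A": 0.75, "B": 0.25}   # assumed distribution over query interpretations

    def prob_at_least_one_relevant(top_k, intent_probs):
        """Chance that a user drawn from intent_probs finds at least one relevant
        document among the top k, where each result serves a single interpretation."""
        covered = set(top_k)
        return sum(p for intent, p in intent_probs.items() if intent in covered)

    prp_ranking = ["A", "A", "A"]       # probability ranking principle: bet everything on the likely intent
    hedged_ranking = ["A", "A", "B"]    # diversified: cover the minority interpretation too

    print(prob_at_least_one_relevant(prp_ranking, INTENT_PROBS))     # 0.75
    print(prob_at_least_one_relevant(hedged_ranking, INTENT_PROBS))  # 1.0
    # The PRP ranking still maximizes the expected number of relevant results in the
    # top three (2.25 vs 1.75 here), which is exactly the mismatch between
    # precision-style measures and the "find me one good document" need.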

But the main contribution of this paper, at least in my eyes, is a philosophical one. The authors consider the diversity of user needs and offer a quantitative, repeatable way to evaluate information retrieval systems with respect to different needs. Granted, they do not even consider the challenge of evaluating interactive information retrieval. But they do set a good example.

Stay tuned for more musings on this theme…