Categories
General

I’d Like To Have An Argument Please

If you Google [relevance theory], you’ll discover this Wikipedia entry about a theory proposed by Dan Sperber and Deirdre Wilson arguing that, in any given communication situation, the listener will stop processing as soon as he or she has found meaning that fits his or her expectation of relevance. The Wikipedia entry offers the following example of this principle:

Mary: Would you like to come for a run?

Bill: I’m resting today.

We understand from this example that Bill does not want to go for a run. But that is not what he said: he said only enough for Mary to supply the context-mediated premise that someone who is resting doesn’t usually go for a run, and thus to infer that Bill doesn’t want to go for a run today.

This theory may call to mind the Gricean Maxims — indeed, Sperber and Wilson borrow heavily from Grice’s work.

But I mainly bring up relevance theory to introduce Sperber to those unfamiliar with him. My friend (and Endeca co-founder) Pete Bell recently called to my attention an article by neuroscientist Jonah Lehrer entitled “The Reason We Reason”. The article reviews the “hot hand” fallacy and then proceeds to cite a new theory by Sperber and Hugo Mercier:

Reasoning is generally seen as a means to improve knowledge and make better decisions. Much evidence, however, shows that reasoning often leads to epistemic distortions and poor decisions. This suggests rethinking the function of reasoning. Our hypothesis is that the function of reasoning is argumentative. It is to devise and evaluate arguments intended to persuade.

The full article by Mercier and Sperber runs over 17,000 words and is entitled “Why do humans reason? Arguments for an argumentative theory”.

As someone who has spent most of his professional life thinking about information retrieval in practical contexts, I automatically relate relevance theory to relevance in the context of information retrieval. Relevance has been a subject of intense debate in the information science community (Tefko Saracevic tells the story wonderfully). Indeed, a key reason that I created the HCIR workshop was the belief that information retrieval researchers and practitioners (i.e., search engine developers) were placing too much emphasis on an objective notion of topical relevance, and not enough focus on the user.

Mercier and Sperber’s theory offers an interesting challenge to information retrieval researchers: perhaps a user’s information need is less about arriving at the truth and more about finding confirmatory evidence to support a preconceived conclusion. If so, should we adjust our notions of relevance accordingly? Also, if we evaluate or inform search quality based on observed user behavior (such as click-through behavior), then are we already inadvertently conflating topical relevance with users’ confirmatory bias?

Many people have noted that personalization gives us the truth we want: recent examples include Robin Sloan and Matt Thompson’s EPIC 2014 and Eli Pariser’s The Filter Bubble. Despite the consensus that over-fitting information access to our personal tastes is a bad thing (perhaps even dystopian), technology seems to relentlessly push us in this direction. Moreover, some degree of personalization is clearly useful — such as prioritizing information that relates to our personal and professional interests.

Nonetheless, anyone working in the area of information seeking systems should be concerned with the question of the user’s goal in using that system. Many of us take for granted that the user’s main goal is truth seeking, and we design our systems accordingly. What can or should we do differently if the user’s main goal is not informative but persuasive? Is the user looking for an answer…or an argument?

Going Public

What a day! I’ve been excited about LinkedIn from the moment I joined — and for several years before that — but today has been a unique experience. I hope our celebration extends beyond LinkedIn’s employees and investors — this is a great day for Silicon Valley, for the data scientists who are building its most valuable companies, and for the users who are benefiting from it all. I am proud and deeply grateful to be a part of this extraordinary adventure. My thanks to my hundreds of incredible colleagues and to the 100M users who have made it possible.

ps. Yes, we are still hiring, so please contact me if you’re the kind of person who loves turning data into gold. And if you are local, check out Christos Faloutsos’s upcoming tech talk on Mining Billion Node Graphs, which will take place at LinkedIn on June 2 and is open to the public.

In Search Of Structure

A couple of weeks ago, I participated in a summit that Greylock Partners organized for its portfolio companies at LinkedIn to discuss the power of data. Invited participants represented some of the most interesting “big data” companies in Silicon Valley, including Google, Facebook, Pandora, Cloudera, and Zynga. Discussion took place under the Chatham House Rule, so I’m not at liberty to share much detail. But I can say that there were energetic conversations about metrics, tools, and (of course) hiring.

One of the participants was Google researcher Alon Halevy, who generously shared his presentation on Fusion Tables and gave me permission to re-share it here.

Fusion Tables allow the general public to upload, visualize, and share structured data. They are particularly useful for journalists who want to distill compelling stories from data — indeed, The Guardian’s Simon Rogers has used Fusion Tables to visualize and interpret everything from nuclear power plant accidents to Wikileaks.

After his presentation, I asked Alon why we haven’t seen an encyclopedic structured data repository comparable in scope and scale to Wikipedia. Alon offered that structured data is brittle — its value tends to depend more on context than that of the unstructured content that populates Wikipedia. I agree in part — for example, consider this map of Brooklyn bus stops that were slated for elimination last summer. Such data is useful in a narrow context, but hardly encyclopedic.

But what about Freebase and DBpedia? Freebase is an open repository of structured data associated with about 20 million topics. DBpedia describes itself as “a community effort to extract structured information from Wikipedia and to make this information available on the Web.” While these tools have seen some use by developers (especially in the semantic web community), they have not achieved mainstream adoption. Perhaps data marketplaces like Factual and Infochimps will be successful as for-profit businesses, but the question remains why we don’t have a Wikipedia-scale success story for public structured data.

I think the problem is easiest to frame in information retrieval terms. Wikipedia is all about precision, but not so much about recall. Let me elaborate.

Wikipedia represents a collective attempt to achieve precision at the level of individual entries. Contributors and editors correct mistakes and argue over the details of content and tone. But coverage is a much lower priority. When in doubt, the Wikipedia collective assumes that information is not notable enough to justify inclusion. Thus Wikipedia errs on the side of precision rather than recall when it comes to meeting the information needs of its users.

This arrangement works well for a typical web user who seeks out information by using Google web search as an interface to discover Wikipedia articles. But structured data is about sets, not just individuals. It does me no good to see aggregate statistics about a set of entities if the set is erratically populated (e.g., Wikipedia’s list of companies established in 1999 or Freebase’s list of those founded after 2000).
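To make the set-retrieval point concrete, here is a minimal sketch (with made-up numbers, not drawn from Wikipedia or Freebase) of how a repository can score well on precision while recall, and hence any aggregate statistic computed from the set, suffers:

```python
# Hypothetical example: the true set of companies established in 1999
# versus the subset a structured repository actually lists.
relevant = {f"company_{i}" for i in range(100)}   # ground truth: 100 companies
listed = {f"company_{i}" for i in range(40)}      # repository covers only 40 of them
listed |= {"spurious_a", "spurious_b"}            # plus a couple of mistaken entries

true_positives = len(relevant & listed)
precision = true_positives / len(listed)    # fraction of listed entries that are correct
recall = true_positives / len(relevant)     # fraction of the true set that is listed

print(f"precision = {precision:.2f}")  # 40/42 ≈ 0.95: each individual entry looks trustworthy
print(f"recall    = {recall:.2f}")     # 40/100 = 0.40: any count or average over the set is skewed
```

High precision makes each individual entry reliable, which is all a reader of a single Wikipedia article needs; but 40% recall means any question about the set as a whole (how many companies? what is their average size?) gets a badly distorted answer.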

In the June 2009 SIGIR Forum, University of Melbourne researchers Justin Zobel, Alistair Moffat, and Laurence Park argued “against recall”, concluding that they could find “no justification for implicit or explicit use of recall as a measure of search satisfaction.” I posted a rebuttal entitled “In Defense of Recall”, arguing that recall is much more useful as a measure for set retrieval than for ranked retrieval. Revisiting this argument two years later, I can see that it holds even more strongly if we are interested in structured data where we want to reason about aggregate properties of sets.

Back when we both worked at Endeca, my colleague Rob Gonzalez described structured data repositories as a public good that no one is ever willing to pay for. I’m an optimist by nature, but in this case I fear he has a point. It takes a lot of work to build something useful, and no one seems to have addressed the challenge of motivating people to contribute that work, whether for economic or altruistic reasons.

Or perhaps we’ll just have to wait for the holy grail of information extraction algorithms to structure the world’s information for us? Ironically, that’s not even included on Wikipedia’s list of AI-complete problems.

Announcing HCIR 2011!

As regular readers know, I’ve been co-organizing annual workshops on Human-Computer Interaction and Information Retrieval since creating the first HCIR workshop in 2007. These have been a huge success, not only bridging the gap between IR and HCI, but also bringing together researchers and practitioners to address concerns shared by both communities. Past keynote speakers have included such information science luminaries as Susan Dumais, Ben Shneiderman, and Dan Russell.

Every workshop has improved on the previous year’s, and HCIR 2011, which will take place on Thursday, October 20, will be no exception.

Our venue will be Google’s headquarters in Mountain View, California. We could hardly imagine a more appropriate venue: Google has done more than any other company to contribute to everyday information access. Google has been extremely generous as a host and sponsor (other sponsors include Endeca and Microsoft Research), and its location in the heart of Silicon Valley is ideal for attracting researchers and practitioners building the future of HCIR.

Our keynote speaker will be Gary Marchionini, Dean of the School of Information and Library Science at the University of North Carolina at Chapel Hill. Gary coined the phrase “human–computer information retrieval” in a lecture entitled “Toward Human-Computer Information Retrieval”, in which he asserted that “HCIR aims to empower people to explore large-scale information bases but demands that people also take responsibility for this control by expending cognitive and physical energy.” We are honored to have Gary deliver this year’s keynote.

But of course the main attraction is the contribution of participants. This year we invite three types of papers: position papers, research papers and challenge reports. Possible topics for discussion and presentation at the workshop include, but are not limited to:

  • Novel interaction techniques for information retrieval.
  • Modeling and evaluation of interactive information retrieval.
  • Exploratory search and information discovery.
  • Information visualization and visual analytics.
  • Applications of HCI techniques to information retrieval needs in specific domains.
  • Ethnography and user studies relevant to information retrieval and access.
  • Scale and efficiency considerations for interactive information retrieval systems.
  • Relevance feedback and active learning approaches for information retrieval.

Demonstrations of systems and prototypes are particularly welcome.

Building on the success of the last year’s HCIR Challenge to address historical exploration of a news archive, this year’s HCIR Challenge will focus on the problem of information availability. The corpus for the Challenge will be the CiteSeer digital library of scientific literature.

For more information about the workshop, including how to submit papers or participate in the challenge, please visit the HCIR 2011 website.

Here are the key dates for submitting position and research papers:

  • Submission deadline (position and research papers): July 31
  • Notification of acceptance decision: September 8
  • Presentations and poster session at workshop: October 20

Key dates for Challenge participants:

  • Request access to corpus (contact me) deadline: June 19
  • Freeze system and submit brief description: September 25
  • Submit videos or screenshots demonstrating systems on example tasks: October 9
  • Live demonstrations at workshop: October 20

I’m looking forward to this year’s submissions, and to a great workshop in October. I hope to see many of you there!