HCIR 2010: Bigger and Better than Ever!

Last Sunday was HCIR 2010, the Fourth Annual Workshop on Human-Computer Interaction and Information Retrieval, held at Rutgers University in New Brunswick, co-located with the Information Interaction in Context Symposium (IIiX 2010).

With 70 registered attendees, it was the biggest HCIR workshop we have held. Rutgers was a gracious host, providing space not only for the all-day workshop but also for a welcome reception the night before.

And, based on an informal survey of participants, I can say with some semblance of objectivity that this was the best HCIR workshop to date.

The opening “poster boaster” session was particularly energetic. There was no award for best boaster, but Cathal Hoare won an ovation by delivering his boaster as a poem:

If a picture is worth a thousand words
Surely to query formulation a photo affords
The ability to ask ‘what is that’ in ways that are many
But for years we have asked how can-we
Narrow the search space so that in reasonable time
We can use images to answer questions that are yours and mine
In my humble poster I will describe
How recent technology and users prescribe
A solution that allows me to point and click
And get answers so that I don’t feel so thick
About my location and my environment
And to my touristic explorations bring some enjoyment
Now if after all that you feel rather dazed
Please come by my poster and see if you are amazed….

As in past years, we enlisted a rock-star keynote speaker–this time, Google UX researcher Dan Russell. His slides hardly do justice to his talk–especially without the audio and video–but I’ve embedded them here so that you can get a flavor of his presentation on how we need to do more to improve the searcher, not just the search engine.

http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hcir-keynote-talk-russell-aug-22-2010-100827000301-phpapp01&stripped_title=dan-russell-search-quality-and-user-happiness

We accepted six papers for the presentation sessions–sadly, one of the presenters could not make it because of visa issues. The five presentations covered a variety of topics relating to tools, models, and evaluation for HCIR. The most intriguing of these (to me, at least) was a presentation by Max Wilson about “casual-leisure searching”–which he argues breaks our current models of exploratory search. Check out the slides below, as well as Erica Naone’s article in Technology Review on “Searching for Fun”.

http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hcir2010pres-100824083643-phpapp02&stripped_title=hcir2010-casualleisure-search

As always, the poster session was the most interactive. Part of the energy came from HCIR Challenge participants showing off their systems in advance of the final session that would decide which of them would win. In any case, I felt like a heel having to walk through the hall of posters three times in order to herd people back to their seats.

Which brings us to the Challenge. When I first suggested the idea of a competition or challenge to my co-organizers back in February, I wasn’t sure we could pull it off. Indeed, even after we managed to obtain the use of the New York Times Annotated Corpus (thank you, LDC!) and a volunteer to set up a baseline system in Solr (thank you, Tommy!), I still worried that we’d have a party and no one would come. So I was delighted to see six very credible entries competing for the “people’s choice” award.
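
For the curious, here is what querying such a baseline might look like: a minimal sketch against Solr’s standard select handler. The core name (nyt) and the field names (headline, pub_date) are placeholders of mine, not the actual challenge configuration:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Solr core indexing the NYT Annotated Corpus.
SOLR_URL = "http://localhost:8983/solr/nyt/select"

def search(query, rows=10):
    """Run a keyword query against the baseline index and return documents."""
    params = urllib.parse.urlencode({
        "q": query,                  # the user's keyword query
        "fl": "headline,pub_date",   # stored fields to return (assumed names)
        "rows": rows,                # number of results
        "wt": "json",                # ask Solr for a JSON response
    })
    with urllib.request.urlopen(f"{SOLR_URL}?{params}") as response:
        return json.load(response)["response"]["docs"]

for doc in search("milosevic"):
    print(doc.get("pub_date"), doc.get("headline"))
```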

All of the participants offered interesting ideas: custom facets, visualization of the associations between relevant terms, multi-document summarization to catch up on a topic, and combining topic modeling with sentiment analysis to analyze competing perspectives on a controversial issue. The winning entry, presented by Michael Matthews of Yahoo! Labs Barcelona, was the Time Explorer. As its name suggests, it allows users to see the evolution of a topic over time. A cool feature is that it parses absolute and relative dates from article text–in some cases references to past or future times outside the publication span of the collection. Moreover, the temporal visualization of topics allows users to discover unexpected relationships between entities at particular points in time, e.g., between Slobodan Milosevic and Saddam Hussein. You can read more about it in Tom Simonite’s Technology Review article, “A Search Service that Can Peer into the Future”.
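
To make the date-parsing idea concrete: the trick is to resolve relative expressions against the article’s publication date, so that a 1998 article mentioning “next year” contributes a point for 1999 on the timeline. Here is a toy sketch of that resolution step; the patterns and function are my own illustration, not Time Explorer’s actual implementation:

```python
import re
from datetime import date

# Toy patterns; a real system would use a full temporal tagger.
ABSOLUTE = re.compile(r"\b(\d{4})\b")                        # e.g. "in 1991"
RELATIVE = re.compile(r"\b(last|next)\s+(year|decade)\b", re.IGNORECASE)

def resolve_years(text, pub_date):
    """Map date expressions in an article to years, anchored at pub_date."""
    years = [int(y) for y in ABSOLUTE.findall(text)]
    span = {"year": 1, "decade": 10}
    for direction, unit in RELATIVE.findall(text):
        sign = -1 if direction.lower() == "last" else 1
        years.append(pub_date.year + sign * span[unit.lower()])
    return sorted(set(years))

# An article published in 1998, referring to 1991 and to "next year" (1999):
print(resolve_years("After the 1991 war, analysts expect talks next year.",
                    date(1998, 3, 14)))  # -> [1991, 1999]
```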

In short, HCIR 2010 will be a tough act to follow. But we’re already working on it. Watch this space…

Exploring Nuggetize

I’ve been exchanging emails with Dhiti co-founder Bharath Mohan about Nuggetize, an intriguing interface that surfaces “nuggets” from a site to reduce the user’s cost of exploring a document collection. Specifically, Nuggetize targets research scenarios where users are likely to assemble a substantial reading list before diving into it. You can try Nuggetize on the general web or on a particular site that has been “nuggetized”, e.g., a blog like this one or Chris Dixon’s.

I’m always happy to see people building systems that explicitly support exploratory search (and am looking forward to seeing the HCIR Challenge entries in a week!). Regular readers may recall my coverage of Cuil, Kosmix, and Duck Duck Go. And of course I helped build a few of my own at Endeca. So what’s special about Nuggetize?

Mohan describes it as a faceted search interface for the web. I’ll quibble here–the interface offers grouped refinement options, but the groups don’t really strike me as facets. Moreover, the interface isn’t really designed to explore intersections of the refinement options–rather, at any given time, you see the intersection of the initial search and a currently selected refinement. But it is certainly an interface that supports query refinement and exploration.
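
To make the quibble concrete, here is a toy contrast between true faceted navigation, which intersects all of the user’s active selections, and what I observed in Nuggetize, where the result set is the initial query intersected with a single selection. The documents and fields are invented for illustration:

```python
# Invented toy collection: document id -> field values.
docs = {
    1: {"topic": "search", "year": "2010"},
    2: {"topic": "search", "year": "2009"},
    3: {"topic": "nlp",    "year": "2010"},
}
query_hits = {1, 2, 3}  # pretend all three match the initial query

def faceted(selections):
    """True faceted navigation: intersect *all* active selections."""
    return {d for d in query_hits
            if all(docs[d][field] == value
                   for field, value in selections.items())}

def single_refinement(field, value):
    """Nuggetize-style: the initial query plus one selected refinement."""
    return {d for d in query_hits if docs[d][field] == value}

print(faceted({"topic": "search", "year": "2010"}))  # {1}
print(single_refinement("year", "2010"))             # {1, 3}
```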

The more interesting features are the nuggets and the support for relevance feedback.

The nuggets are full sentences, and thus feel quite different from conventional search-engine snippets. Conventional snippets serve primarily to provide information scent, helping users quickly determine the utility of a search result without the cost of clicking through to it and reading it. In contrast, the nuggets are document fragments that are sufficiently self-contained to communicate a coherent thought. The experience suggests passage retrieval rather than document retrieval.
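
A crude way to see the difference: a nugget extractor operates on whole sentences rather than character windows around keyword hits. Here is a minimal sketch of sentence-level selection by query-term overlap; the scoring is a stand-in of my own, not Dhiti’s actual algorithm:

```python
import re

def extract_nuggets(document, query, k=3):
    """Return the k whole sentences that best match the query terms.

    Unlike snippet generation, which clips a window of text around
    keyword hits, this selects self-contained sentences, in the
    spirit of passage retrieval.
    """
    terms = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", document)
    scored = [(len(terms & set(s.lower().split())), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:k] if score > 0]
```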

The relevance feedback is explicit: users can give results a thumbs-up or a thumbs-down. After supplying feedback, users can refresh their results (which re-ranks them) and are also presented with suggested categories to use for feedback (both positive and negative). Unfortunately, the research on relevance feedback tells us that, helpful as it could be in improving the user experience, users don’t bite. But perhaps users in research scenarios will give it a chance–especially with the added expressiveness and transparency of combining document and category feedback.
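
For context, the classic formalization of this kind of feedback is Rocchio’s algorithm: move the query vector toward the thumbs-up documents and away from the thumbs-down ones, then re-rank. A minimal sketch with standard parameter values; I have no idea whether Nuggetize does anything like this under the hood:

```python
import numpy as np

def rocchio(query_vec, liked, disliked, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: pull the query toward liked document vectors and
    push it away from disliked ones (all vectors are term-weight arrays
    of the same length)."""
    q = alpha * query_vec
    if liked:
        q = q + beta * np.mean(liked, axis=0)
    if disliked:
        q = q - gamma * np.mean(disliked, axis=0)
    return np.clip(q, 0, None)  # common variant: drop negative weights

def rerank(doc_vecs, q):
    """Order documents by cosine similarity to the updated query vector."""
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)
```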

Overall it is a slick interface, and it’s nice seeing the various ideas Mohan and his colleagues put together. There’s certainly room for improvement–particularly in the quality of the categories, which sometimes feel like victims of polysemy. Open-domain information extraction is hard! Some would even call it a grand challenge.

Mohan reads this blog (he reached out to me a few months ago via a comment), and I’m sure he’d be happy to answer questions here.

Taking Blekko out for a Spin

http://player.ooyala.com/player.js?embedCode=90cmtrMTom9vae2YoUwJrngW3UCgI2Zu&deepLinkEmbedCode=90cmtrMTom9vae2YoUwJrngW3UCgI2Zu

If you’re a search engine junkie like me, you’ve probably heard about Blekko, a search engine that has been percolating for over two years and recently launched a private beta. If not, I encourage you to watch the TechCrunch video I’ve embedded above. You can join the beta by following them on Twitter. I did that earlier this week, and my invitation arrived via a direct message the next day.

Blekko’s main differentiating feature is that it supports “slashtags”. These aren’t the same as the Twitter microsyntax proposed by Chris Messina and named by Chris Blow. Rather, they are a way for users to “spin” their search results using a variety of filters. For example, [climate /liberal] and [climate /conservative] return very different results, because they are restricted to different sets of sites.

In addition to providing a set of curated slashtags, Blekko allows users to define their own slashtags by specifying the sets of sites to be included. There’s a social aspect here too: you can use (and follow) other users’ slashtags. Blekko also has some special slashtags that don’t act as site filters, e.g., /date shows recent results and /seo offers indexing information about web sites.
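
Mechanically, a site-restricting slashtag behaves like a named allowlist of hosts applied to the result list. A toy sketch, with slashtag definitions invented for illustration (Blekko’s curated lists are of course far more extensive):

```python
from urllib.parse import urlparse

# Invented slashtag definitions: a tag name mapped to a set of hosts.
SLASHTAGS = {
    "liberal":      {"thenation.com", "motherjones.com"},
    "conservative": {"nationalreview.com", "weeklystandard.com"},
}

def apply_slashtag(result_urls, tag):
    """Keep only results whose host belongs to the slashtag's site list."""
    allowed = SLASHTAGS[tag]
    def matches(url):
        host = urlparse(url).hostname or ""
        return host in allowed or any(host.endswith("." + h) for h in allowed)
    return [url for url in result_urls if matches(url)]

hits = ["http://www.thenation.com/climate", "http://example.com/climate"]
print(apply_slashtag(hits, "liberal"))  # keeps only the thenation.com result
```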

Blekko emphasizes two characteristics that I find very appealing: transparency and user control. While they do not disclose their relevance ranking algorithm, they do expose some of the information they use to compute it. More significantly, their emphasis on slashtags de-emphasizes default ranking and instead encourages users to take more responsibility in the information-seeking process. Very HCIR!

I like the concept. But I’m not sure how I feel about the execution. I have three main concerns.

First, the set of slashtags is somewhat haphazard–to be expected in a beta, but I’m not sure how it will evolve. I’d love to see a vocabulary collectively (and transparently) curated like Wikipedia, but I fear it will look more like social tagging site Delicious, which is a case study in the “vocabulary problem”. As any information scientist can tell you, managing vocabularies is hard!

Second, I’m not sure if site filters are the right model. What happens to sites with heterogeneous content? Or to sites that have one-hit wonders and therefore are unlikely to show up in any slashtags? I’d prefer to see the sites used as seeds to train classifiers that could then be applied to the entire index. Something a bit more like what Miles Efron implemented in this research–only on a much larger scale and applied at the page rather than the site level.
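
Concretely, what I have in mind is something like this: treat pages crawled from a slashtag’s member sites as positive training examples, fit a text classifier, and then score every page in the index individually. A minimal sketch using scikit-learn, assuming page text is available for the seed sites and for a background sample:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_slashtag_classifier(seed_pages, background_pages):
    """Fit a page-level classifier from seed-site pages (positives)
    and a background sample of other pages (negatives)."""
    texts = seed_pages + background_pages
    labels = [1] * len(seed_pages) + [0] * len(background_pages)
    model = make_pipeline(TfidfVectorizer(max_features=50000),
                          LogisticRegression(max_iter=1000))
    return model.fit(texts, labels)

# Applied at the page (not site) level across the whole index:
# model.predict_proba([page_text])[0][1] estimates how well a page
# fits the slashtag, whatever site it comes from.
```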

Third, there’s another ingredient essential to complement transparency and user control: guidance. As a user, I need to know which slashtags would lead me to interesting results, and ideally I’d want some kind of preview to make exploration as low-cost as possible.

I know I’m asking for a lot–especially from an ambitious startup that has just launched its private beta. But I think the stakes are high in this space, and going easy on a newcomer is no favor. I offer the tough love of a critic who would really like to see this kind of vision succeed.

HCIR 2010 Accepted Papers

The 4th Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2010) is coming up on August 22 in New Brunswick, NJ, taking place immediately after the Information Interaction in Context symposium (IIiX 2010). That’s just a few weeks away!

If you are interested in attending and haven’t already registered, please let me know as soon as possible via email or Twitter (speaking of which, follow the #hcir2010 hashtag). We’re making the remaining slots available to the community on a first-come, first-served basis.

Google user experience researcher Dan Russell will be delivering this year’s keynote on “Why is search sometimes easy and sometimes hard? Understanding serendipity and expertise in the mind of the searcher”.

Here is the list of accepted papers:

Oral Presentations

  • VISTO for Web Information Gathering and Organization
    Anwar Alhenshiri, Carolyn Watters, and Michael Shepherd (Dalhousie University)
  • Time-based Exploration of News Archives
    Omar Alonso (Microsoft Corporation), Klaus Berberich, Srikanta Bedathur, and Gerhard Weikum (Max-Planck Institute for Informatics)
  • Combining Computational Analyses and Interactive Visualization to Enhance Information Retrieval
    Carsten Goerg, Jaeyeon Kihm, Jaegul Choo, Zhicheng Liu, Sivasailam Muthiah, Haesun Park, and John Stasko (Georgia Institute of Technology)
  • Impact of Retrieval Precision on Perceived Difficulty and Other User Measures
    Mark Smucker and Chandra Prakash Jethani (University of Waterloo)
  • Exploratory Searching As Conceptual Exploration
    Pertti Vakkari (University of Tampere)
  • Casual-leisure Searching: The Exploratory Search Scenarios that Break our Current Models
    Max L. Wilson (Swansea University) and David Elsweiler (University of Erlangen)

HCIR Challenge Reports

  • Search for Journalists: New York Times Challenge Report
    Corrado Boscarino, Arjen P. de Vries, and Wouter Alink (Centrum Wiskunde & Informatica)
  • Exploring the New York Times Corpus with NewsClub
    Christian Kohlschütter (Leibniz Universität Hannover)
  • Searching Through Time in the New York Times
    Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika, and Hugo Zaragoza (Yahoo! Labs)
  • News Sync: Three Reasons to Visualize News Better
    V.G. Vinod Vydiswaran (University of Illinois), Jeroen van den Eijkhof (University of Washington), Raman Chandrasekar (Microsoft Research), Ann Paradiso (Microsoft Research), and Jim St. George (Microsoft Research)
  • Custom Dimensions for Text Corpus Navigation
    Vladimir Zelevinsky (Endeca Technologies)
  • A Retrieval System Based on Sentiment Analysis
    Wei Zheng and Hui Fang (University of Delaware)

Research Posters

  • Improving Web Search for Information Gathering: Visualization in Effect
    Anwar Alhenshiri, Carolyn Watters, and Michael Shepherd (Dalhousie University)
  • User-oriented and Eye-Tracking-based Evaluation of an Interactive Search System
    Thomas Beckers and Norbert Fuhr (University of Duisburg-Essen)
  • Exploring Combinations of Sources for Interaction Features for Document Re-ranking
    Emanuele Di Buccio (University of Padua), Massimo Melucci (University of Padua), and Dawei Song (The Robert Gordon University)
  • Extracting Expertise to Facilitate Exploratory Search and Information Discovery: Combining Information Retrieval Techniques with a Computational Cognitive Model
    Wai-Tat Fu and Wei Dong (University of Illinois at Urbana-Champaign)
  • An Architecture for Real-time Textual Query Term Extraction from Images
    Cathal Hoare and Humphrey Sorensen (University College Cork)
  • Transaction Log Analysis of User Actions in a Faceted Library Catalog Interface
    Bill Kules (The Catholic University of America), Robert Capra (University of North Carolina at Chapel Hill), and Joseph Ryan (North Carolina State University Libraries)
  • Context in Health Information Retrieval: What and Where
    Carla Lopes and Cristina Ribeiro (University of Porto)
  • Tactics for Information Search in a Public and an Academic Library Catalog with Faceted Interfaces
    Xi Niu and Bradley M. Hemminger (University of North Carolina at Chapel Hill)

Position Papers

  • Understanding Information Seeking in the Patent Domain and its Impact on the Interface Design of IR Systems
    Daniela Becks, Matthias Görtz, and Christa Womser-Hacker (University of Hildesheim)
  • Better Search Applications Through Domain Specific Context Descriptions
    Corrado Boscarino, Arjen P. de Vries, and Jacco van Ossenbruggen (Centrum Wiskunde & Informatica)
  • Layered, Adaptive Results: Interaction Concepts for Large, Heterogeneous Data Sets
    Duane Degler (Design for Context)
  • Revisiting Exploratory Search from the HCI Perspective
    Abdigani Diriye (University College London), Max L. Wilson (Swansea University), Ann Blandford (University College London), and Anastasios Tombros (Queen Mary, University of London)
  • Supporting Task with Information Appliances: Taxonomy of Needs
    Sarah Gilbert, Lori McCay-Peet, and Elaine Toms (Dalhousie University)
  • A Proposal for Measuring and Implementing Group’s Affective Relevance in Collaborative Information Seeking
    Roberto González-Ibáñez and Chirag Shah (Rutgers University)
  • Evaluation of Music Information Retrieval: Towards a User-Centered Approach
    Xiao Hu (University of Illinois at Urbana-Champaign) and Jingjing Liu (Rutgers University)
  • Information Derivatives: A New Way to Examine Information Propagation
    Chirag Shah (Rutgers University)
  • Implicit Factors in Networked Information Feeds
    Fred Stutzman (University of North Carolina at Chapel Hill)
  • Improving the Online News Experience
    V. G. Vinod Vydiswaran (University of Illinois) and Raman Chandrasekar (Microsoft Research)
  • Breaking Down the Assumptions of Faceted Search
    Vladimir Zelevinsky (Endeca Technologies)
  • A Survey of User Interfaces in Content-based Image Search Engines on the Web
    Danyang Zhang (The City University of New York)

You can also download the full proceedings here.

Overcoming Spammers in Twitter

http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=ceri2010-gayobrenes-imagenes-100615061415-phpapp02&stripped_title=overcoming-spammers-in-twitter-a-tale-of-five-algorithms

As I blogged a few months ago, University of Oviedo professor Daniel Gayo-Avello published a research paper entitled “Nepotistic Relationships in Twitter and their Impact on Rank Prestige Algorithms”, in which he concluded that TunkRank was the best of the measures he studied for ranking Twitter users. I recently discovered that he and David Brenes posted slides from their presentation at CERI 2010 on “Overcoming Spammers in Twitter”. Enjoy!
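
As a refresher, TunkRank computes a user’s influence recursively from that user’s followers: influence(X) is the sum over followers F of (1 + p · influence(F)) / |friends(F)|, where p is the probability that a follower acts on (e.g., retweets) a tweet. Here is a minimal fixed-point sketch of my own, with p = 0.05 as a conventional choice:

```python
def tunkrank(followers, friends_count, p=0.05, iterations=50):
    """Fixed-point iteration of TunkRank.

    followers: user -> set of users who follow that user
    friends_count: user -> number of accounts that user follows
    p: assumed probability that a follower acts on a tweet
    """
    influence = {user: 0.0 for user in followers}
    for _ in range(iterations):
        influence = {
            user: sum((1 + p * influence.get(f, 0.0))
                      / max(friends_count[f], 1)
                      for f in fols)
            for user, fols in followers.items()
        }
    return influence

# Toy graph: alice follows bob and carol; bob follows carol.
followers = {"alice": set(), "bob": {"alice"}, "carol": {"alice", "bob"}}
friends_count = {"alice": 2, "bob": 1, "carol": 0}
print(tunkrank(followers, friends_count))
```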

Questions. But Why?

Yahoo! Answers and Answers.com have been around since 2005. But community question answering (as distinct from question answering using natural language processing) has witnessed a resurgence of popularity–at least in the blogosphere and among investors. Quora and Hunch are two of the hottest startups on the web, and Aardvark was acquired by Google earlier this year. Most recently, Ask.com relaunched with a return to its question-answering roots and Facebook began rolling out Facebook Questions.

So there’s no question that community question answering is hot. The question is why? In particular, is community question answering a step forward or backward relative to today’s search engines, or is it something different?

Regarding Facebook Questions, Jason Kincaid writes in TechCrunch:

Given its size, it won’t take long for Facebook to build up a massive amount of data — if that data is consistently reliable, Questions could turn into a viable alternative to Google for many queries.

That’s a big if. But I think the bigger caveat is the vague quantifier “many”. The success of community question answering services will depend on how these services position themselves relative to users’ information needs. Anyone arguing that these services can or should replace today’s web search engines might want to consider the following examples of information needs that are typical of current search engine use:

I hope I don’t have to keep going to convince you that web search engines have earned their popularity by serving a broad class of information needs (i.e., answering lots of questions)–and that’s without even using the wide variety of personalized and social features that web search engines are rapidly developing.

The common thread in the above questions is that they focus on objective information. In general, such questions are effectively and efficiently answered by search engines based on indexed, published content (including “deep web” content made available to search engines via APIs). There’s a lot of work we can do to improve search engines, particularly in the area of supporting query formulation. But it seems silly and wasteful to route such questions to other people–human beings should not be reduced to performing tasks at which machines excel.

That said, I agree with Kincaid that there are many information needs that are well addressed by community question answering. In particular:

  • Questions for which point of view is a feature, not a bug. Review sites succeed when they provide sincere, informed personal reactions to products and services. Similarly, routing questions to people makes sense when we care about the answerer’s point of view. For some questions, I want the opinion of someone who shares my taste (which is what Hunch is pursuing with its “taste graph”). For others, I want a diversity of expert opinions–for which I might turn to Aardvark (which tries to route questions to topic experts), Quora (where people follow particular topics), or LinkedIn Answers. Over time, the answers to many such questions can be published and indexed–and indeed some answers sites receive a large share of their traffic from search engines.
  • Niche topics. As much as web search has improved information accessibility for the “long tail” of published information, the effectiveness of web search can be highly variable for the most obscure information needs. Moreover, this effectiveness depends significantly on the user: some people are better at searching than others, especially in their areas of domain expertise. Social search can help level the playing field. Much as Wikipedia has surfaced much of the expertise at the head of the information distribution, community question answering can help out in the tail.
  • Community for its own sake. Even in cases where search engines are more effective and efficient than community question answering services, some people prefer to participate in a social exchange rather than to conduct a transaction with an impersonal algorithm. Indeed, researchers at Aardvark found that many of the questions posed through their service (pre-acquisition) could be answered successfully using Google. I’ll go out on a limb and assume that Aardvark’s users were early technology adopters who are quite conversant with search engines–but in some cases chose to use a social alternative simply because they wanted to be social.

Conclusions? Community question answering may be overhyped right now, but it isn’t a fad. There are broad classes of subjective information needs that require a point of view, if not a diversity of views. And even if much of the use of community question answering sites is mediated by search engines indexing their archives, there will always be a need for fresh content. I also believe that social search will continue to be valuable for niche topics, since neither search engines nor searchers will ever be perfect.

But I think the biggest open question is whether people will favor community question answering simply to be social. I conjecture that, by very publicly integrating community question answering into its social networking platform, Facebook is testing the hypothesis that it can turn information seeking from a utilitarian individual task into an entertaining social destination. Given Facebook’s highly engaged user population, we won’t have to wait long to find out.