Human-Computer Information Retrieval in Layman’s Terms

One of the great benefits of practicing what Daniel Lemire calls open scholarship is that I have many opportunities to see how ideas translate across the research / practice divide. In particular, I obtain invaluable feedback on the accuracy and effectiveness of that translation process.

A few days ago, I was exchanging email with serial entrepreneur Chris Dixon about human-computer information retrieval (HCIR). He’d just looked through the accepted submissions list for HCIR 2009 and said, if I may paraphrase: this is great stuff, but it needs to be better communicated for broader consumption. I quickly shot back a reaction that I’ll excerpt here (when in doubt, make it public!):

At some level it’s blindingly obvious: to err is human, to really screw up takes a computer. The HealthBase fiasco isn’t a shocker: lots of people are skeptical of pure AI approaches.

What people don’t get is that you can work to optimize the division of labor. I’m evangelizing it in places like Technology Review–a bit more mainstream than my blog. But ultimately the message has to resonate with entrepreneurs and investors who will make that vision a reality. Endeca is all about HCIR. Bing is a step in the right direction for the open web. But there’s a long way to go.

His response: that’s a lot more consumable than any other description of HCIR he’d seen to date (and he’s a regular reader here!). Having just finished reading Steve Blank’s Four Steps to the Epiphany, I appreciate his point: in a new market, the most critical priority is educating the potential customers.

As a number of us prepare for the HCIR 2009 workshop, that’s something to keep in mind. There’s a natural tension between rigorous scholarship and mass communication, but some of the greatest scholars (e.g., Richard Feynman and Linus Pauling) have shown the way for us mere mortals. Indeed, in a field as cross-disciplinary as HCIR, we would do well to make our work and vision as broadly consumable as possible, albeit without oversimplifying it to the point that it is vapid or even misleading.

Generally speaking, I blog in order to convince people that some of the esoteric ideas I encounter–and the occasional ideas I am fortunate enough to conceive–are worthy of broader consideration. I started blogging in order to bring greater visibility to HCIR–to convince people that the choice between human and machine responsibility is a false dichotomy in almost every aspect of the information seeking process.

In grade school, I learned that division of labor is the cornerstone of civilization–perhaps our adaptive process of allocating effort is our greatest achievement as a species. As machines play an increasingly important role in our lives–and serve as the lenses through which we seek and consume almost all information–it is key that we not forget our roots. Let us be neither Luddites nor passive participants, but rather let us help computers help us.

Information Retrievability

Last year, I wrote a post about Leif Azzopardi and Vishwa Vinay's work on information accessibility:

Instead of an actual physical space, in IR, we are predominately concerned with accessing information within a collection of documents (i.e., information space), and instead of a transportation system, we have an Information Access System (i.e., a means by which we can access the information in the collection, like a query mechanism, a browsing mechanism, etc). The accessibility of a document is indicative of the likelihood or opportunity of it being retrieved by the user in this information space given such a mechanism.

After reading a pre-print of my HCIR 2009 position paper about the information availability problem, Vinay pointed me at follow-up work he’d done with Leif on information retrievability. I agree with his observation that, while I look at information availability from a user-centric perspective, they consider retrievability from a document- or system-centric perspective. The approaches are complementary, and both add to a growing body of work that advocates a holistic model of how users access information, rather than a narrow focus on reductionist measures like precision and recall at the level of individual queries.
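
Their retrievability measure is straightforward to prototype. Below is a minimal sketch in the spirit of their cumulative formulation; the `search` function, the query sample, and the rank cutoff are all stand-ins for whatever retrieval system and query population is under study:

```python
from collections import defaultdict

def retrievability(queries, search, cutoff=10):
    """Cumulative retrievability, in the spirit of Azzopardi and Vinay:
    a document earns a unit of credit each time it appears in the top
    `cutoff` results for some query. `search(query)` is assumed to
    return a ranked list of document ids."""
    scores = defaultdict(int)
    for query in queries:
        for rank, doc_id in enumerate(search(query), start=1):
            if rank > cutoff:
                break
            scores[doc_id] += 1
    return scores
```

Summarizing the resulting scores with an inequality measure like the Gini coefficient then shows how unevenly a given system distributes access across the collection, which is the system-centric complement to asking whether a particular user can find a particular document.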

To be clear, those reductionist measures still have their place. In fact, I’m looking forward to NIST's Ellen Voorhees defending Cranfield next month to an HCIR crowd that is, for the most part, deeply suspicious of it.

Free Chapter on Faceted Search User Interface Design

If you are interested in user interface design for faceted search–and I know that’s a hot topic for many Noisy Channel readers–then be sure to check out this free book chapter by Moritz Stefaner, Sébastien Ferré, Saverio Perugini, Jonathan Koren, and Yi Zhang.

By the way, a chapter of my own book on faceted search is also available for free online, as is Marti Hearst's entire book on search user interfaces.

HCIR 2009 Accepted Submissions

The agenda for HCIR 2009 is now online! As previously announced, Ben Shneiderman from the University of Maryland will be the keynote speaker. The accepted submissions are as follows:

Panel Presentations

  • Usefulness as the Criterion for Evaluation of Interactive Information Retrieval
    Michael Cole, Jingjing Liu, Nicholas Belkin, Ralf Bierig, Jacek Gwizdka, Chang Liu, Jun Zhang and Xiangmin Zhang (Rutgers University)
  • Modeling Searcher Frustration
    Henry Feild and James Allan (University of Massachusetts Amherst)
  • Query Suggestions as Idea Tactics for Information Search
    Diane Kelly (University of North Carolina at Chapel Hill)
  • I Come Not to Bury Cranfield, but to Praise It
    Ellen Voorhees (National Institute of Standards and Technology)
  • Search Tasks and Their Role in Studies of Search Behaviors
    Barbara Wildemuth (University of North Carolina at Chapel Hill) and Luanne Freund (University of British Columbia)

Posters and Demonstrations

  • Visual Interaction for Personalized Information Retrieval
    Jae-wook Ahn and Peter Brusilovsky (University of Pittsburgh)
  • PuppyIR: Designing an Open Source Framework for Interactive Information Services for Children
    Leif Azzopardi (University of Glasgow), Richard Glassey (University of Glasgow), Mounia Lalmas (University of Glasgow), Tamara Polajnar (University of Glasgow) and Ian Ruthven (University of Strathclyde)
  • Designing an Interactive Automatic Document Classification System
    Kirk Baker (Collexis)
  • The HCI Browser Tool for Studying Web Search Behavior
    Robert Capra (University of North Carolina at Chapel Hill)
  • A Graphic User Interface for Content and Structure Queries in XML Retrieval
    Juan M. Fernández-Luna, Luis M. de Campos, Juan F. Huete and Carlos J. Martin-Dancausa (University of Granada)
  • Improving Search-Driven Development with Collaborative Information Retrieval Techniques
    Juan M. Fernández-Luna (University of Granada), Juan F. Huete (University of Granada), Ramiro Pérez-Vázquez (Universidad Central de Las Villas) and Julio C. Rodríguez-Cano (Universidad de Holguín)
  • A visualization interface for interactive search refinement
    Fernando Figueira Filho (State University of Campinas), João Porto de Albuquerque (University of Sao Paulo), André Resende (State University of Campinas), Paulo Lício de Geus (State University of Campinas) and Gary Olson (University of California, Irvine)
  • Cognitive Dimensions Analysis of Interfaces for Information Seeking
    Gene Golovchinsky (FX Palo Alto Laboratory, Inc.)
  • Cognitive Load and Web Search Tasks
    Jacek Gwizdka (Rutgers University)
  • Visualising Digital Video Libraries for TV Broadcasting Industry: A User-Centred Approach
    Mieke Haesen, Jan Meskens and Karin Coninx (Hasselt University)
  • Log Based Analysis of How Faceted and Text Based Searching Interact in a Library Catalog Interface
    Bradley Hemminger (University of North Carolina), Xi Niu (University of North Carolina) and Cory Lown (NC State Libraries)
  • Freebase Cubed: Text-based Collection Queries for Large, Richly Interconnected Data Sets
    David Huynh (Metaweb Technologies, Inc.)
  • System Controlled Assistance for Improving Search Performance
    Bernard Jansen (Pennsylvania State University)
  • Designing for Enterprise Search in a Global Organization
    Maria Johansson and Lina Westerling (Findwise AB)
  • Cultural Differences in Information Behavior
    Anita Komlodi (University of Maryland Baltimore County) and Karoly Hercegfi (Budapest University of Technology and Economics)
  • Adapting an Information Visualization Tool for Mobile Information Retrieval
    Sherry Koshman and Jae-wook Ahn (University of Pittsburgh)
  • A Theoretical Framework for Subjective Relevance
    Katrina Muller and Diane Kelly (University of North Carolina)
  • Query Reuse in Exploratory Search Tasks
    Chirag Shah and Gary Marchionini (University of North Carolina at Chapel Hill)
  • Augmenting Cranfield-Style Evaluation with GOMS to Obtain Timed Predictions of User Performance
    Mark Smucker (University of Waterloo)
  • Text-To-Query: Suggesting Structured Analytics to Illustrate Textual Content
    Raphael Thollot (SAP Business Objects) and Marie-Aude Aufaure (Ecole Centrale Paris)
  • The Information Availability Problem
    Daniel Tunkelang (Endeca)
  • Exploratory Search Over Temporal Event Sequences: Novel Requirements, Operations, and a Process Model
    Taowei Wang, Krist Wongsuphasawat, Catherine Plaisant and Ben Shneiderman (University of Maryland)
  • Keyword Search: Quite Exploratory Actually
    Max Wilson (Swansea University)
  • Using Twitter to Assess Information Needs: Early Results
    Max Wilson (Swansea University)
  • Integrating User-generated Content Description to Search Interface Design
    Kyunghye Yoon (SUNY Oswego)
  • Ambiguity and Context-Aware Query Reformulation
    Hui Zhang (Indiana University)

Transparent Text Symposium: Day 2

Given how intense yesterday was at the Transparent Text symposium, I couldn’t imagine that today would match it. But it did!

The morning kicked off with a series of 18 lightning talks in 90 minutes–that was 5 minutes apiece, with a ruthless gong for anyone who went overtime. The presentations were consistently intense, and I had the misfortune to follow one of the best talks–a very passionate presentation about crowd-sourced translation by IBM’s Uyi Stewart. Other notable presenters included design ninja Alexis Lloyd from the New York Times R&D Lab, Karrie Karahalios from the University of Illinois talking about the experimental WeMeddle Twitter client, MIT Media Lab professor and Berkman Fellow Judith Donath showing a stunning gallery of “data portraits”, and Dragon Systems co-founder Janet Baker explaining how the brain recognizes speech–with a skull as a prop! The session was incredible, and I hope other conferences adopt this model.

After the coffee break, there was a session on Text Analysis in the Large, featuring Dan Gruhl (IBM), Gary King (Harvard), and David Ferrucci (IBM). Dan Gruhl talked about web-scale text analysis–a topic right up his alley, considering his role in architecting the IBM WebFountain project. Gary King gave a fascinating talk about using ensemble methods to improve on existing clustering methods–the idea is to synthesize a collection of derived clusterings and place them in an explorable metric space. You can read the full paper here. But the winner for this session was definitely David Ferrucci, who described the work IBM Research is doing to develop a machine Jeopardy player. He spent much of the talk building a case for the difficulty of the problem–and then delivered the punchline: in less than three years of research, they’ve developed a machine player whose performance is comparable to that of Jeopardy winners. Hopefully they’ll be competing on live television by next year!
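
Since King’s paper is worth reading in full, I’ll offer only a toy illustration of the metric-space idea (my own sketch, not code from the paper): variation of information is one standard distance between two clusterings of the same items, and a matrix of such pairwise distances can then be embedded in two dimensions, e.g. with multidimensional scaling, to yield an explorable map of clusterings.

```python
import math
from collections import Counter

def variation_of_information(labels_a, labels_b):
    """Meila's variation of information: a metric on clusterings of the
    same n items, where labels_a[i] and labels_b[i] are the cluster ids
    that the two clusterings assign to item i."""
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        # VI = H(A|B) + H(B|A), accumulated cell by cell over the joint counts
        vi -= (n_ab / n) * (math.log(n_ab / counts_b[b]) + math.log(n_ab / counts_a[a]))
    return vi

# Identical partitions up to relabeling are at distance zero.
print(variation_of_information([0, 0, 1, 1, 2], [5, 5, 7, 7, 9]))  # 0.0
```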

After lunch, there was a session on Investigation, featuring MAPLight Research Director Emily Calhoun, UC Berkeley law professor Kevin Quinn, and Guardian news editor Simon Rogers. Emily Calhoun showed how MAPLight illuminates the connections between money and politics–it was great to see data correlating support for and opposition to bills with the associated campaign contributions from interest groups. Kevin Quinn’s presentation was a bit more technical; his work reminds me a lot of Miles Efron’s work on estimating political orientation in web documents, but Quinn’s approach is more general, going beyond co-citation analysis to analyze the actual language of the documents. Great application of topic modeling! But my favorite presentation in this session was the one from Simon Rogers: he told the story of how the Guardian successfully crowd-sourced a project to investigate the expenses of UK Parliament members.

The final session was a panel discussion about how visualization might elevate or advance the debate over health care policy. The panelists were Ben Fry, Marti Hearst, Gary King, and Simon Rogers; Fernanda Viégas and Martin Wattenberg moderated. Unfortunately, the overwhelming sentiment from the panel was pessimism that anything we could do might actually lead to improved outcomes. Nonetheless, it’s clear that a lot of people are going to try.

Again, I want to thank Fernanda, Martin, Irene Greif, and everyone at IBM for organizing this fantastic event–and for inviting me to attend! I am amazed that anyone could assemble such an impressive set of speakers in one place, and I appreciate the effort that everyone put into making the past two days so worthwhile. I look forward to seeing the videos available online, and I hope those who weren’t able to attend take the opportunity to watch some of them. I also encourage you to check out the live Twitter stream at #tt09 while it’s still available.

Transparent Text Symposium: Day 1

Wow, what an intense day at the Transparent Text symposium! I won’t try to give detailed summaries of the talks–videos will be posted after the conference, and you can get a pretty good picture from the live tweet stream at #tt09. Instead, I’ll try to capture my personal highlights and reactions.

I’ll start with Deputy U.S. CTO Beth Noveck's keynote about the Open Government Initiative. First, the very existence of such an initiative is incredible, given the culture of secrecy traditionally associated with Washington. Second, I like the top priority of releasing raw data so that other people can work on analyzing it, visualizing it, and generally making it more accessible either to the general public or to particular interest groups. This is very much what I had in mind in January when I posted “Information Sharing We Can Believe In”, and I’m glad to see tangible progress. I was never a big fan of faith-based initiatives. 🙂

The next session was a group of talks about watchdogs and accountability–people looking at how to ensure government transparency from the outside. New York Times editor Aron Pilhofer and software developer Jeremy Ashkenas talked about DocumentCloud, an ambitious project to enable exploratory search for news documents on the open web. Sunlight Foundation co-founder and executive director Ellen Miller offered a particularly compelling example of the power of visualization: a graph correlating the campaign contributions and earmarks associated with a congressman under investigation. But my favorite presenter in this session was ProPublica's Amanda Michel, whose thoughts about a “human test of transparency” are worth a talk in themselves. For now, I recommend you look at the two projects she discussed: Stimulus Spot Check and Off the Bus.

After lunch, we shifted gears from government transparency to more of a focus on text. The first of the two afternoon sessions was entitled “Analyzing the Written Record” and featured Matthew Gray from Google Books, Tom Tague from Open Calais (a free text annotation service that almost all of the previous speakers raved about), and Ethan Zuckerman from Harvard’s Berkman Center. All of the talks were solid, but Ethan’s was outstanding. I blogged about his Media Cloud project back in March, but it’s come a long way in the past six months and is doing something I’ve been waiting years to see someone do: comparing how different news organizations select and cover news.

The final session was about visualization. David Small offered a presentation about literally transparent text that was, in the words of Marian Dörk, “refreshingly non-utilitarian and visually stimulating”. Ben Fry showed the power of visualizing changes in a document over time–specifically, a project called “the preservation of favoured traces” that illustrates the evolution of Darwin’s On the Origin of Species. But, as expected, IBM’s Many Eyes researchers Fernanda Viégas and Martin Wattenberg stole the show with an incredibly informative and entertaining presentation about the visualization of repetition in text. No summary can do it justice, so I urge you to watch the video when it is available.

After all that, we enjoyed a nice reception at the IBM Center for Social Software. I’m incredibly grateful to IBM for organizing and sponsoring this event, and to Martin Wattenberg for being so kind as to invite me. I’ll try to earn my keep in my 5 minutes at the “Ignite-style” session tomorrow morning.

T2: Judgment Day for Twine?


Nova Spivack, CEO and founder of Radar Networks, just released a preview announcing Twine 2.0, a semantic search engine to be released later this year. As Erick Schonfeld points out on TechCrunch, Twine hasn’t managed to attract broad adoption. I tried it briefly when it came out, and I have to confess that I never understood it.

But I can certainly see the appeal of delivering faceted search for the web to support exploratory information seeking. It’s the dream that’s been driving Bing and Freebase, not to mention smaller efforts like Kosmix. It’s hard, to be sure. But, as Sarah Lacy tells us, startups are supposed to be changing the world–and established companies can play too.

The demo video is appealing, but I’ll believe it when I can off-road on it–and on more than just recipes and restaurants, two highly structured domains that are already well covered by sites like Food Network and Yelp. Twine doesn’t necessarily have to cover all domains to be useful–perhaps a “short snout” approach like Bing’s will be good enough to drive adoption.

In any case, I’m impressed with Twine’s ambition. But ambition isn’t enough–especially given the increasing number of people and companies who share it. If Nova really wants to build a “World Wide Database”, then he’ll have to do more than swing for the fences and miss. I’ll be waiting for a beta invite, and I’ll let you know what I find out.

Transparent Text Symposium

One of the unexpected benefits of accepting an invitation to speak at SIGMOD 2009 was an invitation from fellow participant Martin Wattenberg to attend the upcoming Transparent Text symposium at the IBM Center for Social Software:

The Transparent Text symposium is a free event that will focus on ways to make large collections of documents understandable to laypeople and experts alike. We are interested in approaches that shed light on unstructured text, ranging from novel statistical techniques to web-based crowdsourcing.

The speaker list is impressive, ranging from familiar (at least to me) interface experts Ben Fry and Marti Hearst to social scientist Gary King and Sunlight Foundation Executive Director Ellen Miller. IBM also contributed some of its own researchers to the program, including David Ferrucci, who has been leading the Jeopardy project. There’s even an “Ignite-style” session where all attendees will have the opportunity to give five-minute presentations.

I’m looking forward to the eclectic mix of speakers and attendees. As Chris Dixon recently reminded us, it’s important to introduce some randomization into our intellectual diets so that we don’t get stuck in a rut of local optimization. While an event with a theme of transparency and interacting with textual information is hardly a detour for me, I am excited about the opportunity to hear a diversity of new perspectives on this topic. There will be videos of the speakers posted after the event, as well as a Twitter stream at #tt09.

Of course, I’ll blog about what I learn and recycle it in the discussion activities at the HCIR workshop next month.

Udorse: Give Product Placement a Chance

Those of you who don’t live and breathe the software startup scene might not realize that a substantial fraction of Silicon Valley is following TechCrunch50, an annual competition hosted by TechCrunch. As if it weren’t enough to have A-list judges like Marissa Mayer and Paul Graham, there’s even the fortuitous timing of Intuit acquiring 2007 TC50 winner Mint for a respectable $170M.

Here in New York, I have to confess that I haven’t had my eyes glued to the proceedings. But I have been looking at some of the entries, and one that stands out as distinctive is Udorse (and no, I’m not just biased because they’re local). Their premise is simple: democratize product placement through “visual endorsement”. Everyone who shares photos can embed a “udorsement” and can either pocket the advertising revenue or donate it to charity. More details from TechCrunch (naturally) and VentureBeat.

Perhaps your reaction is like mine, uncertain whether to be awed or horrified by this simple concept. Indeed, given my penchant for using ad blockers, you might think I’d be ideologically against product placement.

But I’m not, as long as it’s transparent–and, as far as I can tell, Udorse passes that test. In theory, this is advertising done right: content creators monetizing their own content by advertising goods and services they believe in–and putting their own credibility on the line to do so.

Of course, it might turn out very differently in practice. Any way of making money online brings out the worst in people, and I’m sure we’ll see lots of people try to game this service if it takes off. Meanwhile, people like me will probably block the “udorsements” like any other ads.

Or maybe not. I certainly don’t block emails from friends recommending the products they like, and I actually wish it were easier to benefit from their sincere opinions. If Udorse succeeds in a way that feels like word-of-mouth marketing, I’ll be thrilled. I think it’s a long shot, but I’m at least intrigued by their approach.

P.S. No, I wasn’t paid to write this post, nor do I have any stake in Udorse. I at least have to keep my record clean for the Ethics of Blogging panel next week!

Is Bing Optimizing for the Short Snout?

In a post about Bing on CNET today, Rafe Needleman comments that “it makes business sense to pour resources into popular searches. Optimizing for the short snout pays.”

First, it’s an interesting counterpoint to the conventional wisdom that search (if not the future of business as we know it) is all about the “long tail”. But second and more importantly, it’s an intriguing claim about Bing’s strategy for differentiating itself from Google.

Needleman goes on to say:

I’d wager that this is how Bing is making its gains in market share. Latest Nielsen data says Bing gained 22 percent month-over-month in August, taking it to 10.7 percent of all U.S. searches. People probably try Bing for a travel or product search (where there’s also a cash-back financial kicker) and remember their good experience, and then they try it for more obscure searches and find it good enough. It highlights, I believe, an important flaw in Google’s historic strategy of indexing the entire Web equally well and making the user interface fast and consistent above all, as opposed to specializing as dictated by the query.

While I’ve never heard this claim about Bing before, it is consistent with something I’ve noticed–and which Nick Craswell said when he talked about Bing at SIGIR 2009. In the upper left area that it calls the table of contents (TOC), Bing selectively presents a refinement interface based on the entity type it infers for the search query. For example, a search for Argentina returns options that include Argentina Map, Argentina Tourism, and Argentina Culture, while a search for Abraham Lincoln returns options that include Abraham Lincoln Speeches and Abraham Lincoln Facts.

It’s a nifty feature, even if marketers and reporters have struggled to label it. But, as Needleman says, it does indeed focus on the short snout. For example, there are no TOC options when you search for faceted search, since the technical term doesn’t match a recognized entity type. Searches for names of auto companies, such as Toyota, yield a rich set of options, while those for scooter companies like Vespa do not. Similarly, searches for celebrities receive VIP treatment, as compared to searches for ordinary people that just return a list of search results.
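
To make that behavior concrete, here is a toy sketch of short-snout refinement (my own invention, not Bing’s implementation): templates exist only for a handful of recognized entity types, so head entities get a rich set of options while tail queries fall through to plain results. The entity dictionary and templates below are hypothetical:

```python
# Hypothetical entity dictionary and refinement templates, purely for
# illustration; a real system would mine these from query logs and
# knowledge bases rather than hand-curate them.
ENTITY_TYPES = {
    "argentina": "country",
    "abraham lincoln": "historical figure",
}

REFINEMENTS = {
    "country": ["Map", "Tourism", "Culture"],
    "historical figure": ["Speeches", "Facts"],
}

def table_of_contents(query):
    """Return refinement options for recognized head entities, else nothing."""
    entity_type = ENTITY_TYPES.get(query.strip().lower())
    if entity_type is None:
        return []  # tail queries like "faceted search" get no TOC
    return ["%s %s" % (query.title(), suffix) for suffix in REFINEMENTS[entity_type]]

print(table_of_contents("Argentina"))       # ['Argentina Map', 'Argentina Tourism', 'Argentina Culture']
print(table_of_contents("faceted search"))  # []
```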

All in all, I’m inclined to agree with Needleman that Bing is focusing on the short snout–and I love that phrase to describe it. The open question is whether he’s right that users “remember their good experience, and then they try it for more obscure searches and find it good enough”. It would be great to see data to confirm or refute that hypothesis.