Author: Daniel Tunkelang

High-Class Consultant.

Thinking Outside the Black Box

Post author By Daniel Tunkelang
Post date August 16, 2008

I was reading Techmeme today, and I noticed an LA Times article about RushmoreDrive, described on its About Us page as “a first-of-its-kind search engine for the Black community.” My first reaction, blogged by others already, was that this idea was dumb and racist. In fact, it took some work to find positive commentary about RushmoreDrive.

But I’ve learned from the way the blogosphere handled the Cuil launch not to trust anyone who evaluates a search engine without having tried it, myself included. My wife and I have been the only white people at Amy Ruth’s and the service was as gracious as the chicken and waffles were delicious; I decided I’d try my luck on a search engine not targeted at my racial profile.

The search quality is solid, comparable to that of Google, Yahoo, and Microsoft. In fact, the site looks a lot like a re-skinning (no pun intended) of Ask.com, a corporate sibling of IAC-owned RushmoreDrive. Like Ask.com, RushmoreDrive emphasizes search refinement through suggestions for narrowing and broadening the query.

What I find ironic is that the whole controversy about racial bias in relevance ranking reveals the much bigger problem–that relevance ranking should not be a black box (ok, maybe this time I’ll take responsibility for the pun). I’ve been beating this drum at The Noisy Channel ever since I criticized Amit Singhal for Google’s lack of transparency. I think that sites like RushmoreDrive are inevitable if search engines refuse to cede more control of search results to users.

I don’t know how much information race provides as a prior to influence statistical ranking approaches, but I’m skeptical that the effects are useful or even noticeable beyond a few well-chosen examples. I’m more inclined to see RushmoreDrive as a marketing ploy by the folks at IAC–and perhaps a successful one. I doubt that Google is running scared, but I think this should be a wake-up call to folks who are convinced that personalized relevance ranking is the end goal of user experience for search engines.

Uncategorized

New Information Retrieval Book Available Online

Post author By Daniel Tunkelang
Post date August 15, 2008

Props to Jeff Dalton for alerting me about the new book on information retrieval by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. You can buy a hard copy, but you can also access it online for free at the book website.

General

David Huynh’s Freebase Parallax

Post author By Daniel Tunkelang
Post date August 13, 2008

One of the perks of working in HCIR is that you get to meet some of the coolest people in academic and industrial research. I met David Huynh a few years ago, while he was a graduate student at MIT, working in the Haystack group and on the Simile project. You’ve probably seen some of his work: his Timeline project has been deployed all over the web.

Despite efforts by me and others to persuade David to stay in the Northeast, he went out west a few months ago to join Metaweb, a company with ambitions “to build a better infrastructure for the Web.” While I, like others, am not persuaded by Freebase, Metaweb’s “open database of the world’s information,” I am happy to see that David is still doing great work.

I encourage you to check out David’s latest project: Freebase Parallax. In it, he does something I’ve never seen outside Endeca (excepting David’s earlier work on a Nested Faceted Browser): he allows you to navigate using the facets of multiple entity types, joining between sets of entities through their relationships. At Endeca, we call this “record relationship navigation”–we presented it at HCIR ’07, showing how it can enable social navigation.
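
To make the idea of joining between sets of entities concrete, here is a minimal sketch in Python. It is not how Parallax or Endeca implements anything; the data, the field names, and the helper functions (`refine`, `pivot`, `facet_counts`) are all invented for illustration.

```python
from collections import Counter

# Hypothetical toy data: two entity types linked by a "directed_by" relationship.
FILMS = {
    "f1": {"title": "Film A", "genre": "drama", "directed_by": "d1"},
    "f2": {"title": "Film B", "genre": "comedy", "directed_by": "d2"},
    "f3": {"title": "Film C", "genre": "drama", "directed_by": "d2"},
}
DIRECTORS = {
    "d1": {"name": "Director X", "country": "France"},
    "d2": {"name": "Director Y", "country": "Japan"},
}


def refine(records, ids, facet, value):
    """Narrow a set of entity ids to those whose facet has the given value."""
    return {i for i in ids if records[i].get(facet) == value}


def pivot(records, ids, relation):
    """Follow a relationship from every entity in the set to the related set."""
    return {records[i][relation] for i in ids if relation in records[i]}


def facet_counts(records, ids, facet):
    """Count facet values over the current set, as a facet panel would."""
    return Counter(records[i][facet] for i in ids if facet in records[i])


# Start with all films, narrow to dramas, then pivot to their directors
# and look at the facets of that second entity type.
dramas = refine(FILMS, set(FILMS), "genre", "drama")
drama_directors = pivot(FILMS, dramas, "directed_by")
countries = facet_counts(DIRECTORS, drama_directors, "country")
```

The point of the sketch is that the unit of navigation is a set rather than a single record: each step takes a set of one entity type and produces either a narrower set of the same type or a related set of another type.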

David includes a video where he eloquently demonstrates how Parallax works, and the interface is quite compelling. I’m not sure how well it scales with large data sets, but David’s focus has been on interfaces rather than systems. My biggest complaint–which isn’t David’s fault–is that the Freebase content is a bit sparse. But his interface strikes me as a great fit for exploratory search.

General

Conversation with Seth Grimes

Post author By Daniel Tunkelang
Post date August 13, 2008

I had a great conversation with Intelligent Enterprise columnist Seth Grimes today. Apparently there’s an upside to writing critical commentary on Google’s aspirations in the enterprise!

One of the challenges in talking about enterprise search is that no one seems to agree on what it is. Indeed, as I’ve been discussing with Ryan Shaw, I use the term broadly to describe information access scenarios distinct from web search where an organization has some ownership or control of the content (in contrast to the somewhat adversarial relationship that web search companies have with the content they index). But I realize that many folks define enterprise search more narrowly to be a search box hooked up to the intranet.

Perhaps a better way to think about enterprise search is as a problem rather than a solution. Many people expect a search box because they’re familiar with searching the web using Google. I don’t blame anyone for expecting that the same interface will work for enterprise information collections. Unfortunately, wishful thinking and clever advertising notwithstanding, it doesn’t.

I’ve blogged about this subject from several different perspectives over the past weeks, so I’ll refer recent readers to earlier posts on the subject rather than bore the regulars.

But I did want to mention a comment Seth made that I found particularly insightful. He defined enterprise search even more broadly than I do, suggesting that it encompassed any information seeking performed in the pursuit of enterprise-centric needs. In that context, he does see Google as the leader in enterprise search–not because of their enterprise offerings, but rather because of the web search they offer for free.

I’m not sure how I feel about his definition, but I think he raises a point that enterprise vendors often neglect. No matter how much information an enterprise controls, there will always be valuable information outside the enterprise. I find today’s APIs to that information woefully inadequate; for example, I can’t even choose a sort order through any of the web search APIs. But I am optimistic that those APIs will evolve, and that we will see “federated” information seeking that goes beyond merging ranked lists from different sources.
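
For reference, the baseline I’d like federated information seeking to move past, merging ranked lists from different sources, can be as simple as round-robin interleaving. The following is a minimal sketch of that baseline only; the source names and document ids are made up, and real federated search systems use more elaborate merging strategies.

```python
from itertools import zip_longest


def round_robin_merge(*ranked_lists):
    """Interleave ranked lists in turn, keeping the first occurrence of each item."""
    merged, seen = [], set()
    # zip_longest pads the shorter lists with None once they run out.
    for tier in zip_longest(*ranked_lists):
        for item in tier:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Hypothetical result lists from an internal source and a web search API.
intranet_hits = ["doc-a", "doc-b", "doc-c"]
web_hits = ["doc-b", "doc-x"]
merged = round_robin_merge(intranet_hits, web_hits)
```

Note what the merged list throws away: which source each result came from, and any structure the sources could have exposed for sorting or refinement. That loss is the limitation at issue here.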

Indeed, I look forward to the day that web search providers take a cue from the enterprise and drop the focus on black box relevance ranking in favor of an approach that offers users control and interaction.

General

Position papers for NSF IS3 Workshop

Post author By Daniel Tunkelang
Post date August 11, 2008

I just wanted to let folks know that the position papers for the NSF Information Seeking Support Systems Workshop are now available at this link.

Here is a listing to whet your curiosity:

Supporting Interaction and Familiarity
James Allan, University of Massachusetts Amherst, USA
From Web Search to Exploratory Search: Can we get there from here?
Peter Anick, Yahoo! Inc., USA
Complex and Exploratory Web Search (with Daniel Russell)
Anne Aula, Google, USA
Really Supporting Information Seeking: A Position Paper
Nicholas J. Belkin, Rutgers University, USA
Transparent and User-Controllable Personalization For Information Exploration
Peter Brusilovsky, University of Pittsburgh, USA
Faceted Exploratory Search Using the Relation Browser
Robert Capra, UNC, USA
Towards a Model of Understanding Social Search
Ed Chi, Palo Alto Research Center, USA
Building Blocks For Rapid Development of Information Seeking Support Systems
Gary Geisler, University of Texas at Austin, USA
Collaborative Information Seeking in Electronic Environments
Gene Golovchinsky, FX Palo Alto Laboratory, USA
NeoNote: User Centered Design Suggestions for a Global Shared Scholarly Annotation System
Brad Hemminger, UNC, USA
Speaking the Same Language About Exploratory Information Seeking
Bill Kules, The Catholic University of America, USA
Musings on Information Seeking Support Systems
Michael Levi, U.S. Bureau of Labor Statistics, USA
Social Bookmarking and Information Seeking
David Millen, IBM Research, USA
Making Sense of Search Result Pages
Jan Pedersen, Yahoo, USA
A Multilevel Science of Social Information Foraging and Sensemaking
Peter Pirolli, XEROX PARC, USA
Characterizing, Supporting and Evaluating Exploratory Search
Edie Rasmussen, University of British Columbia, Canada
The Information-Seeking Funnel
Daniel Rose, A9.com, USA
Complex and Exploratory Web Search (with Anne Aula)
Daniel Russell, Google, USA
Research Agenda: Visual Overviews for Exploratory Search
Ben Shneiderman, University of Maryland, USA
Five Challenges for Research to Support IS3
Elaine Toms, Dalhousie University, Canada
Resolving the Battle Royale between Information Retrieval and Information Science
Daniel Tunkelang, Endeca, USA

General

Why Enterprise Search Will Never Be Google-y

Post author By Daniel Tunkelang
Post date August 10, 2008

As I prepared to end my trilogy of Google-themed posts, I ran into two recently published items. They provide an excellent context for what I intended to talk about: the challenges and opportunities of enterprise search.

The first is Google’s announcement of an upgrade to their search appliance that allows one box to index 10 million documents and offers improved search quality and personalization.

The second is an article by Chris Sherman in the Enterprise Search Sourcebook 2008 entitled Why Enterprise Search Will Never Be Google-y.

First, the Google announcement. These are certainly improvements for the GSA, and Google does seem to be aiming to compete with the Big Three: Autonomy, Endeca, FAST (now a subsidiary of Microsoft). But these improvements should be seen in the context of the state of the art. In particular, Google’s scalability claims, while impressive, still fall short of the market leaders in enterprise search. Moreover, the bottleneck in enterprise search hasn’t been the scale of document indexing, but rather the effectiveness with which people can access and interact with the indexed content. Interestingly, Google’s strongest selling point for the GSA, their claim it works “out of the box”, is also its biggest weakness: even with the new set of features, the GSA does not offer the flexibility or rich functionality that enterprises have come to expect.

Second, the Chris Sherman piece. Here is an excerpt:

Enterprise search and web search are fundamentally different animals, and I’d argue that enterprise search won’t–and shouldn’t–be Google-y any time soon….Like web search, Google’s enterprise search is easy to use–if you’re willing to go along with how Google’s algorithms view and present your business information….Ironically, enterprises, with all of their highly structured and carefully organized silos of information, require a very different and paradoxically more complex approach.

I highly recommend you read the whole article (it’s only 2 pages), not only because it is informative and well written, but also because the author isn’t working for one of the Big Three.

The upshot? There is no question that Google is raising the bar for simple search in the enterprise. I wouldn’t recommend that anyone try to compete with the GSA on its turf.

But information needs in the enterprise go far beyond known-item search. What enterprises want when they ask for “enterprise search” is not just a search box, but an interactive tool that helps them (or their customers) work through the process of articulating and fulfilling their information needs, for tasks as diverse as customer segmentation, knowledge management, and e-discovery.

If you’re interested in search and want to be on the cutting edge of innovation, I suggest you think about the enterprise.

General

Where Google Isn’t Good Enough

Post author By Daniel Tunkelang
Post date August 7, 2008

My last post, Is Google Good Enough?, challenged would-be Google killers to identify and address clear consumer needs for which Google isn’t good enough as a solution. I like helping my readers, so here are some ideas.

Shopping. Google Product Search (fka Froogle) is not one of Google’s crown jewels. At best, it works well when you know the exact name of the product you are looking for. But it pales in contrast to any modern ecommerce site, such as Amazon or Home Depot. What makes a shopping site successful? Put simply, it helps users find what they want, even when they didn’t know exactly what they wanted when they started.
Finding a job. Google has not thrown its hat into the ring of job search, and even the page they offer for finding jobs at Google could use some improvement. The two biggest job sites, Monster and Careerbuilder, succeed in terms of the number of jobs posted, but aren’t exactly optimized for user experience. Dice does better, but only for technology jobs. Interestingly, the best job finding site may be LinkedIn–not because of their search implementation (which is adequate but not innovative), but because of their success in getting millions of professionals to provide high-quality data.
Finding employees. Again, LinkedIn has probably come closest to providing a good employee finding site. The large job sites (all of which I’ve used at some point) not only fail to support exploratory search, but also suffer from a skew towards ineligible candidates and a nuisance of recruiters posing as job seekers. Here again, Google has not tried to compete.
Planning a trip. Sure, you can use Expedia, Travelocity, or Kayak to find a flight, hotel, and car rental. But there’s a lot of room for improvement when it comes to planning a trip, whether for business or pleasure. The existing tools do a poor job of putting together a coordinated itinerary (e.g., meals, activities), and also don’t integrate with relevant information sources, such as local directories and reviews. This is another area where Google has not tried to play.

Note two general themes here. The first is thinking beyond the mechanics of search and focusing on the ability to meet user needs at the task level. The second is the need for exploratory search. These only scratch the surface of opportunities in consumer-facing “search” applications. The opportunities within the enterprise are even greater, but I’ll save that for my next post.

General

Is Google Good Enough?

As Chief Scientist of Endeca, I spend a lot of my time explaining to people why they should not be satisfied with an information seeking interface that only offers them keyword search as an input mechanism and a ranked list of results as output. I tell them about query clarification dialogs, faceted navigation, and set analysis. More broadly, I evangelize exploratory search and human computer information retrieval as critical to addressing the inherent weakness of conventional ranked retrieval. If you haven’t heard me expound on the subject, feel free to check out this slide show on Is Search Broken?.
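
For readers who haven’t seen faceted navigation in action, here is a minimal sketch of the interaction loop: run a keyword search, summarize the matching records by facet, and let the user narrow the set by picking a value. The catalog, field names, and function names are invented for illustration; this is not Endeca’s implementation.

```python
from collections import Counter

# Hypothetical catalog records; field names are illustrative only.
CATALOG = [
    {"name": "cordless drill", "brand": "Acme", "price_band": "under $100"},
    {"name": "hammer drill", "brand": "Bolt", "price_band": "$100-$200"},
    {"name": "drill bit set", "brand": "Acme", "price_band": "under $100"},
    {"name": "claw hammer", "brand": "Bolt", "price_band": "under $100"},
]


def keyword_search(records, query):
    """Return records whose name contains every query term."""
    terms = query.lower().split()
    return [r for r in records if all(t in r["name"].lower() for t in terms)]


def facet_summary(records, facets):
    """For each facet, count how many matching records carry each value."""
    return {f: Counter(r[f] for r in records) for f in facets}


def select(records, facet, value):
    """Narrow the current result set to one facet value."""
    return [r for r in records if r[facet] == value]


results = keyword_search(CATALOG, "drill")
summary = facet_summary(results, ["brand", "price_band"])
acme_drills = select(results, "brand", "Acme")
```

The facet summary is what distinguishes this from a ranked list: it describes the whole result set and tells the user which refinements are available before they commit to one.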

But today I wanted to put my ideology aside and ask the simple question: Is Google good enough? Here is a good faith attempt to make the case for the status quo. I’ll focus on web search, since, as I’ve discussed before on this blog, enterprise search is different.

1) Google does well enough on result quality, enough of the time.

While Google doesn’t publish statistics about user satisfaction, it’s commonplace that Google usually succeeds in returning results that users find relevant. Granted, so do all of the major search engines: you can compare Google and Yahoo graphically at this site. But the question is not whether other search engines are also good enough–or even whether they are better. The point is that Google is good enough.

2) Google doesn’t support exploratory search. But it often leads you to a tool that does.

The classic instance of this synergy is when Google leads you to a Wikipedia entry. For example, I look up Daniel Kahneman on Google. The top result is his Wikipedia entry. From there, I can traverse links to learn about his research areas, his colleagues, etc.

3) Google is a benign monopoly that mitigates choice overload.

Many people, myself included, have concerns about Google’s increasing role in mediating our access to information. But it’s hard to ignore the upside of a single portal that gives you access to everything in one place: web pages, blogs, maps, email, etc. And it’s all “free”–at least in so far as ad-supported services can be said to be free.

In summary, Google sets the bar pretty high. There are places where Google performs poorly (e.g., shopping) or doesn’t even try to compete (e.g., travel). But when I see the series of companies lining up to challenge Google, I have to wonder how many of them have identified and addressed clear consumer needs for which Google isn’t good enough as a solution. Given Google’s near-monopoly in web search, parity or even incremental advantage isn’t enough.

General

Not as Cuil as I Expected

Today’s big tech news is the launch of Cuil, the latest challenger to Google’s hegemony in Web search. Given the impressive team of Xooglers that put it together, I had high expectations for the launch.

My overall reaction: not bad, but not good enough to take seriously as a challenge to Google. They may be “The World’s Biggest Search Engine” based on the number of pages indexed, but they return zero results for a number of queries where Google does just fine, including noisy channel blog (compare to Google). But I’m not taking it personally–after all, their own site doesn’t show up when you search for their name (again, compare to Google). As for their interface features (column display, explore by category, query suggestions), they’re fine, but neither the concepts nor the quality of their implementation strike me as revolutionary.

Perhaps I’m expecting too much on day 1. But they’re not just trying to beat Gigablast; they’re trying to beat Google, and they surely expected to get lots of critical attention the moment they launched. Regardless of the improvements they’ve made in indexing, they clearly need to do more work on their crawler. It’s hard to judge the quality of results when it’s clear that at least some of the problem is that the most relevant documents simply aren’t in their index. I’m also surprised to not see Wikipedia documents showing up much for my searches–particularly for searches where I’m quite sure the most relevant document is in Wikipedia. Again, it’s hard to tell if this is an indexing or results quality issue.

I wish them luck–I speak for many in my desire to see Google face worthy competition in web search.

General

Catching up on SIGIR ’08

Now that SIGIR ’08 is over, I hope to see more folks blogging about it. I’m jealous of everyone who had the opportunity to attend, not only because of the culinary delights of Singapore, but because the program seems to reflect an increasing interest of the academic community in real-world IR problems.

Some notes from looking over the proceedings:

Of the 27 paper sessions, 2 include the word “user” in their titles, 2 include the word “social”, 2 focus on Query Analysis & Models, and 1 is about exploratory search. Compared to the last few SIGIR conferences, this is a significant increase in focus on users and interaction.
A paper on whether test collections predict users’ effectiveness offers an admirable defense of the Cranfield paradigm, much along the lines I’ve been advocating.
A nice paper from Microsoft Research looks at the problem of whether to personalize results for a query, recognizing that not all queries benefit from personalization. This approach may well be able to reap the benefits of personalization while avoiding much of its harm.
Two papers on tag prediction: Real-time Automatic Tag Recommendation (ACM Digital Library subscription required) and Social Tag Prediction. Semi-automated tagging tools are one of the best ways to leverage the best of both human and machine capabilities.
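
To make the personalization question concrete: one signal often discussed for deciding whether a query is a good candidate for personalization is click entropy, i.e., how spread out different users’ clicks are across the results for the same query. The sketch below illustrates that idea only. It is not the method of the paper mentioned above, and the click logs and the threshold are invented.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the click distribution for one query."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_personalize(clicked_urls, threshold=1.0):
    """Hypothetical rule: personalize only when users disagree about results."""
    return click_entropy(clicked_urls) > threshold


# Made-up click logs: everyone clicks the same result for a navigational
# query, while clicks for an ambiguous query are spread across four results.
navigational = ["site.example/home"] * 4
ambiguous = ["a.example", "b.example", "c.example", "d.example"]

personalize_navigational = should_personalize(navigational)
personalize_ambiguous = should_personalize(ambiguous)
```

When everyone clicks the same result, the entropy is zero and there is nothing for personalization to improve; when clicks are spread evenly, different users evidently want different things.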

And I haven’t even gotten to the posters! I’m sad to see that they dropped the industry day, but perhaps they’ll bring it back next year in Boston.