Categories
General

“In Quotes” by Google Labs and Community Journalism

I was just checking out the latest Google Labs release: In Quotes. As described on their FAQ:

The “In Quotes” feature allows you to find quotes from stories linked to from Google News. These quotations are a valuable resource for understanding where people in the news stand on various issues. Much of the published reporting about people is based on the interpretation of a journalist. Direct quotes, on the other hand, are concrete units of information that describe how newsmakers represent themselves. Google News compiles these quotations from online news stories and sorts them into browsable groups based on who is being quoted.
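
To get a feel for the mechanics, here is a toy sketch of my own (nothing like Google’s actual pipeline, and the attribution pattern is far cruder than whatever they use): extract direct quotes with a naive regular expression and group them by the person they are attributed to.

```python
import re
from collections import defaultdict

# Naive attribution pattern: a quoted span followed by `said <Name>`.
# Real systems need much more robust natural language processing.
QUOTE_PATTERN = re.compile(
    r'"([^"]+)"\s*,?\s*said\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)'
)

def quotes_by_speaker(articles):
    """Group direct quotes by the person they are attributed to."""
    grouped = defaultdict(list)
    for article in articles:
        for quote, speaker in QUOTE_PATTERN.findall(article["text"]):
            grouped[speaker].append({"quote": quote, "source": article["url"]})
    return grouped

# Hypothetical input: articles as dicts with a URL and plain text.
articles = [
    {"url": "http://example.com/story1",
     "text": 'Asked about the merger, "We remain committed to transparency", said Jane Doe.'},
]
for speaker, quotes in quotes_by_speaker(articles).items():
    print(speaker, quotes)
```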

Here’s a screenshot to give you a feeling for the application:

This reminds me of an idea I once discussed with Craig Newmark after hearing a talk by Miles Efron about using cocitation information to estimate the political orientation of web documents. I’d just heard of Craig’s interest in community journalism, and I thought he might be persuaded to consider a new way of automatically presenting news as neutral happenings (perhaps obtained through passage retrieval algorithms on the news stream) through a variety of ideological lenses. I’m not sure that very many people would want to hear points of view in conflict with their own, but this is precisely what I feel people need to hear. The conversation never led to anything concrete, but it’s still something I muse about.

Categories
General

Knol vs. Wikipedia: A Follow-Up

Like most of the blogosphere, I greeted the debut of Google’s Knol in July with deep skepticism. Perhaps two months is too soon to judge their endeavor, even in internet time, but I’m inclined to agree with Farhad Manjoo at Slate that Knol will never be as good as Wikipedia.

I maintain, as does Udi Manber at Google, that anonymity is overrated. All else equal, I’d like to know whose writing I’m reading. Author–or, in Wikipedia’s case, editor–reputation is a valuable signal, and Wikipedia all but obliterates it. To be sure, a reader can track a non-anonymous editor’s contributions to Wikipedia. But Wikipedia hardly facilitates or encourages readers to pay attention to the identities of editors.

Still, Knol does not appear to be a credible alternative to Wikipedia, let alone a competitive threat. From all accounts, Wikipedia offers not only much greater quantity, but also higher quality. Why?

Here are my speculations, mostly borrowed from the conventional wisdom:

  1. First-mover advantage. Much of the information in Wikipedia is good enough and is easily found. Even if you write a better article elsewhere, few people will care. And those who do might suggest you could have improved the Wikipedia entry instead.
  2. Non-financial motivations. According to Manjoo, most Knol authors are financially motivated. In contrast, Wikipedia authors have no hope of obtaining direct financial benefit from their contributions–and Wikipedia strongly discourages contributions that reflect a conflict of interest. As a result, those who have contributed to Wikipedia have done so from non-financial motivation, and there are numerous studies suggesting that non-financial motivations trump financial ones.
  3. Ease of collective editing. Wikipedia makes it easy–perhaps too easy–to edit an entry. In contrast, on Knol, an edit must be accepted by the original author before it takes effect. I know from my own experience with moderated comment threads that the delay is often sufficient to quash my initiative to contribute.

Perhaps it is premature to write Knol’s obituary. But I agree with Manjoo’s conclusion: “The problem is that we don’t need the next Wikipedia. Today’s version works amazingly well.”

Categories
General

Information Accountability

The recent United Airlines stock fiasco triggered an expected wave of finger pointing. For those who didn’t follow the event, here is the executive summary:

    In the wee hours of Sunday, September 7th, The South Florida Sun-Sentinel (a subsidiary of the Tribune Company) included a link to an article entitled “UAL Files for Bankruptcy.” The link was legit, but the linked article, which dated from 2002, did not carry its publication date. Then Google’s news bot picked up the article and automatically assigned it a current date. Furthermore, Google sent the link to anyone with an alert set up for news about United. Then, on Monday, September 8th, someone at Income Security Advisors saw the article in the results for a Google News search and sent it out on Bloomberg. The results are in the picture below, courtesy of Bloomberg by way of the New York Times.

    For anyone who wants all of the gory details, Google’s version of the story is here; the Tribune Company’s version is here.

I’ve spent the past week wondering about this event from an information access perspective. And then today I saw two interesting articles:

  • The first was a piece in BBC News about a speech by Sir Tim Berners-Lee expressing concern that the internet needs a way to help people separate rumor from real science. His examples included the fears about the Large Hadron Collider at CERN creating a black hole that would swallow up the earth (which isn’t quite the premise of Dan Brown’s Angels and Demons), and rumors that a vaccine given to children in Britain was harmful.
  • The second was a column in the New York Times about the dynamics of the US presidential campaign, where Adam Nagourney notes that “senior campaign aides say they are no longer sure what works, as they stumble through what has become a daily campaign fog, struggling to figure out what voters are paying attention to and, not incidentally, what they are even believing.”

I see a common thread here that I’d like to call “information accountability.” I don’t mean this term in the sense of a recent CACM article about information privacy and sensitivity, but rather in the sense of information provenance and responsibility.

Whether we’re worrying about Google bombing, Google bowling, or what Gartner analyst Whit Andrews calls “denial-of-insight” attacks, our concern is that information often arrives with implicit authority. Despite the aphorism telling us “don’t believe everything you read,” most of us select news and information sources with some hope that they will be authoritative. Whether the motto is “all the news that’s fit to print” or “don’t be evil”, our choice of sources we believe to be authoritative is a necessary heuristic to avoid subjecting everything we read to endless skeptical inquiry.

But sometimes the most reputable news sources get it wrong. Or perhaps “wrong” is the wrong word. When newspapers reported that the FBI was treating Richard Jewell as a “person of interest” in the Centennial Olympic Park bombing (cf. “Olympic Park Bomber” Eric Robert Rudolph), they weren’t lying, but rather were communicating information from what they believed to be a reliable source. And, in turn, the FBI may have been correctly doing its job, given the information it had. But there’s no question that Jewell suffered tremendously from his “trial by media” before his name was ultimately cleared.

It’s tempting to react to these information breakdowns with finger-pointing, to figure out who is accountable and, in as litigious a society as the United States, bring on the lawyers. Moreover, there clearly are cases where willful misinformation constitutes criminal defamation or fraud. But I think we need to be careful, especially in a world where information flows in a highly connected–and not necessarily acyclic–social graph. Anyone who has played the children’s game of telephone knows that small communication errors can blow up rapidly, and that it’s difficult to partition blame fairly.

The simplest answer is that we are accountable for how we consume information: caveat lector. But this model seems overly simplistic, since our daily lives hinge on our ability to consume information without so skeptical an eye that we can accept nothing at face value. Besides, shouldn’t we hold information providers responsible for living up to the reputations they cultivate and promote?

There are no easy answers here. But the bad news is that we cannot ignore the questions of information accountability. If terms like “social media” and “web 2.0” mean anything, they surely tell us that the game of telephone will only grow in the number of participants and in the complexity of the communication chains. As a society, we will have to learn to live with and mitigate the fallout.

Categories
General

Is Blog Search Different?

Alerted by Jeff and Iadh, I recently read What Should Blog Search Look Like?, a position paper by Marti Hearst, Matt Hurst, and Sue Dumais. For those readers unfamiliar with this triumvirate, I suggest you take some time to read their work, as they are heavyweights in some of the areas most often covered by this blog.

The position paper suggests focusing on three kinds of search tasks:

  1. Find out what people are thinking or feeling about X over time.
  2. Find good blogs/authors to read.
  3. Find useful information that was published in blogs sometime in the past.

The authors generally recommend the use of faceted navigation interfaces–something I’d hope would be uncontroversial by now for search in general.

But I’m more struck by their criticism that existing blog search engines fail to leverage the special properties of blog data, and by their discussion, based on work by Mishne and de Rijke, of how blog search queries differ substantially from web search queries. I don’t doubt the data they’ve collected, but I’m curious whether their results account for the rapid proliferation and mainstreaming of blogs. The lines between blogs, news articles, and informational web pages seem increasingly blurred.

So I’d like to turn the question around: what should blog search look like that is not applicable to search in general?

Categories
General

Incentives for Active Users

Some of the most successful web sites today are social networks, such as Facebook and LinkedIn. These are not only popular web sites; they are also remarkably effective people search tools. For example, I can use LinkedIn to find the 163 people in my network who mention “information retrieval” in their profiles and live within 50 miles of my ZIP code (I can’t promise you’ll see the same results!).

A couple of observations about social networking sites (I’ll focus on LinkedIn) are in order.

First, this functionality is a very big deal, and it’s something Google, Yahoo, and Microsoft have not managed to provide, even though their own technology is largely built on a social network–citation ranking.

Second, the “secret sauce” for sites like LinkedIn is hardly their technology (a search engine built on Lucene and a good implementation of breadth-first search), but rather the way they have incented users to be active participants, in everything from virally marketing the site to their peers to inputting high-quality semi-structured profiles that make the site useful. In other words, active users ensure both the quantity and quality of information on the site.
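
To make that division of labor concrete, here is a minimal sketch of the graph side (a toy in-memory network of my own invention, not LinkedIn’s architecture): breadth-first search computes degrees of separation, and a simple keyword filter stands in for the full-text profile search that an engine like Lucene would actually handle.

```python
from collections import deque

# Toy connection graph and profiles, purely for illustration.
connections = {
    "me": ["alice", "bob"],
    "alice": ["me", "carol"],
    "bob": ["me", "dave"],
    "carol": ["alice"],
    "dave": ["bob"],
}
profiles = {
    "alice": "software engineer, information retrieval",
    "bob": "sales director",
    "carol": "research scientist, information retrieval",
    "dave": "information retrieval consultant",
}

def degrees_of_separation(start, graph, max_degree=3):
    """Breadth-first search: distance from start, up to max_degree hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distances[person] >= max_degree:
            continue
        for neighbor in graph.get(person, []):
            if neighbor not in distances:
                distances[neighbor] = distances[person] + 1
                queue.append(neighbor)
    return distances

def people_search(start, keyword, graph, profiles, max_degree=3):
    """People within max_degree hops whose profile mentions the keyword."""
    distances = degrees_of_separation(start, graph, max_degree)
    return sorted(
        (degree, person)
        for person, degree in distances.items()
        if person != start and keyword in profiles.get(person, "")
    )

print(people_search("me", "information retrieval", connections, profiles))
# [(1, 'alice'), (2, 'carol'), (2, 'dave')]
```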

Many people have noted the network effect that drove the run-away success of Microsoft Office and eBay. But I think that social networking sites are taking this idea further, because users not only flock to the crowds, but also become personally invested in the success of the site generally and, especially, in the quality and accuracy of their personal information.

Enterprises need to learn from these consumer-oriented success stories. Some have already. For example, a couple of years ago, IBM established a Professional Marketplace, powered by Endeca, to maintain a skills and availability inventory of IBM employees. This effort was a run-away success, saving IBM $500M in its first year. But there’s more: IBM employees have reacted to the success of the system by being more active in maintaining their own profiles. I spent the day with folks at the ACM, and they’re seeing great uptake in their author profile pages.

I’ve argued before that there’s no free lunch when it comes to enterprise search and information access. The good news, however, is that, if you create the right incentives, you can get other folks to happily pay for lunch.

Categories
General

Query Elaboration as a Dialogue

I ended my post on transparency in information retrieval with a teaser: if users aren’t great at composing queries for set retrieval, which I argue is more transparent than ranked retrieval, then how will we ever deliver an information retrieval system that offers both usefulness and transparency?

The answer is that the system needs to help the user elaborate the query. Specifically, the process of composing a query should be a dialogue between the user and the system that allows the user to progressively articulate and explore an information need.

Those of you who have been reading this blog for a while or who are familiar with what I do at Endeca shouldn’t be surprised to see dialogue as the punch line. But I want to emphasize that the dialogue I’m describing isn’t just a back-and-forth between the user and the system. After all, there are query suggestion mechanisms that operate in the context of ranked retrieval algorithms–algorithms which do not offer the user transparency. While such mechanisms sometimes work, they risk doing more harm than good. Any interactive approach requires the user to do more work; if this added work does not result in added effectiveness, users will be frustrated.

That is why the dialogue has to be based on a transparent retrieval model–one where the system responds to queries in a way that is intuitive to users. Then, as users navigate in query space, transparency ensures that they can make informed choices about query refinement and thus make progress. I’m partial to set retrieval models, though I’m open to probabilistic ones. 
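
As a toy illustration of the kind of dialogue I have in mind (a sketch of my own, with made-up documents and a single facet, not a description of any particular product), the system can answer the current query with its exact result count and with candidate refinements whose effects are equally transparent:

```python
from collections import Counter

# Toy corpus: each document carries its terms and some facet metadata.
docs = [
    {"id": 1, "terms": {"jaguar", "car"}, "topic": "autos"},
    {"id": 2, "terms": {"jaguar", "cat"}, "topic": "wildlife"},
    {"id": 3, "terms": {"jaguar", "car", "review"}, "topic": "autos"},
    {"id": 4, "terms": {"jaguar", "os"}, "topic": "software"},
]

def matching_set(query_terms):
    """Set retrieval: exactly the documents containing all query terms."""
    return [d for d in docs if query_terms <= d["terms"]]

def suggest_refinements(query_terms):
    """Offer facet values that would narrow the current result set,
    along with the exact count each refinement would yield."""
    results = matching_set(query_terms)
    return len(results), Counter(d["topic"] for d in results)

total, refinements = suggest_refinements({"jaguar"})
print(f"{total} results; narrow by topic:")
for value, count in refinements.most_common():
    print(f"  topic={value} ({count})")
```
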
But of course we’ve just shifted the problem. How do we decide what query refinements to offer to a user in order to support this progressive refinement process? Stay tuned…

Categories
General

Transparency in Information Retrieval

It’s been hard to find time to write another post while keeping up with the comment stream on my previous post about set retrieval! I’m very happy to see this level of interest, and I hope to continue catalyzing such discussions.

Today, I’d like to discuss transparency in the context of information retrieval. Transparency is an increasingly popular term these days in the context of search–perhaps not surprising, since users are finally starting to question the idea of search as a black box.

The idea of transparency is simple: users should know why a search engine returns a particular response to their query. Note the emphasis on “why” rather than “how”. Most users don’t care what algorithms a search engine uses to compute a response. What they do care about is how the engine ultimately “understood” their query–in other words, what question the engine thinks it’s answering.

Some of you might find this description too anthropomorphic. But a recent study reported that most users expect search engines to read their minds–never mind that the general case goes beyond AI-complete (should we create a new class of ESP-complete problems?). What frustrates users most, though, is when a search engine not only fails to read their minds, but also gives no indication of where the communication broke down, let alone how to fix it. In short, a failure to provide transparency.

What does this have to do with set retrieval vs. ranked retrieval? Plenty!

Set retrieval predates the Internet by a few decades, and was the first approach used to implement search engines. These search engines allowed users to enter queries by stringing together search terms with Boolean operators (AND, OR, etc.). Today, Boolean retrieval seems arcane, and most people see set retrieval as suitable for querying databases, rather than for querying search engines.

The biggest problem with set retrieval is that users find it extremely difficult to compose effective Boolean queries. Nonetheless, there is no question that set retrieval offers transparency: what you ask is what you get. And, if you prefer a particular sort order for your results, you can specify it.
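
To make “what you ask is what you get” concrete, here is a minimal sketch of Boolean set retrieval over a toy inverted index (my own illustration, not any particular engine), in which AND and OR are literally set intersection and union:

```python
# Toy corpus and inverted index, purely for illustration.
docs = {
    1: "boolean retrieval with an inverted index",
    2: "ranked retrieval with relevance scores",
    3: "boolean queries over structured databases",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def AND(*posting_sets):
    """Boolean AND: intersection of posting sets."""
    return set.intersection(*posting_sets)

def OR(*posting_sets):
    """Boolean OR: union of posting sets."""
    return set.union(*posting_sets)

# (boolean AND retrieval) OR databases
result = OR(AND(index["boolean"], index["retrieval"]), index["databases"])
print(sorted(result))  # [1, 3]: exactly the documents satisfying the expression
```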

In contrast, ranked retrieval makes it much easier for users to compose queries: users simply enter a few top-of-mind keywords. And for many use cases (in particular, known-item search), a state-of-the-art implementation of ranked retrieval yields results that are good enough.

But ranked retrieval approaches generally shed transparency. At best, they employ standard information retrieval models that, although published in all of their gory detail, are opaque to their users–who are unlikely to be SIGIR regulars. At worst, they employ secret, proprietary models, either to protect their competitive differentiation or to thwart spammers.

Either way, the only clues that most ranked retrieval engines provide to users are text snippets from the returned documents. Those snippets may validate the relevance of the results that are shown, but the user does not learn what distinguishes the top-ranked results from other documents that contain some or all of the query terms.

If the user is satisfied with one of the top results, then transparency is unlikely to even come up. Even if the selected result isn’t optimal, users may do well to satisfice. But when the search engine fails to read the user’s mind, transparency offers the best hope of recovery.

But, as I mentioned earlier, users aren’t great at composing queries for set retrieval, which was how ranked retrieval became so popular in the first place despite its lack of transparency. How do we resolve this dilemma?

To be continued…

Categories
General

Set Retrieval vs. Ranked Retrieval

After last week’s post about a racially targeted web search engine, you’d think I’d avoid controversy for a while. To the contrary, I now feel bold enough to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I’ll start by introducing the terms.

  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.
  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.

An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.
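
Here is a minimal sketch of that combination (a toy corpus and a deliberately naive scoring function of my own, not how any production engine actually scores): the set retrieval step decides which documents match at all, and the ranking step only orders documents within that set.

```python
# Toy corpus, purely for illustration.
docs = {
    1: "set retrieval partitions the corpus into matching and non-matching documents",
    2: "ranked retrieval orders every document by its estimated relevance",
    3: "industrial engines first match a set and then rank within it",
    4: "an unrelated document about cooking",
}

def match(query_terms):
    """Set retrieval step: the documents containing every query term."""
    return {doc_id for doc_id, text in docs.items()
            if all(term in text.split() for term in query_terms)}

def rank(doc_ids, query_terms):
    """Ranking step: order the matching set (here by naive term frequency)."""
    def score(doc_id):
        words = docs[doc_id].split()
        return sum(words.count(term) for term in query_terms)
    return sorted(doc_ids, key=score, reverse=True)

query = ["retrieval"]
matching = match(query)   # a well-defined set: its size is a meaningful count
print(len(matching), rank(matching, query))
```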

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:

  • The number of documents reported to match my search should be meaningful–or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.
  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.

Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. This weakness makes it impossible to offer even basic analysis of the query results, such as the number of relevant documents, let alone more sophisticated measures, such as result quality. A set retrieval model, in contrast, does not rank the retrieved documents; it establishes a clear split between the documents that are in the retrieved set and those that are not. As a result, set retrieval models enable rich analysis of query results, which can then be applied to improve the user experience.

Categories
General

Thinking Outside the Black Box

I was reading Techmeme today, and I noticed an LA Times article about RushmoreDrive, described on its About Us page as “a first-of-its-kind search engine for the Black community.” My first reaction, blogged by others already, was that this idea was dumb and racist. In fact, it took some work to find positive commentary about RushmoreDrive.

But I’ve learned from the way the blogosphere handled the Cuil launch not to trust anyone who evaluates a search engine without having tried it, myself included. My wife and I have been the only white people at Amy Ruth’s, and the service was as gracious as the chicken and waffles were delicious; I decided I’d try my luck on a search engine not targeted at my racial profile.

The search quality is solid, comparable to that of Google, Yahoo, and Microsoft. In fact, the site looks a lot like a re-skinning (no pun intended) of Ask.com, a corporate sibling of IAC-owned RushmoreDrive. Like Ask.com, RushmoreDrive emphasizes search refinement through narrowing and broadening refinements.

What I find ironic is that the whole controversy about racial bias in relevance ranking reveals the much bigger problem–that relevance ranking should not be a black box (ok, maybe this time I’ll take responsibility for the pun). I’ve been beating this drum at The Noisy Channel ever since I criticized Amit Singhal for Google’s lack of transparency. I think that sites like RushmoreDrive are inevitable if search engines refuse to cede more control of search results to users.

I don’t know how much information race provides as a prior to influence statistical ranking approaches, but I’m skeptical that the effects are useful or even noticeable beyond a few well-chosen examples. I’m more inclined to see RushmoreDrive as a marketing ploy by the folks at IAC–and perhaps a successful one. I doubt that Google is running scared, but I think this should be a wake-up call to folks who are convinced that personalized relevance ranking is the end goal of user experience for search engines.

Categories
General

David Huynh’s Freebase Parallax

One of the perks of working in HCIR is that you get to meet some of the coolest people in academic and industrial research. I met David Huynh a few years ago, while he was a graduate student at MIT, working in the Haystack group and on the Simile project. You’ve probably seen some of his work: his Timeline project has been deployed all over the web.

Despite efforts by me and others to persuade David to stay in the Northeast, he went out west a few months ago to join Metaweb, a company with ambitions “to build a better infrastructure for the Web.” While I, like others, am not persuaded by Freebase, Metaweb’s “open database of the world’s information,” I am happy to see that David is still doing great work.

I encourage you to check out David’s latest project: Freebase Parallax. In it, he does something I’ve never seen outside Endeca (excepting David’s earlier work on a Nested Faceted Browser): he allows you to navigate using the facets of multiple entity types, joining between sets of entities through their relationships. At Endeca, we call this “record relationship navigation”–we presented it at HCIR ’07, showing how it can enable social navigation.
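
For those who have not watched the video yet, here is the underlying idea as I understand it, reduced to a toy sketch of my own (the data and field names are made up; this is not Metaweb’s API): start from a faceted set of entities of one type, then pivot across a relationship to obtain the related set of a different entity type, which has facets of its own.

```python
# Toy entity data, purely for illustration.
films = [
    {"title": "Film A", "genre": "sci-fi", "director": "Director X"},
    {"title": "Film B", "genre": "sci-fi", "director": "Director Y"},
    {"title": "Film C", "genre": "drama", "director": "Director X"},
]
directors = {
    "Director X": {"nationality": "French"},
    "Director Y": {"nationality": "Japanese"},
}

# Step 1: facet the set of films (entity type: film).
scifi_films = [f for f in films if f["genre"] == "sci-fi"]

# Step 2: pivot across the "director" relationship, yielding a set of a
# different entity type (directors), not just a list of film results.
scifi_directors = {f["director"] for f in scifi_films}

# Step 3: the new set comes with its own facets to continue navigating.
by_nationality = {d: directors[d]["nationality"] for d in scifi_directors}
print(by_nationality)
```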

David includes a video where he eloquently demonstrates how Parallax works, and the interface is quite compelling. I’m not sure how well it scales with large data sets, but David’s focus has been on interfaces rather than systems. My biggest complaint–which isn’t David’s fault–is that the Freebase content is a bit sparse. But his interface strikes me as a great fit for exploratory search.