Categories
General

The Raging Debate Over The Link Economy

Arnon Mishkin wrote a post last Thursday on paidContent called “The Fallacy Of The Link Economy” that has been generating a lot of discussion, so I figured I’d join in the free-for-all. First, let me try to reduce each person’s argument to a direct quote that best sums up his position.

Arnon Mishkin:

The vast majority of the value gets captured by aggregators linking and scraping rather than by the news organizations that get linked and scraped.

Jeff Jarvis:

Links are worth what the recipient makes of them.

Mike Masnick:

It’s not the link alone that has value or the story alone that has value, but the overall process of building a community.

Erick Schonfeld:

If a news site or a blog can say enough interesting things enough times that news aggregators (or other sites) keep linking to them, then they can build up their brand and reader loyalty.

Sigh. I thought the health care debate was bad enough, but I suppose that almost all impassioned debates come down to opposing sides exchanging half-truths.

In Mishkin’s defense: news organizations are in a catch-22. Many have suggested that if a news organization doesn’t want its content showing up on aggregators’ sites, it simply has to modify robots.txt accordingly. But news organizations can only do so individually–which puts them in a prisoner’s dilemma. Anti-trust law prevents news organizations from collectively bargaining with those who aggregate their content. For all intents and purposes, they are forced to abide by the status quo.
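For what it’s worth, the individual opt-out really is just a couple of lines of robots.txt (the crawler name below is a made-up placeholder, not any real aggregator’s user agent):

```
# robots.txt at the site root: tell one aggregator's crawler to stay out,
# while leaving every other crawler unaffected.
User-agent: ExampleAggregatorBot
Disallow: /
```

The mechanism is trivial; the prisoner’s dilemma is that no single paper wants to be the only one to use it.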

In Jarvis’s defense (yes, I’m actually defending Jeff Jarvis!): there isn’t much point in producing content for which most of the value is captured in a teaser so small as to be covered under fair use rights. As he’s said elsewhere, newspapers are inefficient, and the industry will have to shrink a lot to be healthy.

In Masnick’s defense: I cite my own blog post (also inspired by one of his posts) about monetizing community because participation is inherently uncopiable. It’s hard for me to agree with him more strongly than that!

In Schonfeld’s defense: his argument sounds a lot like the “freemium” strategy, which has a respectable track record. In order to build a loyal customer base, you often need to give away free trials as teasers–and that’s effectively what happens when media sites make some of their content available through aggregators. And, as in the freemium model, the actual product has to be significantly more interesting than the free teaser to earn the consumer’s investment–whether that investment is in the form of money, attention, or loyalty.

So, do I agree with them all? Not exactly. Mishkin’s first prescription to news organizations should probably be to cut investment in undifferentiated content. Jarvis should acknowledge that the inability of news organizations to collectively bargain is unfair to them. Masnick–well, I basically do agree with him on the limited point he’s making. I suppose the strongest objection would be that not all media sites should be forced to become communities just because they’re hobbled in their ability to negotiate the monetization of the content they produce. And Schonfeld’s argument assumes the current link economy as a given–and one of the biggest points of contention is whether news organizations should be allowed to try to change that economy.

Sadly, I don’t see any of these guys giving the others an inch, which is why this discussion will probably continue unchanged for the foreseeable future. Hopefully the passion of the debate helps sell, um, papers.

Categories
General

Why Does Google Hold Back On Faceted Search?

Sometimes the response to a comment is worthy of an entire post, and this is one of those times. In response to my recent post about Able Grape, a wine search engine developed by Doug Cook (now Director of Twitter Search), Lee asked:

Let’s say I know almost nothing about wines/digital cameras/cars and a search site offers me “options” to drill down. However, I can’t use those effectively and eventually it comes down to availability and price for me. My questions are what are your thoughts on these kinds of situations and is there a scientific explanation/theory on this case?

This may be why Google does not endorse faceted search except for experimental projects.

It’s a great question. There’s been a lot of research on how people make decisions when they have to manage trade-offs among multiple attributes, and the increasing interest in behavioral economics since Daniel Kahneman won the Nobel Prize in 2002 has helped some of that research percolate into the mainstream, thanks to bestsellers like Freakonomics and Dan Ariely’s Predictably Irrational.

The short answer is that there’s no point in offering users options that they can’t (or won’t) use effectively. Choice overload is certainly a problem, and our reaction to it is to satisfice, typically resorting to “fast and frugal” heuristics that throw out most of the potential decision criteria and instead focus on one or two attributes, e.g., price and availability.

But that’s no reason to dumb down the data we make available to decision makers. We make hard choices all the time, and fast and frugal can be horrendously suboptimal. We don’t hire employees based solely on their price and availability–or at least good employers don’t! For that matter, I don’t think most people pick wines that way, given that even Trader Joe’s has to diversify beyond “Two Buck Chuck”. And, while there’s probably more of a market for cheap cameras and cars, I’m pretty sure you’re an extreme outlier if you completely ignore other criteria.
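To make that suboptimality concrete, here’s a toy sketch (the wines and attributes are invented, not drawn from any study) of a fast-and-frugal chooser that satisfices on availability and price and ignores quality entirely:

```python
# A "fast and frugal" lexicographic choice: throw out most decision
# criteria and decide on one or two cues in order (availability, then price).
wines = [
    {"name": "A", "price": 12, "in_stock": True,  "rating": 95},
    {"name": "B", "price": 8,  "in_stock": False, "rating": 90},
    {"name": "C", "price": 8,  "in_stock": True,  "rating": 70},
]

def fast_and_frugal(options):
    """Pick the cheapest in-stock option, ignoring every other attribute."""
    available = [o for o in options if o["in_stock"]]
    return min(available, key=lambda o: o["price"])

print(fast_and_frugal(wines)["name"])  # → C
```

The heuristic lands on wine C and never even looks at the rating column–which is exactly why the much better-rated wine A loses.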

That said, there are some caveats about exposing options to users. Faceted search is hard, especially on the open web. Take it from the folks at Microsoft Research–but I’m sure Googlers would be the first to agree, especially given their experience with projects like Google Squared that, while promising, are nowhere near ready for prime time.

I appreciate that Google is conservative about embracing faceted search–and HCIR in general. I’m actually impressed by the steadily improving quality of their related terms for search queries–even if they do hide them behind two clicks (show options -> related searches). Perhaps they’re feeling some pressure from Bing. But I think they’re largely following the dictum of “if it ain’t broke, don’t fix it”. Google is an extremely successful company. And, as Clayton Christensen argues, successful companies are great at incremental innovation and bad at disruptive innovation. As far as I can tell, faceted search is very disruptive to their model.

Categories
Uncategorized

Google’s Chief Economist Hal Varian Talks Stats 101

In an interview with CNET’s Tom Krazit, Google Chief Economist Hal Varian made a nice argument regarding the relative advantages of scale to a search engine:

On this data issue, people keep talking about how more data gives you a bigger advantage. But when you look at data, there’s a small statistical point that the accuracy with which you can measure things as they go up is the square root of the sample size. So there’s a kind of natural diminishing returns to scale just because of statistics: you have to have four times as big a sample to get twice as good an estimate.

Another point that I think is very important to remember…query traffic is growing at over 40 percent a year. If you have something that is growing at 40 percent a year, that means it doubles in two years.

So the amount of traffic that Yahoo, say, has now is about what Google had two years ago. So where’s this scale business? I mean, this is kind of crazy.

The other thing is, when we do improvements at Google, everything we do essentially is tested on a 1 percent or 0.5 percent experiment to see whether it’s really offering an improvement. So, if you’re half the size, well, you run a 2 percent experiment.

For those unfamiliar with statistics, I encourage you to look at the Wikipedia entry on standard deviation. Varian is obviously reducing the argument to a sound bite, but the sound bite rings true. More is better, but there are dramatically diminishing returns at the scale of either Microsoft or Google.
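Varian’s square-root point is easy to verify with the textbook formula for the standard error of an estimated rate (the click-through rate below is just an illustrative number):

```python
import math

def standard_error(p, n):
    """Standard error of an estimated rate p (e.g., a click-through rate)
    measured over n observations: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.1  # hypothetical click-through rate
for n in (1_000, 4_000, 16_000):
    print(f"n = {n:>6}: standard error = {standard_error(p, n):.5f}")

# Quadrupling the sample only halves the error: diminishing returns to scale.
```

Since the error shrinks as 1/√n, each doubling of accuracy costs a quadrupling of data–which is the crux of Varian’s argument.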

However, I do think there’s a big difference when you start talking about running lots of experiments on small subsets of your users. The ability to run twice as many simultaneous tests without noticeably disrupting the overall user experience is a major competitive advantage. But even there, quality trumps quantity–how you choose what to test matters a lot more than how many tests you run.

What does strike me as ironic is that the moral here is a great counterpoint to Varian’s colleagues’ arguments about the “unreasonable effectiveness of data”. Granted, it’s apples and oranges–Alon Halevy, Peter Norvig, and Fernando Pereira are talking about data scale, not user scale. Still, the same arguments apply. Sampling is sampling.

P.S. Also check out Nick Carr’s commentary here.

Categories
Uncategorized

UIE Virtual Seminar on Faceted Search

My colleague, Endeca co-founder Pete Bell, and I are giving a virtual seminar on faceted search next week for User Interface Engineering (UIE). It’s on Thursday, August 20th at 1:30PM EST. The regular price is $129, but Noisy Channel readers who are interested in attending can get a $30 discount by using TUNKELANG (yes, all caps) as a promo code. Attendees also receive a free copy of my book, Faceted Search.

Whether or not you can attend, I do encourage you to check out the UIE site. It’s got a lot of free, useful content, and Jared Spool is definitely someone worth following if you are interested in web usability.

Categories
General

An Able Grape at the Helm of Twitter Search

While I am an avid Twitter user (and apparently a tradeable commodity in a “Fantasy Twitter” game that some friends are playing), regular readers know that I’ve offered mixed reviews of Twitter Search.

I’ve link-baited Summize founder and Twitter Chief Scientist Abdur Chowdhury here once or twice, but I understand that he’s no longer running Twitter Search. They’ve got a new guy, Doug Cook, as Director of Search.

This is great news, because Doug is someone who’s thought a lot about search and user experience. He was one of the early web search guys at Inktomi and also spent some time at Yahoo!, but what impresses me most is a project he’s pursued as a labor of love: Able Grape.

From their about page:

We’re a wine search engine — not for comparison shopping, but for learning and research. We aim to be the world’s most comprehensive, up-to-date, and authoritative source for online wine information.

Great, another vertical search engine, just what the world needs (unfortunately WordPress 2.8.4 doesn’t support sarcastic font). But seriously, Able Grape is worth a look, even if, like me, you are not a wine nerd. So wash your glasses and let’s have a quick tasting.

First off, Able Grape is not searching a proprietary document collection. Rather, it’s based on a focused crawl of “more than 38,000 sites and some 18 million pages.” In other words, Able Grape is in no position to ask anyone to add metadata. Even at the site level, I doubt Doug had the time to customize the handling of content for each of 38,000 sites. The point is, there’s enough scale here to make the problem interesting.

Now let’s look at some examples of the site in action. I’m a fan of Spanish wines, so I’ll start with one of their example queries, tempranillo. The first page of results looks relevant to the topic, but so far that doesn’t distinguish them from Google, Yahoo, or Bing. What surprises me is that the “Filter by Region” offers regions outside of Spain–like California and even New York! Yes, I might have learned some of that from Wikipedia–though it would not even have occurred to me to ask about non-Spanish Tempranillo. That’s exploratory search and serendipitous discovery for you!

Let’s try a different example, this time not from their list. I like Malbec wines (which I associate with my maternal link to Argentina), but the only local wine region for me is the North Fork of Long Island. So here’s a search for north fork malbec, filtered to Long Island. It certainly gives me ideas of which wineries to check out on my next trip there. Though, to be fair to the competition, G/Y/B all handle this query pretty well–though none of them offer refinement by region to disambiguate “north fork”.

Able Grape has lots of cool features, ranging from how they handle multilingual content to clever use of constrained “wildcard” terms like anyvariety to match any wine variety (aka varietal). I suspect that there is much to learn from its design that applies to a broad variety (sorry!) of search applications.

I’m a wine dilettante, so it’s hard for me to spend too much time on this site without any deep-seated information needs to fulfill. But I’m a card-carrying member of searchaholics anonymous (well, maybe not so anonymous), and I’m impressed by what Doug’s done with this vertical.

Which brings us back to Twitter Search. Director of Search for Twitter is a high-profile, high-pressure job, even without Facebook nipping at Twitter’s heels. I’m sure Able Grape will ferment for a while as Doug devotes his creative energies to improving Twitter Search. I certainly hope he brings the same focus and sensitivity to his new endeavor and makes Twitter Search a grand cru of search engines.

Categories
General

Lots of Search News Today!

I try not to write posts that are just cut-and-paste from Techmeme, but it’s hard to resist a trio like this:

OK, perhaps that last item isn’t strictly search news, but it may as well be, given that the microblogging wars are in no small part about “real time” search.

I’m not a huge Facebook fan (as those of you who have looked at my spartan profile page may have noticed), but I am curious about how they’re implementing search over their sprawling collection of content. I’m underwhelmed with my own search experience on the site, but that might be my own fault for not being an active Facebook participant. Perhaps folks here who are more active can share their own experiences.

As for the acquisition of FriendFeed, I’m surely in good company to assume this was Facebook’s second choice after the attempt to acquire Twitter fell through. If, as has been reported, Facebook only paid $50M for FriendFeed, then the acquisition was pocket change compared to the $500M they offered Twitter (granted, some or all of that being based on a controversial valuation of Facebook). Anyway, it should keep life interesting in the status-sphere.

And then there’s Google’s preview site, which you can try here. The only difference I see between it and the non-preview Google search is that the estimated result counts tend to be slightly higher. The top-ranked results seem almost identical, modulo tiny permutations for the queries I checked, as do related searches and any other features I tried. But apparently that’s the idea:

The new infrastructure sits “under the hood” of Google’s search engine, which means that most users won’t notice a difference in search results. But web developers and power searchers might notice a few differences, so we’re opening up a web developer preview to collect feedback.

Anyway, it’s more fun reading all of this stuff than hearing the CEO of one of the web’s great search brands proclaim that her company has never been a search company–or wondering where all the great search people I know there will land as Yahoo search is assimilated into Bing. Don’t get me wrong, I’m looking forward to the competition between Google and Microsoft–one that I think will finally be waged in earnest. But I’m still sad for Yahoo and its employees.

Which brings us to the last news item: Doug Cutting is leaving Yahoo for Cloudera, where he’ll continue to work on Hadoop. According to his blog post about it, “This move will not fundamentally change my day-to-day activities.” It will certainly be interesting to see what comes next from someone who has been instrumental to so many major open-source packages associated with search.

Categories
General

Norbert Fuhr’s Probability Ranking Principle for Interactive Information Retrieval

The other day, I was talking with Paul Thompson about the challenges of evaluating interactive information retrieval (IIR) systems, and he mentioned a paper that came up in discussion at the SIGIR 2009 workshop on Understanding the User: “A probability ranking principle for interactive information retrieval” by Norbert Fuhr–an update to the decades-old probability ranking principle (PRP) for information retrieval.

I was embarrassed to admit that I’d never seen or even heard of this paper, despite Nick Belkin citing it in the ECIR 2008 keynote that inspired my first blog post! In my defense, the citation is a single sentence that offers more of a tease than an explanation of the paper’s thesis.

Have no fear; I’ll offer more than a sentence here! I’ll summarize the paper and then offer my personal reaction. Let’s start with a few lines from the abstract:

In this paper, a new theoretical framework for interactive retrieval is proposed: The basic idea is that during IIR, a user moves between situations. In each situation, the system presents to the user a list of choices, about which s/he has to decide, and the first positive decision moves the user to a new situation. Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering of the choices can be derived – the PRP for IIR.

The first two sections of the paper provide an introduction and motivation. I’ll skip these, other than to excerpt the following:

Interactive retrieval consists of user actions of various types, and scanning through document lists for identifying the relevant entries is not the most crucial activity in interactive retrieval [Turpin & Hersh 01]. In contrast, other activities (like e.g. query reformulation) seem to be more ‘expensive’ from the user’s point of view.

It’s in the third section that Fuhr starts outlining his approach. He establishes three requirements that, in his view, a PRP for IIR must satisfy:

  • Consider the complete interaction process.
  • Allow for different costs and benefits of different activities.
  • Allow for changes of the information need.

He then makes four simplifying assumptions:

  • Focus on the functional level of interaction (i.e., ignore design, visualization).
  • Decisions are the major interaction activity.
  • Users evaluate choices in linear order.
  • Only positive, correct decisions are of benefit for a user.

With these assumptions in hand, Fuhr establishes an interaction model comprised of a sequence of “situations”:

a situation reflects the system state of the interactive search a user is performing…a situation consists of a list of choices the user has to evaluate…the first positive decision by the user will move him to another situation (depending on the choice he selected positively)…assume that there is always a last choice that will move him to another situation…the user’s information need does not change during a situation, knowledge is added only when switching to another situation due to a positive decision

In the fourth section, Fuhr describes a cost model for IIR. The key concept is that each choice incurs an effort, yields an expected benefit, and incurs an additional cost if the user has to backtrack. His overall framework is generic, but he offers a concrete “illustrating example” in which:

  • queries represent conjunctions of query terms
  • choices represent narrowing query refinements that add individual terms to the current query
  • the probability of the user choosing a given query term is proportional to that term’s frequency in the corpus
  • the benefit of an accepted choice is the log of the factor by which it reduces the result set
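Plugging made-up numbers into that illustrating example gives a feel for how the ranking works (the terms, frequencies, and effort cost below are all hypothetical, not taken from the paper):

```python
import math

# Toy corpus statistics: the current result set size, and how far each
# candidate refinement term would narrow it. All numbers are invented.
result_set_size = 10_000
term_stats = {            # term -> result set size after adding the term
    "bordeaux": 400,
    "vintage": 2_000,
    "tannin": 100,
}

total_freq = sum(term_stats.values())
effort_per_choice = 0.5   # assumed cost of evaluating one suggested term

def expected_benefit(term):
    """P(user accepts term) * benefit(term) - evaluation effort.

    Acceptance probability is proportional to the term's frequency, and the
    benefit of an accepted choice is the log of the factor by which it
    reduces the result set, mirroring the paper's illustrating example.
    """
    p_accept = term_stats[term] / total_freq
    benefit = math.log(result_set_size / term_stats[term])
    return p_accept * benefit - effort_per_choice

# Rank the candidate refinements by expected benefit -- the PRP-for-IIR idea:
ranking = sorted(term_stats, key=expected_benefit, reverse=True)
print(ranking)  # → ['vintage', 'bordeaux', 'tannin']
```

Note how “tannin” offers the biggest benefit if accepted, but its low acceptance probability (and the fixed effort of evaluating it) push it to the bottom of the ranking.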

Given such a cost model, the goal of an IIR system is now to maximize the user’s overall expected benefit, which in turn requires making tradeoffs at each choice. Fuhr offers an example to illustrate these tradeoffs:

As a simple example, assume that the system proposes some terms for query expansion. As one possibility, only the terms themselves are listed. Alternatively, for each term, the system could show a few example term occurrences in their context, thus giving the user some information about the usage of the term. The user effort per choice is lower in the first case, but the decisions will also be more error-prone.

The fifth section explains optimum ranking of choices according to the PRP. It has the most math, and I’ll refer the interested reader to the paper. The section that intrigued me the most was the sixth, entitled “Towards application”, in which Fuhr considers the various components of his generic framework and speculates about how existing and future research may supply the concrete details needed to instantiate it. Finally, Fuhr wraps up by citing related work and offering a conclusion and outlook.

That’s the summary–perhaps it will inspire some of you to read the whole paper. Personally, I’m intrigued by Fuhr’s model. I suspect I was trying to back into a similar approach in my post on “Ranked Set Retrieval” a few months ago. But, even if I’d tried to formalize my proposal, I don’t believe I would ever have arrived at a framework as general as Fuhr’s.

That said, Fuhr’s article is like a fancy dinner that leaves me hungry at the end. It’s a solid paper: general framework, suggestive examples, and good writing throughout. Nonetheless, it doesn’t give me what I really wanted: to feel closer either to a cost-effective evaluation methodology for IIR or to a tool that can be embedded in an IIR system to improve user experience. I know it’s a theory paper, and I should judge it on its own terms, but I was hoping to see more immediate applicability.

Given that Fuhr has published a fair amount of applied research, I’m hopeful about his future work in this area. I do appreciate the formal framework he has proposed, and I’d like to think that people are already looking at ways to build on that framework. Perhaps I’ll be one of them.

Categories
General

Public Expression, Liability, and Anonymity

A colleague just sent me a link to a story about a Twitter user being sued for a tweet. At least he’s not being sued in London.

I’m strongly if not absolutely in favor of freedom of expression, so it’s hard not to find such cases depressing. Nonetheless, I don’t think the legal landscape has changed.

Rather, what has changed (or accelerated) is that:

  • It is easier for people to express themselves publicly–and hence far more people are doing it.
  • The detached nature of online communication releases people’s inhibitions. Moreover, people not only don’t self-censor, but in some cases are deliberately provocative to attract attention.
  • The speed and efficiency of distribution (especially through search / alerts) means that the people most likely to be or feel damaged by an act of public expression are far more likely to discover that act.

So it’s not surprising that users are being sued for what they say online–it’s an expected consequence of the democratization of publishing, especially in the litigious English-speaking countries on both sides of the pond.

I’d personally like to see a higher bar for someone to initiate a defamation lawsuit–let alone win it–but I’m not holding my breath. Instead, I expect that we’ll see more anonymous expression by people who don’t feel the authenticity of disclosure justifies the risk of retaliation. Oh well.

Categories
Uncategorized

Will Browsers Ship With Ad Blockers?

A while ago, I wrote a post entitled “Think Evil” in which I mused that:

A few years ago, when it became clear that Microsoft was losing the search wars to Google–but when they hadn’t lost much browser market share to Firefox–I thought they should have used a scorched earth strategy of including an ad-blocker in Internet Explorer. The ad blocker would be on by default and would block all ads, including sponsored links from search engines. Actually, I can’t bring myself to consider this particular approach evil–from my perspective, the means would justify the end.

I guess I’m not the only person with such musings. In a post with the descriptive (if uncreative) title “In five years all browsers will block internet advertisements by default“, Orin Thomas argues:

People have become conditioned to accessing content for free on the Internet and people also don’t want to see advertisements on the Internet. At some point in the not too distant future, Ad blocking will become a necessary browser feature like Tabs are today. Any browser that does not include the feature will suffer a dramatic downturn in market share as people move to platforms that “block those darn advertisements”. Within five years, all browsers will block advertisements by default because, in the end, it is a feature that most people want.

I’d like to believe that he’s right, but I’m pretty sure I made similar claims at least five years ago, and I’m not aware of even a niche browser that ships with a built-in ad blocker.

I’m curious what readers think. Is it a matter of time before we see another arms race, like we had a few years ago over pop-up ads? Or, as one of the commenters responded to Thomas, is it just a matter of equilibrium, where advertisers produce ads that users don’t want to block?

Indeed, are we already at that equilibrium? Is the lack of traction for easily available ad blockers a sign that people don’t mind ads, and that the ad-supported ecosystem can easily afford to ignore outliers like me who religiously use Adblock Plus and CustomizeGoogle to block all ads?

Categories
Uncategorized

Guest Post: Rich Marr, Media As a Search Term

The following is a guest post by Rich Marr. Rich is the Director of Engineering at Pixsta, where he’s been working on Empora.com, a consumer-facing site that enables browsing of fashion products according to image similarity (much like Modista). Pixsta is a growing start-up focused on turning its R&D team’s ongoing search and image-processing work into workable products. The post is entirely his, with the exception of links that I have added so that readers can find company sites and Wikipedia entries.

It’s an often-heard urban myth that Eskimos have many words for snow, but that we only have one.  This idea rings true because there’s value in being able to make precise distinctions when dealing with something important to you.  You can find specialised vocabularies in cultures and sub-cultures all over the world, from surfers to stock brokers. When there’s value in describing something, you’ll usually find someone has created a word to do the job.

In search we often come across problems caused by insufficient vocabulary.  People have an inconvenient habit of describing things in different ways, and some types of document are just plain difficult to describe.

This vocabulary problem has spawned armies of semantic search start-ups that provide search results based on inferred meaning rather than keyword matching. But semantic systems are text-driven, which means there are still vocabulary problems. For example, you might overhear some lyrics and then use a search engine to look up who wrote the song–but how would you identify a piece of unknown instrumental music? Most people don’t have the vocabulary to describe music in a way that can identify it. This type of problem is addressed by search tools that use Media As a Search Term, which I’ll abbreviate to MAST.

MAST applications attempt to fill the vocabulary gap by extracting meaning from the query media. These apps break down into two rough groups, one concerned with identification (e.g. SnapTell, TinEye, MusicBrainz, Shazam, and the field of biometrics) and the other concerned with similarity search (e.g. Empora, Modista, Incogna, and Google Similar Images).

These apps use media-specific methods to interpret objects and extract data in a meaningful form for the given context.  Techniques used here include wavelet decomposition, Fourier transforms, machine learning techniques, and a whole load of good old-fashioned pixel scraping.

The interpreted data is then made available in a searchable index: usually either a vector space that judges similarity using distance, or a conventional search index containing domain-specific ‘words’ extracted from the media collection. Both of these indexing mechanisms are a known quantity to programmers, which leaves the main challenge as the extraction of useful meaning–conceptually similar to using natural language processing (NLP) to interpret text.
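The vector-space variant can be sketched in a few lines (the “signatures” below are invented three-number stand-ins for what a real system would extract with wavelets, Fourier transforms, or learned features):

```python
import math

# Hypothetical media "signatures": tiny feature vectors standing in for
# what real extraction pipelines would produce from images.
index = {
    "red_pump":   [0.9, 0.1, 0.4],
    "red_sandal": [0.8, 0.2, 0.5],
    "black_boot": [0.1, 0.9, 0.7],
}

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query_vec, index):
    """Rank indexed items by increasing distance to the query's signature."""
    return sorted(index, key=lambda name: euclidean(query_vec, index[name]))

# A query image whose extracted signature lands near the red shoes:
print(most_similar([0.88, 0.12, 0.42], index))
```

Everything interesting happens before this point–in how the signature is extracted–which is exactly why the next paragraph frames meaning extraction as the main challenge.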

The challenge of extracting useful meaning is based largely around establishing context, i.e. what exactly the user intends when they request an item’s identity, or want to see a ‘similar’ item. What properties of a song identify it as the same? Should live versions of the same song also match studio versions? Is the user more interested in the shape of a pair of shoes, or the colour, or the pattern?

Framed in the context of the difficulties of NLP, it’s clear that there’s not likely to be an immediate leap in the capabilities of these apps, but rather a gradual evolution. That said, these technologies are already good enough to surprise people, and they’re quickly finding commercial use, which adds more resources and momentum. As our researchers chip away at these big challenges, you’ll find MAST systems appearing in more and more places and becoming more and more important to the way people acquire and manage information.