Google’s Chief Economist Hal Varian Talks Stats 101

In an interview with CNET’s Tom Krazit, Google Chief Economist Hal Varian made a nice argument regarding the relative advantages of scale to a search engine:

On this data issue, people keep talking about how more data gives you a bigger advantage. But when you look at data, there’s a small statistical point that the accuracy with which you can measure things as they go up is the square root of the sample size. So there’s a kind of natural diminishing returns to scale just because of statistics: you have to have four times as big a sample to get twice as good an estimate.

Another point that I think is very important to remember…query traffic is growing at over 40 percent a year. If you have something that is growing at 40 percent a year, that means it doubles in two years.

So the amount of traffic that Yahoo, say, has now is about what Google had two years ago. So where’s this scale business? I mean, this is kind of crazy.

The other thing is, when we do improvements at Google, everything we do essentially is tested on a 1 percent or 0.5 percent experiment to see whether it’s really offering an improvement. So, if you’re half the size, well, you run a 2 percent experiment.

For those unfamiliar with statistics, I encourage you to look at the Wikipedia entry on standard deviation. Varian is obviously reducing the argument to a sound bite, but the sound bite rings true. More is better, but the returns diminish dramatically at the scale of either Microsoft or Google.
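To make the sound bite concrete, here’s a quick sketch (my own illustration, not Varian’s) of both points: the standard error of an estimate shrinks as one over the square root of the sample size, and 40 percent annual growth roughly doubles traffic in two years.

```python
import math

# Point 1: standard error scales as 1/sqrt(n), so quadrupling the sample
# only halves the error ("four times as big a sample to get twice as good
# an estimate").
sigma = 1.0  # population standard deviation (arbitrary units)
for n in (1_000, 4_000, 16_000):
    print(f"n = {n:>6}: standard error = {sigma / math.sqrt(n):.4f}")

# Point 2: traffic growing at 40% a year doubles in about two years,
# since 1.4^2 = 1.96.
print(f"two years of 40% growth: {1.4 ** 2:.2f}x")
```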

However, I do think there’s a big difference when you start talking about running lots of experiments on small subsets of your users. The ability to run twice as many simultaneous tests without noticeably disrupting overall user experience is a major competitive advantage. But even there, quality trumps quantity–how you choose what to test matters a lot more than how many tests you run.

What does strike me as ironic is that the moral here is a great counterpoint to Varian’s colleagues’ arguments about the “unreasonable effectiveness of data“. Granted, it’s apples and oranges–Alon Halevy, Peter Norvig, and Fernando Pereira are talking about data scale, not user scale. Still, the same arguments apply. Sampling is sampling.

P.S. Also check out Nick Carr’s commentary here.

UIE Virtual Seminar on Faceted Search

My colleague, Endeca co-founder Pete Bell, and I are giving a virtual seminar on faceted search next week for User Interface Engineering (UIE). It’s on Thursday, August 20th at 1:30 PM EST. The regular price is $129, but Noisy Channel readers who are interested in attending can get a $30 discount by using TUNKELANG (yes, all caps) as a promo code. Attendees also receive a free copy of my book, Faceted Search.

Whether or not you can attend, I do encourage you to check out the UIE site. It’s got a lot of free, useful content, and Jared Spool is definitely someone worth following if you are interested in web usability.

An Able Grape at the Helm of Twitter Search

While I am an avid Twitter user (and apparently a tradeable commodity in a “Fantasy Twitter” game that some friends are playing), regular readers know that I’ve offered mixed reviews of Twitter Search.

I’ve link-baited Summize founder and Twitter Chief Scientist Abdur Chowdhury here once or twice, but I understand that he’s no longer running Twitter Search. They’ve got a new guy, Doug Cook, as Director of Search.

This is great news, because Doug is someone who’s thought a lot about search and user experience. He was one of the early web search guys at Inktomi and also spent some time at Yahoo!, but what impresses me most is a project he’s pursued as a labor of love: Able Grape.

From their about page:

We’re a wine search engine — not for comparison shopping, but for learning and research. We aim to be the world’s most comprehensive, up-to-date, and authoritative source for online wine information.

Great, another vertical search engine, just what the world needs (unfortunately WordPress 2.8.4 doesn’t support sarcastic font). But seriously, Able Grape is worth a look, even if, like me, you are not a wine nerd. So wash your glasses and let’s have a quick tasting.

First off, Able Grape is not searching a proprietary document collection. Rather, it’s based on a focused crawl of “more than 38,000 sites and some 18 million pages.” In other words, Able Grape is in no position to ask anyone to add meta-data. Even at the site level, I doubt Doug had the time to customize the handling of content for each of 38,000 sites. In short, there’s enough scale here to make the problem interesting.

Now let’s look at some examples of the site in action. I’m a fan of Spanish wines, so I’ll start with one of their example queries, tempranillo. The first page of results looks relevant to the topic, but so far that doesn’t distinguish them from Google, Yahoo, or Bing. What surprises me is that the “Filter by Region” option offers regions outside of Spain–like California and even New York! Yes, I might have learned some of that from Wikipedia–though it would not even have occurred to me to ask about non-Spanish Tempranillo. That’s exploratory search and serendipitous discovery for you!

Let’s try a different example, this time not from their list. I like Malbec wines (which I associate with my maternal link to Argentina), but the only local wine region for me is the North Fork of Long Island. So here’s a search for north fork malbec, filtered to Long Island. It certainly gives me ideas of which wineries to check out on my next trip there. To be fair to the competition, G/Y/B all handle this query pretty well–though none of them offers refinement by region to disambiguate “north fork”.

Able Grape has lots of cool features, ranging from how they handle multilingual content to clever use of constrained “wildcard” terms like anyvariety to match any wine variety (aka varietal). I suspect that there is much to learn from its design that applies to a broad variety (sorry!) of search applications.
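As a thought experiment, here’s one way a constrained wildcard like anyvariety might work under the hood: the special token expands at query time into a disjunction over a curated vocabulary. This is purely my speculation–the names and mechanics below are illustrative, not Able Grape’s actual implementation.

```python
# Hypothetical expansion of constrained "wildcard" terms into OR-groups.
# The vocabulary below is a stand-in, not Able Grape's actual word list.
WILDCARDS = {
    "anyvariety": ["tempranillo", "malbec", "riesling", "syrah"],
}

def expand_query(query: str) -> str:
    """Replace each constrained wildcard with a disjunction of its vocabulary."""
    expanded = []
    for token in query.split():
        if token in WILDCARDS:
            expanded.append("(" + " OR ".join(WILDCARDS[token]) + ")")
        else:
            expanded.append(token)
    return " ".join(expanded)

print(expand_query("long island anyvariety"))
# long island (tempranillo OR malbec OR riesling OR syrah)
```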

I’m a wine dilettante, so it’s hard for me to spend too much time on this site without any deep-seated information needs to fulfill. But I’m a card-carrying member of searchaholics anonymous (well, maybe not so anonymous), and I’m impressed by what Doug’s done with this vertical.

Which brings us back to Twitter Search. Director of Search for Twitter is a high-profile, high-pressure job, even without Facebook nipping at Twitter’s heels. I’m sure Able Grape will ferment for a while as Doug devotes his creative energies to improving Twitter Search. I certainly hope he brings the same focus and sensitivity to his new endeavor and makes Twitter Search a grand cru of search engines.

Lots of Search News Today!

I try not to write posts that are just cut-and-paste from Techmeme, but it’s hard to resist a trio like this:

OK, perhaps that last item isn’t strictly search news, but it may as well be, given that the microblogging wars are in no small part about “real time” search.

I’m not a huge Facebook fan (as those of you who have looked at my spartan profile page may have noticed), but I am curious about how they’re implementing search over their sprawling collection of content. I’m underwhelmed with my own search experience on the site, but that might be my own fault for not being an active Facebook participant. Perhaps folks here who are more active can share their own experiences.

As for the acquisition of FriendFeed, I’m surely in good company in assuming this was Facebook’s second choice after the attempt to acquire Twitter fell through. If, as has been reported, Facebook only paid $50M for FriendFeed, then the acquisition was pocket change compared to the $500M they offered Twitter (granted, some or all of that was based on a controversial valuation of Facebook). Anyway, it should keep life interesting in the status-sphere.

And then there’s Google’s preview site, which you can try here. The only difference I see between it and the non-preview Google search is that the estimated result counts tend to be slightly higher. The top-ranked results seem almost identical, modulo tiny permutations for the queries I checked, as do related searches and any other features I tried. But apparently that’s the idea:

The new infrastructure sits “under the hood” of Google’s search engine, which means that most users won’t notice a difference in search results. But web developers and power searchers might notice a few differences, so we’re opening up a web developer preview to collect feedback.

Anyway, it’s more fun reading all of this stuff than hearing the CEO of one of the web’s great search brands proclaim that her company has never been a search company–or wondering where all the great search people I know there will land as Yahoo search is assimilated into Bing. Don’t get me wrong, I’m looking forward to the competition between Google and Microsoft–one that I think will finally be waged in earnest. But I’m still sad for Yahoo and its employees.

Which brings us to the last news item: Doug Cutting is leaving Yahoo for Cloudera, where he’ll continue to work on Hadoop. According to his blog post about it, “This move will not fundamentally change my day-to-day activities.” It will certainly be interesting to see what comes next from someone who has been instrumental in so many major open-source packages associated with search.

Norbert Fuhr’s Probability Ranking Principle for Interactive Information Retrieval

The other day, I was talking with Paul Thompson about the challenges of evaluating interactive information retrieval (IIR) systems, and he mentioned a paper that came up in discussion at the SIGIR 2009 workshop on Understanding the User: “A probability ranking principle for interactive information retrieval” by Norbert Fuhr–an update to the decades-old probability ranking principle (PRP) for information retrieval.

I was embarrassed to admit that I’d never seen or even heard of this paper, despite Nick Belkin citing it in the ECIR 2008 keynote that inspired my first blog post! In my defense, the citation is a single sentence that offers more of a tease than an explanation of the paper’s thesis.

Have no fear; I’ll offer more than a sentence here! I’ll summarize the paper and then offer my personal reaction. Let’s start with a few lines from the abstract:

In this paper, a new theoretical framework for interactive retrieval is proposed: The basic idea is that during IIR, a user moves between situations. In each situation, the system presents to the user a list of choices, about which s/he has to decide, and the first positive decision moves the user to a new situation. Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering of the choices can then be derived – the PRP for IIR.

The first two sections of the paper provide an introduction and motivation. I’ll skip these, other than to excerpt the following:

Interactive retrieval consists of user actions of various types, and scanning through document lists for identifying the relevant entries is not the most crucial activity in interactive retrieval [Turpin & Hersh 01]. In contrast, other activities (like e.g. query reformulation) seem to be more ‘expensive’ from the user’s point of view.

It’s in the third section that Fuhr starts outlining his approach. He establishes three requirements that, in his view, a PRP for IIR must satisfy:

  • Consider the complete interaction process.
  • Allow for different costs and benefits of different activities.
  • Allow for changes of the information need.

He then makes four simplifying assumptions:

  • Focus on the functional level of interaction (i.e., ignore design, visualization).
  • Decisions are the major interaction activity.
  • Users evaluate choices in linear order.
  • Only positive, correct decisions are of benefit for a user.

With these assumptions in hand, Fuhr establishes an interaction model comprising a sequence of “situations”:

a situation reflects the system state of the interactive search a user is performing…a situation consists of a list of choices the user has to evaluate…the first positive decision by the user will move him to another situation (depending on the choice he selected positively)…assume that there is always a last choice that will move him to another situation…the user’s information need does not change during a situation, knowledge is added only when switching to another situation due to a positive decision

In the fourth section, Fuhr describes a cost model for IIR. The key concept is that each choice requires effort to evaluate, yields an expected benefit, and incurs an additional cost if the user has to backtrack. His overall framework is generic, but he offers a concrete “illustrating example” in which:

  • queries represent conjunctions of query terms
  • choices represent narrowing query refinements that add individual terms to the current query
  • the probability of the user choosing a given query term is proportional to that term’s frequency in the corpus
  • the benefit of an accepted choice is the log of the factor by which it reduces the result set

Given such a cost model, the goal of an IIR system is now to maximize the user’s overall expected benefit, which in turn requires making tradeoffs at each choice. Fuhr offers an example to illustrate these tradeoffs:

As a simple example, assume that the system proposes some terms for query expansion. As one possibility, only the terms themselves are listed. Alternatively, for each term, the system could show a few example term occurrences in their context, thus giving the user some information about the usage of the term. The user effort per choice is lower in the first case, but the decisions will also be more error-prone.
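To make the cost model concrete, here’s a minimal sketch of my reading of the illustrating example: candidate refinements are scored by acceptance probability times benefit, per unit of effort. The scoring rule is my simplification, not Fuhr’s actual optimality criterion, which you’ll find in the paper itself.

```python
import math

def rank_refinements(result_count, refinements, effort=1.0):
    """Rank narrowing refinements by expected benefit per unit of effort.

    refinements maps a candidate query term to the size of the result set
    after adding that term to the current conjunctive query.
    """
    total = sum(refinements.values())
    scored = []
    for term, narrowed in refinements.items():
        p_accept = narrowed / total                  # proportional to term frequency
        benefit = math.log(result_count / narrowed)  # log of the reduction factor
        scored.append((p_accept * benefit / effort, term))
    return sorted(scored, reverse=True)

# e.g., 10,000 current results and three candidate refinement terms
print(rank_refinements(10_000, {"rioja": 400, "oak": 2_500, "reserva": 900}))
```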

The fifth section explains the optimum ranking of choices according to the PRP. It has the most math, and I’ll refer the interested reader to the paper. The section that intrigued me the most was the sixth, entitled “Towards application”, in which Fuhr considers the various components of his generic framework and speculates on how existing and future research might supply the concrete details needed to instantiate it. Finally, Fuhr wraps up by citing related work and offering a conclusion and outlook.

That’s the summary–perhaps it will inspire some of you to read the whole paper. Personally, I’m intrigued by Fuhr’s model. I suspect I was trying to back into a similar approach in my post on “Ranked Set Retrieval” a few months ago. But, even if I’d tried to formalize my proposal, I don’t believe I would have ever arrived at a framework as general as Fuhr’s.

That said, Fuhr’s article is like a fancy dinner that leaves me hungry at the end. It’s a solid paper: a general framework, suggestive examples, and solid writing throughout. Nonetheless, it doesn’t give me what I really wanted: to feel closer to either a cost-effective evaluation methodology for IIR or a tool that can be embedded into an IIR system to improve the user experience. I know it’s a theory paper, and I should judge it on its own terms, but I was hoping to see more immediate applicability.

Given that Fuhr has published a fair amount of applied research, I’m hopeful about his future work in this area. I do appreciate the formal framework he has proposed, and I’d like to think that people are already looking at ways to build on that framework. Perhaps I’ll be one of them.

Public Expression, Liability, and Anonymity

A colleague just sent me a link to a story about a Twitter user being sued for a tweet. At least he’s not being sued in London.

I’m strongly if not absolutely in favor of freedom of expression, so it’s hard not to find such cases depressing. Nonetheless, I don’t think the legal landscape has changed.

Rather, what has changed (or accelerated) is that:

  • It is easier for people to express themselves publicly–and hence far more people are doing it.
  • The detached nature of online communication lowers people’s inhibitions. Moreover, people not only don’t self-censor, but in some cases are deliberately provocative to attract attention.
  • The speed and efficiency of distribution (especially through search / alerts) means that the people most likely to be or feel damaged by an act of public expression are far more likely to discover that act.

So it’s not surprising that users are being sued for what they say online–it’s an expected consequence of the democratization of publishing, especially in the litigious English-speaking countries on both sides of the pond.

I’d personally like to see a higher bar for someone to initiate a defamation lawsuit–let alone win one–but I’m not holding my breath. Instead, I expect that we’ll see more anonymous expression by people who don’t feel the authenticity of disclosure justifies the risk of retaliation. Oh well.

Will Browsers Ship With Ad Blockers?

A while ago, I wrote a post entitled “Think Evil” in which I mused that:

A few years ago, when it became clear that Microsoft was losing the search wars to Google–but when they hadn’t lost much browser market share to Firefox–I thought they should have used a scorched earth strategy of including an ad-blocker in Internet Explorer. The ad blocker would be on by default and would block all ads, including sponsored links from search engines. Actually, I can’t bring myself to consider this particular approach evil–from my perspective, the means would justify the end.

I guess I’m not the only person with such musings. In a post with the descriptive (if uncreative) title “In five years all browsers will block internet advertisements by default“, Orin Thomas argues:

People have become conditioned to accessing content for free on the Internet and people also don’t want to see advertisements on the Internet. At some point in the not too distant future, Ad blocking will become a necessary browser feature like Tabs are today. Any browser that does not include the feature will suffer a dramatic downturn in market share as people move to platforms that “block those darn advertisements”. Within five years, all browsers will block advertisements by default because, in the end, it is a feature that most people want.

I’d like to believe that he’s right, but I’m pretty sure I made similar claims at least five years ago, and I’m not aware of even a niche browser that ships with a built-in ad blocker.

I’m curious what readers think. Is it a matter of time before we see another arms race, like we had a few years ago over pop-up ads? Or, as one of the commenters responded to Thomas, is it just a matter of equilibrium, where advertisers produce ads that users don’t want to block?

Indeed, are we already at that equilibrium? Is the lack of traction for easily available ad blockers a sign that people don’t mind ads, and that the ad-supported ecosystem can easily afford to ignore outliers like me who religiously use Adblock Plus and CustomizeGoogle to block all ads?

Guest Post: Rich Marr, Media As a Search Term

The following is a guest post by Rich Marr. Rich is the Director of Engineering at Pixsta, where he’s been working on Empora.com, a consumer-facing site that enables browsing of fashion products according to image similarity (much like Modista). Pixsta is a growing start-up focused on turning its R&D team’s ongoing search and image processing work into workable products. The post is entirely his, with the exception of links that I have added so that readers can find company sites and Wikipedia entries.

It’s an often-heard urban myth that Eskimos have many words for snow, but that we only have one.  This idea rings true because there’s value in being able to make precise distinctions when dealing with something important to you.  You can find specialised vocabularies in cultures and sub-cultures all over the world, from surfers to stock brokers. When there’s value in describing something, you’ll usually find someone has created a word to do the job.

In search we often come across problems caused by insufficient vocabulary.  People have an inconvenient habit of describing things in different ways, and some types of document are just plain difficult to describe.

This vocabulary problem has spawned armies of semantic search start-ups that provide search results based on inferred meaning rather than keyword matching. But semantic systems are text-driven, which means there are still vocabulary problems. For example, you might overhear some lyrics and then use a search engine to look up who wrote the song, but how would you identify a piece of unknown instrumental music? Most people don’t have the vocabulary to describe music in a way that can identify it. This type of problem is addressed by search tools that use Media As a Search Term, which I’ll abbreviate to MAST.

MAST applications attempt to fill the vocabulary gap by extracting meaning from the query media. These apps break down into two rough groups, one concerned with identification (e.g. SnapTell, TinEye, MusicBrainz, Shazam, and the field of biometrics) and the other concerned with similarity search (e.g. Empora, Modista, Incogna, and Google Similar Images).

These apps use media-specific methods to interpret objects and extract data in a meaningful form for the given context. Techniques used here include wavelet decomposition, Fourier transforms, machine learning, and a whole load of good old-fashioned pixel scraping.

The interpreted data is then made available in a searchable index: usually either a vector space that judges similarity by distance, or a conventional search index containing domain-specific ‘words’ extracted from the media collection. Both of these indexing mechanisms are a known quantity to programmers, which leaves the main challenge as the extraction of useful meaning, conceptually similar to using natural language processing (NLP) to interpret text.
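For readers who want something tangible, here’s a minimal sketch of the vector-space flavor of indexing: each media item is reduced offline to a feature vector, and ‘similar’ simply means ‘nearby by distance’. The item names and feature values are made up; real extractors use the wavelet, Fourier, and machine-learning techniques mentioned above.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query_vec, index, k=3):
    """Brute-force k-nearest-neighbor search over an in-memory index."""
    return sorted(index, key=lambda item: euclidean(query_vec, item[1]))[:k]

# index: (item_id, feature_vector) pairs extracted offline from the media
index = [("shoe_1", [0.90, 0.10, 0.30]),
         ("shoe_2", [0.20, 0.80, 0.50]),
         ("shoe_3", [0.85, 0.15, 0.40])]
print(most_similar([0.88, 0.12, 0.35], index, k=2))
```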

The challenge of extracting useful meaning centers largely on establishing context, i.e. what exactly the user intends when they request an item’s identity or want to see a ‘similar’ item. What properties of a song identify it as the same? Should live versions of the same song also match studio versions? Is the user more interested in the shape of a pair of shoes, or the colour, or the pattern?

Framed in the context of the difficulties of NLP, it’s clear that there’s unlikely to be an immediate leap in the capabilities of these apps, but rather a gradual evolution. That said, these technologies are already good enough to surprise people, and they’re quickly finding commercial use, which adds more resources and momentum. As our researchers chip away at these big challenges you’ll find MAST systems appearing in more and more places and becoming more and more important to the way people acquire and manage information.

Reminder: HCIR 2009 Submission Deadline is August 24th!

Just a quick reminder that the submission deadline for HCIR 2009, the Third Annual Workshop on Human-Computer Interaction and Information Retrieval, is August 24th, which is just 3 weeks away! Please spread the word; I know people can be forgetful during the summer months. The workshop itself will be held on October 23rd, at Catholic University in Washington, DC.

SIGIR 2009: Day 3, Industry Track: Vendor Panel

The last session of the SIGIR 2009 Industry Track was the enterprise search vendor panel. Originally, I’d hoped to have CTOs (or the equivalent) from Autonomy, Endeca, and FAST–specifically, Peter Menell (CTO of Autonomy), Adam Ferrari (CTO of Endeca), and Bjørn Olstad (formerly CTO of FAST, now a Microsoft Distinguished Engineer).

Since it would have been inappropriate for me to moderate a panel that included my own manager and representatives of two of Endeca’s competitors, I recruited Liz Liddy, who is not only the chair of SIGIR, but also someone I felt was uniquely qualified to understand both the research and business sides of this field. As if that wasn’t enough, James Allan managed to procure Bruce Croft, whose volume of achievements includes both a Salton Award and a Research in Information Science Award, the highest honors in information retrieval and information science. And I recruited our very own commenter-in-chief Jeremy Pickens to serve as a time-keeper. I wish I’d also enlisted him for the panel I moderated!

That, at least, was the plan. Plans, of course, are subject to change. Only a couple of days after I reached out to Bjørn, I saw that he had been promoted to replace former FAST CEO John Markus Lervik. He suggested Øystein Torbjørnsen, chief architect of FAST’s core search. Fine by me. Two weeks before the conference, however, Bjørn wrote to inform me that Øystein had to back out for personal reasons. But he offered senior product manager (the acquisition shaved a notch off of his former title of VP of Product Management) Jeff Fried as a substitute. All good.

Fortunately, I knew that I could trust my manager to live up to his commitments. I did have to fill out his online registration form on site at 7:30am, but he more than made up for it by buying me drinks after the conference. Two down, one to go.

Then there was Autonomy. Strangely, despite having worked in this space for nearly a decade, I’d never actually met anyone from Autonomy. That meant I’d have to cold-call someone to have any chance of getting them to participate in the panel. Who to call? I decided that, since industry conference participation was probably under the umbrella of corporate marketing, I’d try their head of PR, one of the few Autonomy employees whose contact information is published on the open web.

Success! Or so I thought at the time. She told me that Peter Menell would participate in the panel, and, with my roster complete, I proceeded to start publicizing the Industry Track. However, a week and a half before the conference, she called me to let me know that Peter couldn’t attend, and that in fact no one from Autonomy was available to take his place.

I was on vacation at the time, but of course conference organizers don’t get to take vacation, at least not when panelists cancel at the last minute. I started mulling over my options.

Including an empty chair on the panel would have given me the satisfaction of exposing Autonomy’s snub (both to me personally and to the information retrieval community), but I realized that wouldn’t be at all fair to the attendees.

Instead, I started going through the short list of possible panelists, favoring those who lived in the Boston area. And then, by fortuitous coincidence, I received an email from Raul Valdes-Perez, Executive Chairman and co-founder of Vivisimo: “If by any chance you need any last-minute help or stand-in, let me know.” I made a mental note to investigate Raul as a possible psychic and then happily welcomed him on board. And breathed a sigh of relief.

For all of the upheaval in bringing the panel together, the actual session went like clockwork. The panelists were respectful, disciplined, and yet not afraid to take risks in their stances. They talked about the challenges of holistically evaluating search applications, the interplay between relevance ranking and faceted search, the pros and cons of federated search, and much more. HCIR was a major theme, though perhaps that’s not surprising given the participants. As usual, Mary McKenna took more detailed notes.

I take pride both in the overall quality of the panel and specifically in the performance of my manager, whose background is more in systems and databases than in information retrieval. Of course, I’m biased, and he does sign my expense forms. 🙂

Regardless, I think I can muster enough objectivity to say that the panel was a huge success. Bruce Croft’s response, which also ended the Industry Track, was a fitting valedictory address: he urged us not to just walk away from the conversation. I am proud of the success of the Industry Track as an event, but I hope it is only the beginning of a deeper mingling of researchers and practitioners.

To James Allan and Jay Aslam, the SIGIR 2009 co-chairs, I thank you for the privilege and opportunity to organize the Industry Track. And, to whoever takes on such a responsibility at future SIGIR conferences, I hope you can benefit from my experience without having to relive it!