Categories
General

Life, the Universe, and SEO Revisited

A couple of years ago, I wrote a post entitled “Life, the Universe, and SEO” in which I considered Google’s relationship with the search engine optimization (SEO) industry. Specifically, I compared it to the relationship that Deep Thought, the computer in Douglas Adams’s Hitchhiker’s Guide to the Galaxy, has with the Amalgamated Union of Philosophers, Sages, Luminaries and Other Thinking Persons.

Interestingly, both SEO and union protests have been front-page news of late. I’ll focus on the former.

Three recent incidents brought mainstream attention to the SEO industry:

  • Two weeks ago, Google head of web spam Matt Cutts told the New York Times that Google was engaging in a “corrective action” that penalized retailer J. C. Penney’s search results because the company had engaged in SEO practices that violated Google’s guidelines. For months before the action (which included the holiday season), J. C. Penney was performing exceptionally well in a broad collection of Google searches, including such queries as [dresses], [bedding], [area rugs], [skinny jeans], [home decor], [comforter sets], [furniture], [tablecloths], and [grommet top curtains]. As I write this blog post, I do not see results from jcpenney.com on the first result page for any of these search queries.
  • This past Thursday, online retailer Overstock.com reported to the Wall Street Journal that Google was penalizing them because of Overstock’s now-discontinued practice of rewarding students and faculty with discounts in exchange for linking to Overstock pages from their university web pages. Before the penalty, these links were helping Overstock show up at the top of result sets for queries like [bunk beds] and [gift baskets]. As I write this blog post, I do not see results from overstock.com on the first result page for either of these search queries.
  • That same day, Google announced, via an official blog post by Amit Singhal (Google’s head of core ranking) and Matt Cutts, a change that, according to their analysis, noticeably impacts 11.8% of Google search queries. In their words: “This update is designed to reduce rankings for low-quality sites—sites which are low-value add for users, copy content from other websites or sites that are just not very useful. At the same time, it will provide better rankings for high-quality sites—sites with original content and information such as research, in-depth reports, thoughtful analysis and so on.”

Of course, Google is always working to improve search quality and stay at least one step ahead of those who attempt to reverse-engineer and game its ranking of results. But it’s quite unusual to see so much public discussion of ranking changes in such a short time period.

Granted, there is a growing chorus in the blogosphere bemoaning the decline of Google’s search quality. Much of it has focused on the “content farms” that seem to be the target of Google’s latest update. Perhaps Google’s new public assertiveness is a reaction to what it sees as unfair press. Indeed, Google’s recent public spat with Bing would be consistent with a more assertive PR stance.

But what I find most encouraging is Google’s recent release of a Chrome browser extension that allows users to create personal site blocklists that are reported to Google. Some may see this as a reincarnation of SearchWiki, an ill-conceived and short-lived feature that allowed searchers to annotate and re-order results. But filtering out entire sites for all searches offers users a much greater return on investment than demoting individual results for specific searches.
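To make the distinction concrete, here is a minimal sketch of the blocklist idea: given a ranked result list, drop every result whose host is on the user’s blocklist. (The extension’s actual mechanics are Google’s; the data model and function here are invented for illustration.)

```python
from urllib.parse import urlparse

def apply_blocklist(results, blocklist):
    """Drop any result whose hostname appears on the user's blocklist.

    results   -- ranked list of result URLs
    blocklist -- set of hostnames the user never wants to see
    """
    return [r for r in results if (urlparse(r).hostname or "") not in blocklist]

results = [
    "https://example-content-farm.com/10-tips",
    "https://stackoverflow.com/questions/123",
]
blocklist = {"example-content-farm.com"}
print(apply_blocklist(results, blocklist))
# → ['https://stackoverflow.com/questions/123']
```

Unlike SearchWiki’s per-query re-ordering, a single blocklist entry pays off on every future search, which is why the return on the user’s effort is so much higher.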

Of course, I’d love to see user control taken much further. And I wonder if efforts like personal blocklists are the beginning of Amit offering me a more positive answer to the question I asked him back in 2008 about relevance approaches that relied on transparent design rather than obscurity.

I’m a realist: I recognize that many site owners are competing for users’ attention, that most users are lazy, and that Google wants to optimize search quality subject to these constraints. I also don’t think that anyone today threatens Google with the promise of better search quality (and yes, I’ve tried Blekko).

Perhaps the day is in sight when human-computer information retrieval (HCIR) offers a better alternative to the organization of web search results than the black-box ranking that fuels the SEO industry. But I’ve been waiting for that long enough not to hold my breath. Instead, I’m encouraged to see a growing recognition that today’s approaches are an endless game of Whac-A-Mole, and I’m delighted that at least one of the improvements on the table takes a realistic approach to putting more power in the hands of users.


Life’s a Beach

Heading to Punta Cana for a week. Feel free to keep writing great comments — will catch up when I get back!


Google vs. Bing: A Tweetle Beetle Battle Muddle

Unless you’ve been living in a cone of silence, you’ve probably heard about the epic war of words between Google and Bing. But just in case, here’s a quick summary:

Amit Singhal, Google Fellow: “Microsoft’s Bing uses Google search results—and denies it”:

Bing is using some combination of:

or possibly some other means to send data to Bing on what people search for on Google and the Google search results they click. Those results from Google are then more likely to show up on Bing. Put another way, some Bing results increasingly look like an incomplete, stale version of Google results—a cheap imitation.

Harry Shum, Corporate Vice President, Bing: “Thoughts on search quality”:

We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.

Yusuf Mehdi, Senior Vice President, Online Services Division, Bing: “Setting the record straight”:

Google engaged in a “honeypot” attack to trick Bing. In simple terms, Google’s “experiment” was rigged to manipulate Bing search results through a type of attack also known as “click fraud.” That’s right, the same type of attack employed by spammers on the web to trick consumers and produce bogus search results. What does all this cloak and dagger click fraud prove? Nothing anyone in the industry doesn’t already know. As we have said before and again in this post, we use click stream optionally provided by consumers in an anonymous fashion as one of 1,000 signals to try and determine whether a site might make sense to be in our index.

Matt Cutts, Head of Webspam, Google: “My thoughts on this week’s debate”:

Something I’ve heard smart people say is that this could be due to generalized clickstream processing rather than code that targets Google specifically. I’d love if Microsoft would clarify that, but at least one example has surfaced in which Microsoft was targeting Google’s urls specifically. The paper is titled Learning Phrase-Based Spelling Error Models from Clickthrough Data and here’s some of the relevant parts:

The clickthrough data of the second type consists of a set of query reformulation sessions extracted from 3 months of log files from a commercial Web browser [I assume this is Internet Explorer. –Matt] …. In our experiments, we “reverse-engineer” the parameters from the URLs of these [query formulation] sessions, and deduce how each search engine encodes both a query and the fact that a user arrived at a URL by clicking on the spelling suggestion of the query – an important indication that the spelling suggestion is desired. From these three months of query reformulation sessions, we extracted about 3 million query-correction pairs.

This paper very much sounds like Microsoft reverse engineered which specific url parameters on Google corresponded to a spelling correction. Figure 1 of that paper looks like Microsoft used specific Google url parameters such as “&spell=1” to extract spell corrections from Google. Targeting Google deliberately is quite different than using lots of clicks from different places.
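The technique Matt describes is easy to picture: given two consecutive search URLs from a clickstream, check whether the second carries a spelling-correction marker and, if so, record the (original, corrected) query pair. This sketch is purely illustrative — the `spell=1` parameter comes from the quoted paper, and the function and session format here are invented, not any documented pipeline.

```python
from urllib.parse import urlparse, parse_qs

def extract_correction_pair(first_url, followup_url):
    """Return (original query, corrected query) if the follow-up search URL
    carries a spelling-correction marker (per the paper, something like
    spell=1 in Google's URLs); otherwise return None."""
    q1 = parse_qs(urlparse(first_url).query).get("q", [None])[0]
    followup_params = parse_qs(urlparse(followup_url).query)
    q2 = followup_params.get("q", [None])[0]
    if followup_params.get("spell") == ["1"] and q1 and q2 and q1 != q2:
        return (q1, q2)
    return None

pair = extract_correction_pair(
    "https://www.google.com/search?q=speling",
    "https://www.google.com/search?q=spelling&spell=1",
)
# pair is ("speling", "spelling")
```

Run over three months of sessions, a harvester like this would yield exactly the kind of query-correction corpus the paper reports — which is why targeting one engine’s URL parameters reads so differently from generic clickstream mining.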

Let me start by saying that these are very serious words from very serious people.

Amit and Matt, both of whom I know personally, are not just two of the most prominent Google employees — they have a deep personal investment in Google’s search quality. Amit is personally responsible for much of Google’s web search ranking algorithm, and Matt is surely the person whom spammers (and many SEO consultants) most love to hate. There is no question in my mind that the emotion both of them are expressing is sincere.

I haven’t met Harry or Yusuf, but I have no reason to doubt their sincerity — especially since everything they are saying seems consistent with the facts, and indeed with the substantive parts of Google’s allegations. The facts don’t really seem to be in dispute. And more generally, I’ve met some of the folks who lead the Bing team (like Jan Pedersen), and, like Matt, I believe they are thoughtful and sincere and devoted to building a great search engine of their own.

The debate is not about the facts. Rather, it’s about what is right and wrong. I will try to summarize the two sides’ positions without editorializing.

Bing is claiming that:

  • Users have a right to do as they please with their own clickthrough data, which includes data from Google search sessions.
  • Bing toolbar users opted in to share this clickthrough data with Bing.
  • By using this clickthrough data, Bing creates value for users.

Google is claiming that:

  • Bing’s specific targeting of Google clickthrough data amounts to copying Google and is wrong.
  • Bing toolbar users are not necessarily aware that they are complicit in this behavior.
  • Bing is disingenuous in understating how much it benefits from Google as a signal.

What do I think?

I agree with Bing that users have the right to do as they please with clickthrough data. I’d think Google would agree too, given that Google wrote the sermon on “the meaning of open”:

Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

I agree with all three of the points I listed as Google’s claims, except for the claim that Bing’s behavior is wrong. It’s up to users whether they want to help Bing compete with Google. Do users know that they’re doing so? Probably not. But would they stop doing so if they did? I doubt it. I can’t see why most users would have a dog in this fight — and in fact, it may be in users’ interest to help Bing be more competitive.

I do think Bing should be forthright about what it is doing — and how much this user-provided data from Google search sessions is contributing to its own quality improvements. Bing can, of course, keep this information secret, but I’d think that Bing would want to defend its reputation as an innovator — especially as the David in a David vs. Goliath fight.

But I also think that Google should be careful with its accusations. Accusing Bing of not being innovative is one thing, and that accusation, backed by concrete examples, is probably enough to score points. But implying that Google owns its users’ clickthrough data and that Bing has no right to solicit that data from users is another thing entirely.

I’m curious to hear what others here think. It’s been a while since I could freely express opinions about Google and Bing, so I’m delighted to have such a hot controversy to incite discussion. Because everyone enjoys a muddle puddle tweetle poodle beetle noodle bottle paddle battle!


Got Skills?

Last October, a certain blogger said:

LinkedIn needs to implement some kind of concept extraction to provide a useful topic facet (something I’d also love to see for their regular people search). This is a challenging information extraction problem, especially for the open web, but I also know from experience that it is tractable within a domain. Given LinkedIn’s professional focus, I believe this is a problem they can and should tackle.

Shortly after writing that post, I interviewed at LinkedIn and met Pete Skomoroch, who showed me an early preview of the work his team was doing to make skills a facet for exploring the space of LinkedIn member profiles. That demo made a strong impression on me, giving me a taste of the great products LinkedIn’s data scientists were working on in the lab.

And now I’m delighted that everyone can try out the beta launch of LinkedIn Skills, which was announced today at O’Reilly’s Strata 2011 conference on Big Data.

As Pete says in his blog post:

If you search for a particular skill, we’ll surface key people within that community, show you the top locations, related companies, relevant jobs, and groups where you can interact with like-minded professionals. You’ll also be able to explore similar skills and compare their growth relative to each other.

I encourage you to check it out — whether you’re looking for experts on Hadoop, cheese, or anything else! It’s a beta, so I’m sure you’ll find rough edges; but I hope it gives you a sense of how LinkedIn’s data can enable an incredibly powerful and useful exploratory search experience.
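The faceted exploration Pete describes can be sketched in a few lines: filter profiles by a skill, then count the values of the other facets among the matches. The profile schema below is a toy I made up for illustration — it is not LinkedIn’s data model or API.

```python
from collections import Counter

# Toy profiles; the fields are hypothetical, not LinkedIn's schema.
profiles = [
    {"name": "Ana",  "skills": {"hadoop", "java"},   "location": "SF Bay Area"},
    {"name": "Bo",   "skills": {"hadoop", "python"}, "location": "NYC"},
    {"name": "Cory", "skills": {"cheese"},           "location": "SF Bay Area"},
]

def facet_counts(skill):
    """Filter profiles by one skill, then tally the remaining facets."""
    matches = [p for p in profiles if skill in p["skills"]]
    return {
        "people": [p["name"] for p in matches],
        "locations": Counter(p["location"] for p in matches),
        "related_skills": Counter(
            s for p in matches for s in p["skills"] if s != skill
        ),
    }

print(facet_counts("hadoop"))
```

Each facet count doubles as a navigation target — clicking “NYC” just adds a second filter and re-tallies — which is the essence of the exploratory experience.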

No forward-looking statements, except to say that it only gets better from here!