The Noisy Channel

 

Canonical URLs and Faceted Search

February 13th, 2009 · 1 Comment · General

Big news from Google, Yahoo, and Microsoft: the three web search leaders announced yesterday that they will jointly support a standard by which a web page can indicate the address of its “canonical” version. By using this standard, a site can keep search engines from indexing duplicate copies of its pages, a problem that hurts how well those pages are indexed and is therefore costly from an SEO perspective.

You can find coverage at:

This is a great development for everyone, but especially for anyone building sites that use faceted search (which should be everyone!). One of the problems we identified early on at Endeca is that faceted search, if implemented naively, can lead to massive duplication of URLs. The whole point of a faceted information architecture is that there are many paths that lead to a given product or document page.

For example, consider a page that is associated with values from 10 facets. There may be 10! = 3,628,800 ways to reach it–and that’s assuming that none of the facets are hierarchical. In fairness, it also assumes that none of the paths contract from implicit selection. Regardless, the number of paths is large enough to be a problem for SEO if each path receives its own URL.
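To make the arithmetic concrete, here is a minimal sketch that counts and enumerates the orderings in which a set of facet selections could be applied. It assumes, as above, flat (non-hierarchical) facets and no implicit selection; the function names and the sample facet names are made up for illustration.

```python
from itertools import permutations
from math import factorial


def count_paths(num_facets: int) -> int:
    """Number of distinct orderings in which num_facets selections can be made."""
    return factorial(num_facets)


def enumerate_paths(facets):
    """List every ordering of the given facet selections (only sensible for small inputs)."""
    return list(permutations(facets))


if __name__ == "__main__":
    print(count_paths(10))
    # Hypothetical facet names, just to show the orderings for a small case.
    for path in enumerate_paths(["brand", "color", "size"]):
        print(" > ".join(path))
```

Even three facets already give six paths to the same page; at ten facets, enumerating them is no longer something you would want a crawler to do.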

Endeca recognized this problem a while ago, and addressed it through what we call “URL beautification”–our own means of canonicalizing URLs that, in addition to deduping the multiple paths, has the side benefit of creating URLs that are SEO-friendly.
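As a rough illustration of the general idea (and not a description of how Endeca's URL beautification actually works), one simple way to collapse the many paths is to put the facet selections into a fixed order before building the URL, and to emit that URL in the canonical link element the search engines announced. The sketch below assumes facets are encoded as ordinary query-string parameters; the function names and the example.com URLs are hypothetical.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Return the URL with its query parameters sorted, so every ordering
    of the same facet selections maps to a single address."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # Drop the fragment; it plays no role in identifying the page here.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for a page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="{}" />'.format(href)


if __name__ == "__main__":
    a = "http://example.com/shoes?color=red&brand=acme"
    b = "http://example.com/shoes?brand=acme&color=red"
    print(canonical_url(a) == canonical_url(b))
    print(canonical_link_tag(a))
```

Sorting only dedupes orderings of the same selections. It does nothing about the number of distinct combinations of selections, which a real implementation would have to handle separately.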

Nonetheless, my colleagues and I are delighted to see the major web search engines recognizing this problem and making it easier for everyone to solve it. It’s a rare day to see Google, Yahoo, and Microsoft working together, but it’s nice when it happens. Good thing they got the news out before “Be Evil” day!

1 response so far ↓

  • 1 Jaimie Sirovich // May 5, 2011 at 5:44 pm

    I just noticed this blog post years later, and I’m not sure it tells the whole story.

    Isn’t canonicalization alone insufficient here? Even if you canonicalize a spider trap, a robot can’t (or won’t, if you’re not important enough) spider a spider trap.

    Implicit selection is also only going to happen if it’s a single selection interface. Along those lines, it’s probably best just to exclude multiple selection entirely from the view of search engines. It’s just too huge a space.

    Canonicalizing a spider trap is like saying “see all this stuff you spidered? Yeah, it’s useless.” Isn’t it? I’m assuming Endeca sorts the filters when they beautify URLs.

    Canonicalizing solves small duplicate content problems — like deduping a product in N categories. Facets introduce powersets and factorials, and that’s a bigger challenge to throw at it. Perhaps too big, no?
