SIGIR 2009: Day 3, Industry Track: Matt Cutts

At last we arrive at the SIGIR 2009 Industry Track. Since I organized this track (which mainly involved coming up with a program and then actually producing the speakers), I’m not exactly an impartial observer. But hopefully the organizers of future industry tracks will benefit from my perspective as an organizer.

Last December (New Year’s Eve, to be precise), I started recruiting speakers. I started with a list of topics I wanted to see covered, and one of those topics was spam / adversarial information retrieval. My top two choices were Matt Cutts and Amit Singhal, both members of the Search Quality group at Google. I’d heard Amit speak before: he delivered one of the keynotes at ECIR 2008 (and inspired one of my first blog posts!). So I decided to aim for Matt Cutts, despite having no way to contact him (the head of Google’s Webspam team is understandably a bit protective of his personal email address). And, just two weeks later, I had Matt locked in to the program.

Matt was an incredible speaker, and he had the unenviable task of opening the Industry Track at 8:30 AM, the morning after the banquet. His title, “WebSpam and Adversarial IR: The Road Ahead”, gave him a fair amount of maneuvering room, and he used his 45 minutes to give the audience a peek into his world.

He opened the talk by inducing the audience to try to think like a spammer. He then gave examples of social engineering attacks, to put us in a “black hat” mindset. He also pointed out the danger of punishing sites with spammy inlinks: people and companies would use this knowledge against their competitors / enemies (the practice has been called “Google bowling”).

He then moved on to examples of spam techniques. He showed examples of pages whose spaminess is only detectable by parsing JavaScript, something I wasn’t aware that Google could do (though apparently this has been public knowledge for a while). The theoretical computer scientist in me wonders about using random self-reducibility as obfuscation on steroids, but hopefully spammers aren’t quite that sophisticated yet!

He offered a common-sense framework for fighting spam: reduce the return on investment. Unfortunately, he sees a trend in spam where spammers are aiming for faster, higher payoffs by hacking sites and installing malware. Indeed, the democratizing effect of social media means that a lot more people have pages that can serve spam, including their Twitter and Facebook pages. He invited the information retrieval community to invest effort in learning how to automatically detect that a page or server has been hacked.

My only quibble with the talk is that Matt did not discuss the inherent subjectivity of spam. Sure, there are many cases that are black and white, but ultimately spam (like relevance) is in the eye of the user. I’d love to see more use of techniques like attention bond mechanisms that accommodate a subjective definition of spam, e.g., “any email that you would rather have not received.”

But I quibble. Matt delivered an excellent talk to a packed audience, and it was a real privilege to have him kick off the Industry Track.

P.S. You can also read Jeff Dalton’s notes on Matt’s presentation.

By Daniel Tunkelang

High-Class Consultant.

4 replies on “SIGIR 2009: Day 3, Industry Track: Matt Cutts”

Thanks for the useful synopsis.

I’m especially glad you give a nod to the subjective nature of what constitutes “spam,” a concept that is all too often glossed over – or rather, not even acknowledged as being both a theoretical and practical matter worth debating.

It is interesting that the link you supply for attention bond mechanisms (published July 2004) defines spam as “any email that you would rather have not received,” which at once exposes the historical roots of the concept and suggests how it has morphed. Though not an irreducible concept in itself, spam framed by email is at least relatively black and white: this is either a communication that I requested (or at least expected) or one that is unsolicited. The focus here is on the user experience.

For Google, spam has come to include – in the specific arena of their search results – “web pages that are ranked higher in results than their content or linking environments warrant,” as well as more straightforwardly unexpected or unsolicited communications (such as a redirect to another page, or the installation of malware).

The criteria used by Google to determine which pages fall into this category are both obscure and, again, subjective. Certainly at the linking level these criteria presuppose, it seems to me, a pool of “neutral” humans without a vested interest in a page’s ranking/reputation, pitted against a pool of commercially interested manipulators seeking to artificially elevate a web page’s ranking. The “neutrality” of those raters-by-proxy, and the notion that similar efforts by those with mercantile connections constitute “spam,” are, I think, very much open for discussion.


Aaron, I’m glad you picked up on that thread. I feel strongly that systems should not be in the business of unilaterally deciding what content I’ll find important–spam detection being a special case of relevance determination, and web search results and email being special cases of content.

I do concede that some cases are clear cut, which is why I use spam filters and appreciate that there is often value in search result ranking–which also amounts to filtering when the number of matches exceeds those displayed.

But, as you point out, my relevance criteria may not be the same as Google’s–or anyone else’s. That’s why I want control over the experience, rather than the benign paternalism of a search engine. I want HCIR!

