Categories
Uncategorized

Daniel Lemire on Diversity in Recommender Systems

Daniel Lemire has been on a tear arguing in favor of diversity in recommender systems, and now he’s assembling a bibliography on the subject. The post is a good start, and the comment thread is already elaborating on it.

As regular readers know, I’m also in favor of diversity in recommender systems, but I’m more concerned with their transparency.

Categories
General

Life, the Universe, and SEO

In The Hitchhiker’s Guide to the Galaxy, Douglas Adams tells the story of how the Amalgamated Union of Philosophers, Sages, Luminaries and Other Thinking Persons protests the development of a computer called Deep Thought to provide the Answer to Life, the Universe, and Everything. Lunkwill and Fook, two of the philosophers, argue, “what’s the use of our sitting up half the night arguing that there may or may not be a God if this machine only goes and gives us his bleeding phone number the next morning?”

Deep Thought addresses their concerns as follows:

“but the programme will take me a little while to run.”

Fook glanced impatiently at his watch.

“How long?” he said.

“Seven and a half million years,” said Deep Thought.

Lunkwill and Fook blinked at each other.

“Seven and a half million years …!” they cried in chorus.

“Yes,” declaimed Deep Thought, “I said I’d have to think about it, didn’t I? And it occurs to me that running a programme like this is bound to create an enormous amount of popular publicity for the whole area of philosophy in general. Everyone’s going to have their own theories about what answer I’m eventually to come up with, and who better to capitalize on that media market than you yourself? So long as you can keep disagreeing with each other violently enough and slagging each other off in the popular press, you can keep yourself on the gravy train for life. How does that sound?” 

Adams wrote these words in 1979, before there was an Internet, let alone a World Wide Web or an attention economy based on it. And yet Adams could have been writing a parable about Google and the search engine optimization (SEO) industry.

According to Wikipedia, “Search engine optimization (SEO) is the process of improving the volume and quality of traffic to a web site from search engines via ‘natural’ (‘organic’ or ‘algorithmic’) search results.” More importantly, there is a cottage industry of SEO software tools and consultants, all ready to help you optimize your web site for a fee.

You might think that Google, the search engine whose traffic most SEO efforts are optimizing, might frown on the use of SEO. After all, Google’s number one tenet is “Focus on the user and all else will follow.” Nowhere in their list of “ten things” do I see mention of supporting a multi-billion dollar industry of companies and consultants trying to manipulate result ranking.

And yet Google’s official position on SEO is hardly one of censure. Here is an excerpt:

If you’re thinking about hiring an SEO, the earlier the better. A great time to hire is when you’re considering a site redesign, or planning to launch a new site. That way, you and your SEO can ensure that your site is designed to be search engine-friendly from the bottom up. However, a good SEO can also help improve an existing site.

Full disclosure: Endeca sells SEO / SEM tools, and our customers see great results from them. We pursue a white-hat goal of “ensuring that all your content is exposed to Web search engines the right way”.

So, is SEO good or bad? Today, it’s actually necessary. Google and other web search engines rely on SEO efforts to compensate for the limitations of their indexing. This is an example of how sites share in the responsibility for contextualizing a user’s experience.

But the adversarial nature of SEO is surely suboptimal for all parties. Well, for all parties other than SEO consultants and vendors. As Deep Thought said, “So long as you can keep disagreeing with each other violently enough and slagging each other off in the popular press, you can keep yourself on the gravy train for life. How does that sound?”

It sounds pretty lame to me. But it’s what we get when the attention economy of the web centers around a black box approach to relevance.

Categories
General

Opting Out of TheGreatHatsby

I just found this page explaining how to opt out of the “online communication experiment” called TheGreatHatsby which randomly attempts to connect two strangers through AOL instant messenger (AIM) bots.

Here are the instructions on how to opt out of this social experiment:

You can recognize the bots by their screen names, which are of the form <adjective>Trout, <adjective>Salmon, or <adjective>Coho.
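A naming pattern like that is easy to check programmatically. A minimal sketch (the screen names and the exact adjective pattern are my illustrative assumptions, not part of the bot’s documented behavior):

```python
import re

# Matches screen names of the form <adjective>Trout, <adjective>Salmon,
# or <adjective>Coho, e.g. "FidgetyCoho" -- assuming a single capitalized
# adjective precedes the fish name.
BOT_NAME = re.compile(r"^[A-Z][a-z]+(Trout|Salmon|Coho)$")

def looks_like_bot(screen_name):
    """Return True if the screen name fits the Trout/Salmon/Coho pattern."""
    return BOT_NAME.match(screen_name) is not None
```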

You can stop the messages by typing:

$optout

Then it will respond with:

OPERATOR: Are you sure you want to opt-out? If you do, you will *never* be contacted again on the account “<screenname>”. There is *no way* to opt back in and undo this.
If you are sure, type “$optout DADD”. Remember, this is permanent and irreversible!

Type what it asks:

$optout DADD

And you will receive one final (hopefully) message:

OPERATOR: You have opted out. The account “<screenname>” will *never* be contacted again. Good bye!

I just went through this process myself. I will let readers know if any more Coho bots show up.

Categories
General

Build Your Own Pirate Bay

Some readers may be familiar with The Pirate Bay, a Swedish website that indexes and tracks BitTorrent files. Those readers may also be aware that the site, and the pro-piracy Piratbyrån organization that runs it, have had various incidents with law enforcement. Perhaps some folks are looking for a contingency plan in case The Pirate Bay is shut down.

Have no worries, Google is there for you. As Ernesto at TorrentFreak writes, you can now build your own BitTorrent search engine for free using Google’s App Engine. In fact, the V0rtex and TorrentTab sites have already done so.

I suspect this is an unintended consequence of releasing the App Engine, much as Google Images was never intended to be the world’s largest porn site. But I’m curious how anti-piracy organizations like the RIAA and MPAA will react.

Categories
General

The Great Hatsby and Other IM Bots

In recent weeks, I’ve been hit several times by instant messenger bots. The bots are rather clever, and evidently started in 2006 as TheGreatHatsby:

TheGreatHatsby is an AIM bot which instigates conversations between pairs of AIM accounts. Its name is a play on words from the book The Great Gatsby. It is a relay bot that retrieves the list of most recent LiveJournal posts and obtains the AIM screen names of their authors. It then sends users the message “i say, old bean, have you seen my hat?” from one of its screen names.

Responses from users are forwarded by the bot to another one of the users similarly contacted. These two users are paired up, and any message which one of them sends to the bot will be forwarded to the other. Thus, if neither of the users is aware of TheGreatHatsby, they will each think that the other user contacted them, and that the other user’s screen name is the bot’s screen name.

Messages containing a user’s true screen name will have that screen name replaced with the bot’s screen name; similarly, if a message contains the bot’s screen name, it will be replaced by the screen name of the user receiving the message. This adds to the confusion, since copy-and-pasted chat logs will appear as though the users really were messaging each other directly.
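The pairing-and-substitution scheme described above can be sketched in a few lines. This is a toy model, not the bot’s actual code; the class name and the word-by-word substitution are my simplifications for illustration:

```python
class RelayBot:
    """Toy model of TheGreatHatsby's relay scheme."""

    def __init__(self, bot_name):
        self.bot_name = bot_name
        self.partner = {}  # user -> the user their messages are relayed to

    def pair(self, user_a, user_b):
        # Two unwitting users contacted by the bot are paired up.
        self.partner[user_a] = user_b
        self.partner[user_b] = user_a

    def relay(self, sender, text):
        """Return (recipient, rewritten_text) for a message sent to the bot."""
        recipient = self.partner[sender]
        # Simultaneous substitution, so the relay stays hidden even in
        # copy-and-pasted logs: the sender's real screen name becomes the
        # bot's name, and the bot's name becomes the recipient's own name.
        out = []
        for word in text.split(" "):
            if word == sender:
                out.append(self.bot_name)
            elif word == self.bot_name:
                out.append(recipient)
            else:
                out.append(word)
        return recipient, " ".join(out)
```

Note that the two substitutions must happen in a single pass; replacing one name and then the other sequentially would rewrite the first substitution’s output.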

Specifically, I’ve been hit by “Coho bots”, which are bots as described above with names of the form “<adjective>Coho”.

I had ignored the messages in the past, but I finally decided to respond to one. Even though I was unfamiliar with IM bots, I naturally assumed that an IM user named “FidgetyCoho” was a bot. So I was taken aback by the very human responses I received in response to my response. After my unwitting (and unamused) conversational partner and I each established that the other was human, I did my research and learned the above about Coho bots.

Between this experience and my recent bout with a focused comment spammer, I must say I’m impressed and disturbed by the increasing level of sophistication of social media spamming. Absent a robust means of establishing identity and a cultural will to use it, I only see things getting worse.

Categories
General

Focused Comment Spamming

Normally I take down spam comments. But a recent comment spammer posting with the name “lively” (claiming the unlikely email address of lively@gmail.com) struck me as interesting enough to leave as an example to readers. You can see the comment here.

I searched my logs and found that this spammer discovered my blog through a highly targeted search query:

http://www.google.com/search?q=inurl%3Asearchwiki+leave+a+comment

I don’t know if the spammer manually posted comments to the pages in the results or used a fully automated tool. I also wonder at the spammer’s breadth–I could imagine a fairly simple tool that takes a few parameters (name, email, the targeted search query, and the comment payload) and then crawls and spreads the spam.

This approach is technically trivial and potentially quite devastating–especially if spammers step it up a notch and start varying the comments a bit to avoid detection through duplication. I suppose we’ll see a lot more CAPTCHAs around if this approach catches on.
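The “detection through duplication” mentioned above can be as simple as hashing each comment body and flagging repeats, which is exactly why slightly varying the payload defeats it. A minimal sketch, with the normalization step being my own illustrative choice:

```python
import hashlib

def fingerprint(comment):
    # Normalize whitespace and case so trivial edits don't change the hash;
    # anything beyond that (synonym swaps, reordered sentences) slips through.
    normalized = " ".join(comment.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(comments):
    """Return comment bodies whose fingerprint was already seen."""
    seen, dupes = set(), set()
    for body in comments:
        fp = fingerprint(body)
        if fp in seen:
            dupes.add(body)
        seen.add(fp)
    return dupes
```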

Categories
Uncategorized

Sorting a Petabyte

Google may be reactionary when it comes to information seeking approaches, but they are at the cutting edge of systems research. Their official blog post today on sorting a petabyte in six hours using MapReduce was a reminder of the impressive caliber of their systems team. You can learn more from their Technology RoundTable Series.
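The trick behind sorting at that scale is range partitioning, as in Google’s TeraSort-style benchmarks: sample the keys to choose split points, route each record to the partition owning its key range, then sort each partition independently, so that concatenating the partitions yields a globally sorted result. A toy single-machine sketch of the idea (the parameters are arbitrary, and a real MapReduce distributes the partitions across thousands of machines):

```python
import random

def distributed_sort(records, num_partitions=4, sample_size=100):
    """Toy range-partitioned sort in the spirit of MapReduce sorting."""
    # Step 1: sample the keys and choose num_partitions - 1 split points.
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    splits = [sample[(i + 1) * len(sample) // num_partitions - 1]
              for i in range(num_partitions - 1)]

    # Step 2 (map/shuffle): route each record to its key-range partition.
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        idx = sum(1 for s in splits if r > s)
        partitions[idx].append(r)

    # Step 3 (reduce): sort each partition; since every key in partition i
    # precedes every key in partition i + 1, concatenation is globally sorted.
    return [r for part in partitions for r in sorted(part)]
```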

Categories
General

The Napoleon Dynamite Problem

This week’s New York Times Magazine features an article by Clive Thompson about the Netflix Prize. The Netflix Prize, sponsored by the Netflix movie rental company, is perhaps the best marketing stunt I’ve seen in the history of machine learning:

The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love.

The Netflix Prize has captured the imagination of many information retrieval and machine learning researchers, and I tip my hat to the folks at Netflix for inspiring researchers while pursuing their self-interest.

But, for all of the energy that the contest has generated, I have my doubts as to whether it focuses on the right problem. As I commented at hunch.net a few months ago:

What are the real problems we’re trying to help users address through recommendations? I know that, as a Netflix customer, I don’t get much out of the numerical ratings (or out of sorting by them). When a friend recommends a movie to me, he explains why he thinks I’ll like it. Not only is that far more informative, but it also allows me to interact with the process, since I can correct errors in his explanation (e.g., I like British comedy but not slapstick, and hence hate Monty Python) and thus arrive at different recommendations.

I understand that a simple model of assigning a number to each movie is very easy to measure and lends itself to a competition like the Netflix Prize. But, as an end-user, I question the practical utility of the approach, both in the specific context of movie recommendations and more generally to the problem of helping users find what they want / need.
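That “simple model of assigning a number to each movie” is concrete: predict a numeric rating per (user, movie) pair and score the predictions by root mean squared error against held-out ratings. A minimal sketch of that evaluation, using a simple bias baseline (the baseline and the numbers are illustrative, not Netflix’s actual method):

```python
from math import sqrt

def baseline_predict(user_mean, movie_mean, global_mean):
    # Classic bias baseline: global mean plus user and movie offsets.
    return global_mean + (user_mean - global_mean) + (movie_mean - global_mean)

def rmse(predicted, actual):
    """Root mean squared error -- the Netflix Prize's yardstick."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                / len(actual))
```

Note how little the metric asks of the predictor: a single number per pair, with no obligation to explain where it came from.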

I am gratified to see the New York Times article raises a similar concern:

As the teams have grown better at predicting human preferences, the more incomprehensible their computer programs have become, even to their creators. Each team has lined up a gantlet of scores of algorithms, each one analyzing a slightly different correlation between movies and users. The upshot is that while the teams are producing ever-more-accurate recommendations, they cannot precisely explain how they’re doing this. Chris Volinsky admits that his team’s program has become a black box, its internal logic unknowable.

Specifically, the off-beat film “Napoleon Dynamite” confounds many of the algorithms, because it is nearly impossible to predict whether and why someone will like it.

It’s a matter of debate whether black box recommendation engines are better than transparent ones. I’ve repeatedly made the case for transparency–whether for relevance or recommendations. But the machine learning community, much like the information retrieval community, generally prefers black box approaches, because the restriction of transparency adversely impacts the accuracy of recommendations.

If the goal is to optimize one-shot recommendations, they are probably right. But I maintain that the process of picking a movie, like most information seeking tasks, is inherently interactive, and thus that transparency ultimately pays off. Then again, I have drunk the HCIR kool-aid.
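One way to make a recommender transparent is neighborhood-based: score a candidate movie by its similarity to movies you already rated, and surface those neighbors as the explanation, the way a friend would. A minimal sketch (the function, the similarity table, and the movie names are all my illustrative assumptions):

```python
def recommend_with_explanation(user_ratings, similarity, candidates, k=2):
    """Score each candidate by its k most similar already-rated movies,
    and return (best_candidate, explanation) naming those neighbors."""
    best, best_score, best_neighbors = None, float("-inf"), []
    for movie in candidates:
        # The k already-rated movies most similar to this candidate.
        neighbors = sorted(user_ratings,
                           key=lambda seen: similarity.get((movie, seen), 0.0),
                           reverse=True)[:k]
        score = sum(similarity.get((movie, seen), 0.0) * user_ratings[seen]
                    for seen in neighbors)
        if score > best_score:
            best, best_score, best_neighbors = movie, score, neighbors
    reason = "because you rated " + " and ".join(best_neighbors)
    return best, reason
```

The explanation is inspectable, which is the point: a user who disagrees with one of the named neighbors can correct it, exactly the interaction a pure black box forecloses.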

Categories
Uncategorized

Housekeeping

I’ve made a few changes lately that I hope readers will appreciate, but please let me know if you have complaints–or suggestions:

  • The front page now shows the 10 most recent posts, rather than just 3. That was Jeremy’s suggestion.
     
  • I’ve stopped using the “Popular Posts” widget because of doubts about its accuracy and utility.
     
  • The “Recent Posts”, “Recent Comments”, and “Similar Posts” widgets now show 10 entries each rather than 5.
     
  • I’ve eliminated my “Blogroll” widget. I will continue posting “Blogs I Read” entries, which I believe are more useful than just links.

I think that’s about it. Also a heads up that I’ll be on vacation from Friday, November 28th to Friday, December 5th. So please don’t be shocked at the sudden absence of posts!

Categories
Uncategorized

Happy Birthday, Dear Turing Machine

Well, happy belated birthday. November 18th was the 72nd anniversary of the publication of Turing’s seminal paper, “On Computable Numbers, with an Application to the Entscheidungsproblem”.

As described on Wikipedia:

In mathematics, the Entscheidungsproblem (German for ‘decision problem‘) is a challenge posed by David Hilbert in 1928. The Entscheidungsproblem asks for an algorithm that will take as input a description of a formal language and a mathematical statement in the language and produce as output either “True” or “False” according to whether the statement is true or false. The algorithm need not justify its answer, nor provide a proof, so long as it is always correct. Such an algorithm would be able to decide, for example, whether statements such as Goldbach’s conjecture or the Riemann hypothesis are true, even though no proof or disproof of these statements is known. The Entscheidungsproblem has often been identified in particular with the decision problem for first-order logic (that is, the problem of algorithmically determining whether a first-order statement is universally valid).

In 1936 and 1937, Alonzo Church and Alan Turing, respectively, published independent papers showing that it is impossible to decide algorithmically whether statements in arithmetic are true or false, and thus a general solution to the Entscheidungsproblem is impossible. This result is now known as Church’s Theorem or the Church-Turing Theorem (not to be confused with the Church–Turing thesis).

Shortly after finishing my undergraduate studies at MIT, I had the privilege of attending a brunch with Gian-Carlo Rota and some of his colleagues who represented a who’s who of local combinatorialists. We got to talking about the halting problem, and someone asked me if it wasn’t just a contrived example of dubious relevance to computer science as a whole. Even then, I felt strongly that the halting problem, or its equivalent formulation in terms of Gödel’s incompleteness theorems, was at the heart of theoretical computer science, along with the still unsolved P vs. NP question.
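Far from contrived, the halting problem’s proof fits in a few lines of modern notation: suppose a decider halts(program, input) existed; then the self-referential program below defies its verdict. A sketch, with the decider necessarily left abstract since no algorithm can implement it:

```python
def halts(program, argument):
    """Hypothetical decider: True iff program(argument) would halt.
    Turing's argument shows no such function can exist."""
    raise NotImplementedError("no algorithm can decide halting")

def diagonal(program):
    # Do the opposite of whatever halts() predicts for program run on itself.
    if halts(program, program):
        while True:      # halts() said "halts", so loop forever
            pass
    return               # halts() said "loops", so halt immediately

# Feeding diagonal to itself forces a contradiction: if
# halts(diagonal, diagonal) returned True, diagonal(diagonal) would loop
# forever; if it returned False, diagonal(diagonal) would halt. Either
# way the decider is wrong, so no correct halts() can exist.
```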

So, happy birthday, dear Turing machine. You’ve aged marvelously.

(Via The Register)