This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.
Those of you who attended the SIGIR 2009 Industry Track had the opportunity to hear Yahoo researcher Vanja Josifovski make an eloquent case for ad retrieval as a new frontier of information retrieval. At the CIKM 2011 Industry Event, Vanja delivered an equally compelling presentation entitled “Toward Deep Understanding of User Behavior: A Biased View of a Practitioner”.
Vanja first offered a vision in which the web of the future will be your life partner, delivering life-long pervasive personalized experience. Everything will be personalized, and the experience will pervade your entire online experience — from your laptop to your web-enabled toaster.
He then brought us back to the state of personalization today. For search personalization, the low entropy of query intent makes it difficult — or too risky — to significantly outperform the baseline of non-personalized search. In his view, the action today is in content recommendation and ad targeting, where there is high entropy of intent and lots of room for improvement over today’s crude techniques.
How do we achieve these improvements? We need more data, larger scale, and better methods for reasoning about data. In particular, Vanja noted that the data we have today — searches, page views, connections, messages, purchases — represents only the user’s thin observable state. In contrast, we lack data about the user’s internal state, e.g., whether the user is jet-lagged or worried about government debt. Vanja said that the only way to get more data is to motivate users by creating value for them with it — i.e., social is give to get.
Of course, we can’t talk about users’ hidden data without thinking about privacy. Vanja asserted that privacy is not dead, but that it’s in hibernation. So far, he argued, we’ve managed with a model of industry self-governance, with relatively minor impact from data leaks — especially as compared to the offline world. But he was apprehensive about the prospect of a major privacy breach inducing legislation that sets back personalization efforts for decades.
Vanja then talked about current personalization methods, including learning relationships among features, dimensionality reduction, and smoothing using external data. He argued that many of the models are mathematically very similar to one another, and that it is difficult to separate the relative merits of the models from the other implementation details of the systems that use them.
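To make the dimensionality-reduction idea concrete, here is a minimal sketch (my own illustration, not from the talk) of the low-rank approximation step behind many collaborative-filtering models: a truncated SVD of a toy user-item interaction matrix, where the reconstructed matrix “smooths in” scores for unobserved entries.

```python
import numpy as np

# Toy user-item interaction matrix (4 users x 5 items); the values
# are hypothetical click counts, purely for illustration.
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

# Truncated SVD: keep only the top-k singular values to get a
# low-rank approximation of R.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat assigns smoothed scores to the zero (unobserved) entries,
# which can then be ranked to recommend items to each user.
print(np.round(R_hat, 2))
```

The point of the exercise is that the model itself is a few lines of linear algebra; as Vanja noted, the hard part in practice lies in the surrounding system, not in the math.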
Finally, Vanja touched on scale issues. He noted that the MapReduce framework imposes significant restrictions on algorithms used for personalization, and that we need the right abstractions for modeling in parallel environments.
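To illustrate the restriction Vanja was pointing at, here is a tiny sketch (my own, with made-up event data) of the MapReduce programming model in plain Python: mappers emit key-value pairs independently per record, and reducers see each key’s values in isolation — a shape that fits simple aggregation well but fits iterative, cross-key model training poorly.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, event_type) pairs.
events = [
    ("u1", "search"), ("u2", "page_view"), ("u1", "page_view"),
    ("u2", "page_view"), ("u1", "search"), ("u3", "purchase"),
]

def map_phase(records):
    # Map: emit a (key, value) pair per record, with no access to
    # other records -- the only pattern the map step allows.
    for user, event in records:
        yield (user, 1)

def reduce_phase(pairs):
    # Reduce: values are grouped by key and combined per key.
    # Cross-key or iterative logic doesn't fit this shape directly.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

print(reduce_phase(map_phase(events)))
# -> {'u1': 3, 'u2': 2, 'u3': 1}
```

Counting events per user decomposes cleanly into these two phases; many personalization algorithms do not, which is why better abstractions for parallel modeling matter.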
P.S. My thanks to Jeff Dalton for live-blogging his notes.