Note: this post is cross-posted at BLOG@CACM.
The day started with a short session on temporal interaction. Topics included clustering social media documents (e.g., Flickr photos) based on their association with events, statistical tests for early identification of popular social media content, and analysis of answers sites (like Yahoo! Answers) as evolving two-sided economic markets.
The next session focused on advertising. Two papers focused on click prediction: one proposing a Bayesian inference model to better predict click-throughs in the tail of the ad distribution; the other presenting a framework for personalized click models. Another paper addressed the closely related problem of predicting ad relevance. The remaining papers discussed other aspects of search advertising: one on estimating the value per click for channels like Google AdSense, where ad inventory is supplied by a third party; the other proposing an algorithmic approach to automate online ad campaigns based on landing page content.
The following session was on systems and efficiency, a popular topic given the immense data and traffic associated with web search. Two papers proposed approaches to help short-circuit ranking computations: one by optimizing the organization of inverted index entries to consider both the static ranks of documents and the upper bounds of term scores for all terms contained in each document; the other using early-exit strategies to optimize ensemble-based machine learning algorithms. Another used machine learning to mine rules for de-duplicating web pages based on URL string patterns. Another focused on compression, showing that web content is at least an order of magnitude more compressible than what can be achieved by gzip. The last paper proposed a method to perform efficient distance queries on graphs (e.g., web graphs or social graphs) by pre-computing a collection of node-centered subgraphs.
The last session of the conference discussed various topics in web mining. One presented a system for identifying distributed search bot attacks. Another proposed an image search method using a combination of entity information and visual similarity. The final paper showed that shallow text features can be used for low-cost detection of boilerplate text in web documents.
All in all, WSDM 2010 was an excellent conference, and I’m sad not to have been able to attend more of it in person. I’m delighted to see an even mix of academic and industry representatives sharing ideas and working to make the web a better place for information access.