Note: this post is cross-posted at BLOG@CACM.
Today was the first day of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), held at the Polytechnic Institute of NYU in Brooklyn, NY. WSDM is a young conference that has already become a top-tier publication venue for research in these areas. In contrast to some of the larger conferences, WSDM is single-track and feels more intimate and coherent, even with over 200 attendees.
The day started with an ambitious keynote by Soumen Chakrabarti (IIT Bombay): “Bridging the Structured Un-Structured Gap”. He described a soup-to-nuts architecture to annotate web documents and perform complex reasoning on them using a structured query language. But perhaps this ambitious approach is a practical one: it uses the web we have, as opposed to waiting for the semantic web to emerge, and there is a prototype using half a billion documents.
The first paper session focused on web search. Of the five papers, two emphasized temporal aspects of content, one considered social media recommendation, and one focused on identifying concepts in multi-word queries. The last paper of the session proposed using anchor text as a more widely available input than query logs to support the query reformulation process. It also attracted the most audience attention: while interaction is often a niche topic at information retrieval conferences, it always elicits strong interest and opinions.
The following session focused on tags and recommendations. Some take-aways: users produce tags similar to the topics designed by experts; individual “personomies” can be translated into aggregated folksonomies; matrix factorization methods can produce interpretable recommendations.
The last session of the day covered information extraction. One of the papers used pattern-based information extraction approaches, demonstrating how far we’ve come since Marti Hearst’s seminal work on the subject. Another offered a SQL-like system for typed-entity search, complete with a live, publicly accessible prototype. The final paper addressed an issue that came up repeatedly at the SSM workshop: the problem of distilling the truth from a collection of inconsistent sources.