The moment I learned that CIKM 2012 would be held in Maui, I knew I had to be there. Having co-organized the CIKM 2011 industry event, I had enough karma to be invited as part of this year’s industry event, representing LinkedIn.
PRE-CONFERENCE: PSEUNAMI WARNING AND A WORKSHOP
I arrived in Maui on Sunday, October 28th, fortunate to miss the “pseunami” warnings prompted by an earthquake off the Canadian coast. And even more fortunate to be thousands of miles away from Hurricane Sandy.
Monday, I attended the Workshop on Data-Driven User Behavioral Modeling and Mining from Social Media. The topics within this area were diverse: they included Pinterest users, resume-job matching (unfortunately without the benefit of LinkedIn data), and street harassment stories reported via Project Hollaback.
But ironically in this workshop — and throughout the conference — there was more use of Twitter data than of Twitter itself. Most of the tweets using the #cikm2012 hashtag were my own.
DAY 1: USER ENGAGEMENT, EVALUATION BIAS
Tuesday opened with a welcome that included statistics showing how far CIKM has come as a top-tier international conference. There were 1,088 submissions this year! But the highlight of the opening was program co-chair Guy Lebanon demoing software to “improve” paper reviews. It was hilarious, if a bit close to home: the automatically generated reviews looked a lot like those generated by allegedly human reviewers.
We then proceeded to a keynote by Yahoo! Research VP Ricardo Baeza-Yates entitled “User Engagement: The Network Effect Matters!” The title was a bit confusing: it wasn’t about the conventional “network effect”, but rather about user engagement across a network of sites like those owned by Yahoo. He talked about different ways to measure user engagement, and noted that off-site (or, rather, off-network) links ultimately improve users’ downstream engagement. He also observed that style attributes outperform content attributes as predictors of user engagement. Lots of fascinating observations, but I’m curious how well they generalize beyond Yahoo.
I spent the rest of the day making tough choices among the various parallel sessions, starting with the morning session on information retrieval evaluation. Some nuggets from that session: captions and other surface features introduce significant evaluation bias; assessors have poor agreement when evaluating relevance in eDiscovery contexts; and system evaluation improves when it models user differences.
After lunch, I attended a session on web search. Some themes from that session: neighborhood-based methods are effective, whether the neighborhoods are based on document or user similarity; entities and structure are increasingly important for web search. After the coffee break, I went to the social networks session. Topics there included social contagion, online question answering, and social network data anonymization. The talks wrapped up just in time for us to watch the daily cliff diving ceremony before heading to the poster session.
DAY 2: QUERY PERFORMANCE PREDICTION, ABANDONMENT
Wednesday opened with a keynote by CMU professor William Cohen on “Learning Similarity Measures based on Random Walks in Graphs”. He described the framework and techniques that he and his colleagues used to build NELL (“Never-Ending Language Learning”). The keynote was pretty dense, but there are lots of papers available on the NELL publications page.
Then back to choosing among parallel sessions. Although I was tempted by the recommender systems session featuring presentations by my LinkedIn colleagues Mitul Tiwari and Bee-Chung Chen, I instead attended the session on ads and products. Two take-aways from that session: ad targeting benefits from explicit identification of user interests; influence maximization can be modeled adversarially as a two-player game.
After lunch, I attended the session on formal retrieval models and learning to rank. I most enjoyed the two talks by Oren Kurland that focused on query performance prediction. In particular, he offered a comprehensive probabilistic prediction framework that unifies most of the previously proposed prediction methods using a common formal basis. The session also included a deep dive into aspects of the IBM Watson question-answering system.
After the coffee break, I headed to another session on web search, one of my favorite sessions of the conference. There was a talk on query segmentation, a topic responsible for my most popular blog post. Also a great talk on identifying good abandonment, a problem I’ve been interested in ever since hearing about it at SIGIR 2010. Another talk about learning from search logs: generalizing from click entropy to “click pattern entropy” to analyze query ambiguity. And a talk on modeling domain-dependent query reformulation as machine translation using a pseudo-parallel corpus. All in all, a great session packed with practical content.
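For readers unfamiliar with click entropy, here is a minimal sketch of the plain version: the Shannon entropy of the distribution of clicked results for a single query. The function name and the toy click logs are hypothetical, and this does not attempt to reproduce the “click pattern entropy” generalization from the talk.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the clicked-URL distribution for one query.

    clicked_urls: one entry per logged click, naming the result clicked.
    Low entropy suggests users agree on a target; high entropy suggests
    the query may be ambiguous.
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no clicks logged, so nothing to measure
    # Sum of p * log2(1/p) over each distinct clicked URL.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical click logs, made up for illustration only.
navigational = ["a.com"] * 8                         # everyone clicks one result
ambiguous = ["a.com", "b.com", "c.com", "d.com"] * 2  # clicks spread evenly

print(click_entropy(navigational))
print(click_entropy(ambiguous))
```

In this toy data, the query whose clicks all land on one result scores lower than the query whose clicks are spread evenly over four results, which is the intuition behind using entropy as an ambiguity signal.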
Then came a purely social evening. The conference reception was a luau, complete with kalua pig, mai tais, hula, and of course poi. Certainly my most memorable conference banquet. I didn’t take pictures, but I recommend Craig Stanfill’s photos on Flickr.
DAY 3: INDUSTRY EVENT
Thursday began with the last conference keynote: University of Kansas provost Jeffrey Vitter on “Compressed Data Structures with Relevance”. Like the previous keynotes, it was fairly dense, and I suggest you read the papers cited in his abstract if you’re interested in the technical details of how to search for query patterns in massive document collections.
Then came my main reason for attending the conference: the industry event. As seems to have become a pattern at information retrieval conferences, the industry event dominated the other parallel sessions, drawing a standing-room-only crowd.
The event started with eBay VP of Research Eric Brill talking about “Having A Great Career in Research”. Unusual in a conference talk, he offered personal and practical advice to students on how to focus their passion and effort towards a happy and successful career. It reminded me of my blog post about dream, fit, and passion, and I hope students took it to heart.
IBM researcher David Carmel gave a talk entitled “Is This Entity Relevant to Your Needs?”. Noting that 71% of web search queries contain named entities (people, places, organizations), he advocated a probabilistic ranking approach to entity-oriented search that ranks retrieved entities according to amount and quality of supporting evidence.
Microsoft Technical Fellow (and former Yahoo! Fellow) Raghu Ramakrishnan talked about “The Future of Information Discovery and Search: Content Optimization, Interactivity, Semantics, and Social Networks”. He packed in a lot of nice material, most of which was from his tenure at Yahoo. He included a nice explanation of explore/exploit, which was also a reminder of how lucky we are at LinkedIn to have hired his former Yahoo colleague Deepak Agarwal.
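To make the explore/exploit idea concrete, here is a minimal sketch using epsilon-greedy selection, one common textbook strategy. This is an assumption chosen for illustration, not necessarily the approach presented in the talk, and the function name, parameters, and counts are all hypothetical.

```python
import random


def epsilon_greedy(clicks, views, epsilon, rng):
    """Choose which item to show next.

    With probability epsilon, explore: pick an item uniformly at random.
    Otherwise, exploit: pick the item with the best observed click rate.
    clicks[i] and views[i] are the counts observed so far for item i.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(clicks))
    # Items never shown get a rate of 0.0 to avoid dividing by zero.
    rates = [c / v if v else 0.0 for c, v in zip(clicks, views)]
    return max(range(len(rates)), key=rates.__getitem__)


# Hypothetical counts for three candidate items.
clicks = [1, 5, 2]
views = [10, 10, 10]
rng = random.Random(0)  # seeded so the sketch is repeatable

print(epsilon_greedy(clicks, views, epsilon=0.1, rng=rng))
```

Setting epsilon to 0 always shows the current best performer, while a small positive value keeps occasionally testing the other items so that their estimates can improve.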
After lunch, WalmartLabs Chief Scientist AnHai Doan gave a talk entitled “Social Media, Data Integration, and Human Computation”, in which he described constructing a “social genome” by mining social data, connecting it to web data, and representing the combined information in a knowledge base. If you’re interested in more details, he’ll be giving an extended version of that talk at LinkedIn on November 29th!
Tencent Research Director Chao Liu talked about “Question Answering through Tencent Open Platform”. Beyond giving a great overview of one of the world’s largest internet platforms, he delivered great self-deprecating lines like “The name is ten cents, and the search engine is soso”.
I spoke next about LinkedIn’s “Data By The People, For The People”. Given that the talk was right after Halloween and just before the presidential elections, I thought it appropriate to choose a title that would have appealed to one of America’s most distinguished presidents and vampire hunters. If you’re curious to learn more about data science and engineering at LinkedIn (including the publications I cited in my talk), check out http://data.linkedin.com/.
Groupon Director of Research Rajesh Parekh talked about “Leveraging Data to Power Local Commerce”. He focused on a key problem Groupon faces: determining an optimal category mix for each local market. He described how they approach this problem using portfolio theory.
After a coffee break, Adobe Chief Software Architect Tom Malloy talked about “Revolutionizing Digital Marketing with Big Data Analytics”. Apache Pig co-creator — and now Google researcher — Christopher Olston talked about work he did at Yahoo on “Programming and Debugging Large-Scale Data Processing Workflows”. Finally, Microsoft Distinguished Engineer Xuedong (“XD”) Huang gave a talk entitled “From HyperText to HyperTEC”, in which he woke up the audience by having us all participate in the “Bing it On” challenge.
All in all, CIKM 2012 was a great conference in an idyllic setting. Holding the conference in Maui might have been a bit distracting, but the desirability of the location also ensured a high-quality program.
My main complaint is that I don’t like parallel sessions, especially when the topics overlap significantly (e.g., web search sessions competed with those on ranking and recommendations). I’m also not convinced that talks have to be 25 minutes long. Perhaps the conference could move to a format of shorter talks and at least reduce the number of parallel sessions. It would also be great to see more opportunity for interaction; the coffee breaks always felt too short. For more of my thoughts on reforming academic conferences, see my 2009 blog post on the subject.
I also wish more attendees would embrace social media. It’s ironic that researchers who depend so heavily on social media data (especially Twitter) don’t engage in it personally. While I’m honored to have been the conference’s unofficial tweeter (see this visualization of the #cikm2012 tweets), I would have liked to see more attendees engage in a public online conversation. Hopefully others will at least blog about the conference.