- “Query Quality: User Ratings and System Predictions” by Claudia Hauff, Franciska de Jong, Diane Kelly, and Leif Azzopardi offered the startling (to me at least) result that human predictions of query difficulty did not correlate (or at best correlated weakly) with post-retrieval query performance prediction (QPP) measures like query clarity. I talked with Diane about it, and I wonder how strongly the human prediction, which was pre-retrieval, would correlate with human assessments of the results. I also don’t know how well the QPP measures she used apply to web search contexts.
- Which leads me to the next poster I saw, “Predicting Query Performance on the Web” by Niranjan Balasubramanian, Giridhar Kumaran, and Vitor Carvalho. They offered what I saw as a much more encouraging result–namely that QPP is highly reliable when it returns low scores. In other words, a search engine may wrongly believe that it did well on a query, but it is almost certainly right when it thinks it failed. This certainty on the negative side is exactly the opening that HCIR advocates need to offer richer interaction for queries where a conventional ranking approach recognizes its own failure. While some of the specifics of the authors’ approach are proprietary (they perform regression on features used by Bing), the approach seems broadly applicable.
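That asymmetry suggests a simple routing policy. Here is a toy sketch of the idea (my illustration, not the authors’ system; `predict_performance` and the threshold are hypothetical stand-ins for a learned QPP regressor):

```python
def route_query(query, predict_performance, low_threshold=0.2):
    """Trust only the confident failure predictions: a low predicted
    performance score is reliable, so fall back to richer interaction
    (e.g., faceted refinement or query suggestions). High scores may be
    wrong, but the plain ranked list is the default behavior anyway."""
    if predict_performance(query) < low_threshold:
        return "richer_interaction"
    return "ranked_list"
```

The point is that the policy never needs the predictor to be right about successes, only about failures, which is exactly the regime where the authors report it is reliable.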
- Next I saw “Hashtag Retrieval in a Microblogging Environment” by Miles Efron. He provided evidence that hashtags could be an effective foundation for query expansion of Twitter search queries, using a language model approach. The approach may generalize beyond hashtags, but hashtags do have the advantage of being highly topical and relatively unambiguous by convention.
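To make the language-model idea concrete, here is a generic query-likelihood sketch of ranking candidate hashtags for expansion (a simplified illustration with Dirichlet smoothing, not Efron’s exact model; `hashtag_docs` is a hypothetical map from each hashtag to the concatenated text of tweets using it):

```python
import math
from collections import Counter

def rank_hashtags(query_terms, hashtag_docs, mu=100, k=2):
    """Rank candidate hashtags by the likelihood of the query under each
    hashtag's (Dirichlet-smoothed) unigram language model; return the
    top k as expansion candidates."""
    corpus = Counter()
    for text in hashtag_docs.values():
        corpus.update(text.split())
    total = sum(corpus.values())

    def log_likelihood(text):
        tf = Counter(text.split())
        doc_len = sum(tf.values())
        # Smoothed probability of each query term under this hashtag's model.
        return sum(math.log((tf[t] + mu * corpus[t] / total) / (doc_len + mu))
                   for t in query_terms)

    ranked = sorted(hashtag_docs, key=lambda h: log_likelihood(hashtag_docs[h]),
                    reverse=True)
    return ranked[:k]
```

A query like “apple tablet” would score the model of a topically matching hashtag higher than an unrelated one, and the top hashtags could then be appended to the query.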
- “The Power of Naive Query Segmentation” by Matthias Hagen, Martin Potthast, Benno Stein, and Christof Brautigam suggested a simple approach for segmenting long queries into quoted phrases: consider all segmentations and, for a given segmentation, compute a weighted sum of the Google ngram counts for each quoted phrase, the weight of a phrase of length s being s^s. I don’t find the weighting particularly intuitive, but the accuracy numbers they present look quite nice relative to more sophisticated approaches.
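The scoring rule is simple enough to sketch in a few lines (a toy reconstruction from the description above; `ngram_count` stands in for a lookup into the Google n-gram corpus, and exhaustive enumeration of the 2^(n-1) segmentations is only feasible for short queries):

```python
from itertools import combinations

def segmentations(terms):
    """Enumerate all ways to split a term list into contiguous segments."""
    n = len(terms)
    for k in range(n):
        for breaks in combinations(range(1, n), k):
            cuts = [0, *breaks, n]
            yield [tuple(terms[i:j]) for i, j in zip(cuts, cuts[1:])]

def score(segmentation, ngram_count):
    """Weighted sum of n-gram counts: a phrase of s terms contributes
    s**s times its corpus count; single terms contribute nothing."""
    return sum(len(seg) ** len(seg) * ngram_count(seg)
               for seg in segmentation if len(seg) >= 2)

def best_segmentation(terms, ngram_count):
    """Pick the highest-scoring segmentation by exhaustive enumeration."""
    return max(segmentations(terms), key=lambda s: score(s, ngram_count))

# Toy counts standing in for the Google n-gram corpus.
counts = {("new", "york"): 5000, ("york", "times"): 3000,
          ("new", "york", "times"): 2000}
best = best_segmentation("new york times".split(),
                         lambda seg: counts.get(seg, 0))
```

With these toy counts the whole phrase wins despite its lower raw count, since 3^3 × 2000 exceeds 2^2 × 5000, which shows how the s^s weight pushes toward longer quoted phrases.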
- “Investigating the Suboptimality and Instability of Pseudo-Relevance Feedback” by Raghavendra Udupa and Abhijit Bhole showed that an oracle with knowledge of a few high-scoring non-relevant documents could vastly improve the performance of pseudo-relevance feedback. While this information does not lead directly to any applications, it does suggest that obtaining a very small amount of feedback from the user might go a long way. I’m curious how much is possible from even a single negative-feedback input.
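As a rough illustration of what even one negative judgment could do, here is a Rocchio-style sketch in a vector-space setting (my illustration, not the authors’ method; the weights and `gamma` are arbitrary):

```python
def negative_feedback(query_vec, neg_doc_vec, gamma=0.5):
    """Rocchio-style negative feedback: subtract a scaled copy of a
    known-non-relevant document vector from the query vector, clamping
    term weights at zero so the query never goes negative on a term."""
    return {term: max(0.0, query_vec.get(term, 0.0)
                      - gamma * neg_doc_vec.get(term, 0.0))
            for term in set(query_vec) | set(neg_doc_vec)}
```

A single high-scoring non-relevant document can substantially down-weight the terms that are leading the ranker astray, e.g. suppressing the animal sense of an ambiguous query like “jaguar”.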
- “Short Text Classification in Twitter to Improve Information Filtering” by Bharath Sriram, David Fuhry, Engin Demir, Hakan Ferhatosmanoglu, and Murat Demirbas challenged the conventional wisdom that tweets are too short for traditional classification methods. They achieved nice results, but on the relatively simple problem of classifying tweets as news, events, opinions, deals, and private messages. I was offered promises of future work, but I think the more general classification problem is much harder.
- “Metrics for Assessing Sets of Subtopics” by Filip Radlinski, Martin Szummer, and Nick Craswell proposed an evaluation framework for result diversity based on coherence, distinctness, plausibility, and completeness. I suggested that this framework would apply nicely to faceted search interfaces, and that I’d love to see it demonstrated on production systems–especially since I think that might be easier to achieve than convincing the SIGIR community to embrace it.
- Which leads me nicely to the last poster I saw, “Machine Learned Ranking of Entity Facets” by Roelof van Zwol, Lluis Garcia Pueyo, Mridul Muralidharan, and Borkur Sigurbjornsson. They found that they could accurately predict click-through rates on named entity facets (people, places) by learning from click logs. It’s worth noting that their entity facets are extremely clean, since they are derived from sources like Wikipedia, IMDB, GeoPlanet, and Freebase. It’s not clear to me how well their approach would work for noisier facets extracted from open-domain data.
As I said, there were over a hundred posters, and I’d meant to see far more of them. Hopefully other people will blog about some of them! Or perhaps tweet about them at #sigir2010.