Back to our regularly scheduled blogging about the SIGIR 2009 Industry Track. For those who haven’t been reading along, we covered the first three talks:
- Matt Cutts (Google): Web Spam and Adversarial IR: The Road Ahead
- danah boyd (Microsoft Research): The Searchable Nature of Acts in Networked Publics
- Vanja Josifovski (Yahoo! Research): Ad Retrieval – A New Frontier of Information Retrieval
As you can see, that covered the three major web search engine companies, at least at the time of the conference. Sure, danah’s talk wasn’t exactly what people might have expected, but I had some creative license as an organizer, and the audience loved her talk. Besides, as we’ll get to in the next post, Microsoft had other opportunities to present representatives of its more conventional information retrieval divisions.
The next speaker, according to the original plan, was to be Tom Tague, who leads the Open Calais project at Thomson Reuters. Unfortunately, a week before SIGIR, I found out that he would be unable to make it. One of his colleagues offered to present in his stead less than 24 hours after his cancellation, but by then I’d already found a replacement on my own: Evan Sandhaus, Semantic Technologist in the New York Times Research and Development Labs, agreed to talk about “Corpus Linguistics and Semantic Technology at the New York Times”.
You can get a good idea of his talk from these slides–or, better yet, from this video. Both are from the closing keynote that he and his colleague Rob Larson delivered at the 2009 Semantic Technology Conference.
I won’t try to recapture Evan’s fascinating narrative about the history of information storage and retrieval at the New York Times. Rather, I’ll skip to the parts that should matter most to information retrieval researchers and practitioners: the availability of the New York Times Annotated Corpus through the Linguistic Data Consortium (LDC), and the New York Times’s intention to contribute to the Linked Data Cloud.
For me personally, the annotated corpus is the bigger deal. It represents 1.8 million articles written over 20 years. It is annotated both with manually-supplied summaries and tags–the latter drawn from a controlled vocabulary of people, organizations, locations and topic descriptors–and with algorithmically-supplied tags that are manually verified. My colleagues and I at Endeca have been working with the annotated corpus, and it is a delight. I hope that the information retrieval community will make heavy use of this wonderful new resource.