Categories
General

SIGIR ’09 Registration Details

You can now register for SIGIR 2009! Here are the details from the registration page:

Registration fees for ACM members are as follows:

  • $695 for the main three-day conference, including the conference banquet; $395 for students
  • $175 for half-day tutorials
  • $295 for full-day tutorials or two half-day tutorials
  • $150 for workshops
  • $250 for Wednesday’s Industry Track

Attendees who are not members of the ACM are charged higher registration fees, and students have special discounted conference/tutorial/workshop package options. These early registration rates will end at midnight on May 24, at which point rates will rise, generally by $50; normal registration rates will end at midnight on July 12, at which point rates will rise again, generally by $50.
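
To make the tiered pricing concrete, here is a minimal sketch of the fee schedule. The base fees and deadlines come from the registration details above; the uniform $50 bump per missed deadline is an assumption (the page says rates rise "generally by $50"), and the function name is mine, not the registration site's.

```python
from datetime import date

EARLY_DEADLINE = date(2009, 5, 24)   # early rates end at midnight May 24
NORMAL_DEADLINE = date(2009, 7, 12)  # normal rates end at midnight July 12

BASE_FEES = {
    "conference": 695,
    "student_conference": 395,
    "half_day_tutorial": 175,
    "full_day_tutorial": 295,
    "workshop": 150,
    "industry_track": 250,
}

def registration_fee(item: str, registered_on: date) -> int:
    """Return the fee for `item`, adding $50 for each missed deadline."""
    fee = BASE_FEES[item]
    if registered_on > EARLY_DEADLINE:
        fee += 50  # missed the early-registration window
    if registered_on > NORMAL_DEADLINE:
        fee += 50  # missed the normal-registration window too
    return fee

print(registration_fee("industry_track", date(2009, 5, 1)))   # → 250
print(registration_fee("industry_track", date(2009, 6, 1)))   # → 300
print(registration_fee("industry_track", date(2009, 7, 20)))  # → 350
```

So a practitioner who registers for the Industry Track before May 24 pays $250; waiting until after July 12 costs $350.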

I know that some have complained about the increase in price relative to previous years. Unfortunately, I suspect that one of the consequences of our current economic climate is that it’s much harder to subsidize conferences through sponsorship.

What I can say, however, is that the $250 for the one-day Industry Track is a steal: compare that to the fees for other industry conferences, and look at the speaker line-up! I strongly recommend it for practitioners. Of course, if you can afford to attend the whole conference, even better. This year’s papers and posters particularly emphasize work from industry, and I’m already excited about learning what everyone has been up to!

Conferences, Conferences, Conferences!

Apologies for the lull in blogging this week, but it’s been a busy week in what looks to be a busy spring (and summer!) of conferences related to information access.

This week, I was in Boston, presenting at the Infonortics Search Engine Meeting and the International Association of Scientific, Technical & Medical Publishers Spring Conference.

The Search Engine Meeting was fun, if a bit cozier than in previous years (the recession is definitely taking a toll on travel budgets). The keynote by David Evans, entitled “E-Discovery: A Signature Challenge for Search”, made a phenomenal case for weaning researchers from web search as the canonical domain for information retrieval, and instead setting our sights on more valuable problems that emphasize recall, require human-in-the-loop processing, and lack training data or established evaluation metrics. He didn’t call it HCIR, but he was certainly preaching it! You can find copies of most of the presentations here.

The STM conference was a unique experience for me, starting with the keynote by a lobbyist for stronger copyright law. Indeed, the first day of the conference was largely concerned with addressing two threats to STM publishers’ current business models: copyright infringement and open access. Not everyone at the conference saw open access as a threat, and that made for a healthy debate. The second day focused on the present and future of semantic technologies, somewhat more familiar territory for me. I particularly liked a presentation by Priya Parvatikar that explained the semantic web in clear, hype-free terms. In fact, I’m looking forward to re-using it when she or the conference organizers post it!

Meanwhile, there are more conferences coming up! The Enterprise Search Summit takes place May 12-13 at the Hilton New York. I’ll be presenting on a panel about “Emergent Social Search Experiences”. The conference isn’t cheap, but they are offering a great recession-busting special: a free “VIP Pass” that includes admission to the keynotes and showcase. I hope that means I’ll see more of you at the summit in a couple of weeks!

Still to come in June and July: the 5th Annual Text Analytics Summit, Endeca Discover ’09, SIGMOD 2009, and of course SIGIR ’09.

Google Shows Wolfram Who’s The Alpha Dog

I actually feel bad for Stephen Wolfram. After all the weeks of hype leading up to his public demonstration of Wolfram Alpha at Harvard this afternoon, Google upstaged him by releasing Google Public Data today. Catty? Perhaps, but in a very classy way. They just blogged about it and released it. No private demos, no fanfare; they just shipped it.

More importantly, they say:

The data we’re including in this first launch represents just a small fraction of all the interesting public data available on the web. There are statistics for prices of cookies, CO2 emissions, asthma frequency, high school graduation rates, bakers’ salaries, number of wildfires, and the list goes on. Reliable information about these kinds of things exists thanks to the hard work of data collectors gathering countless survey forms, and of careful statisticians estimating meaningful indicators that make hidden patterns of the world visible to the eye. All the data we’ve used in this first launch are produced and published by the U.S. Bureau of Labor Statistics and the U.S. Census Bureau’s Population Division. They did the hard work! We just made the data a bit easier to find and use.

Clearly Google doesn’t put much stock in Wolfram Alpha’s proprietary collection of ten trillion curated facts. Moreover, Google is also using curated data; the difference is that its data is already freely available in the public domain. Perhaps Wolfram Alpha has collected broader data or built a more robust query parser, but now the onus will be on them to prove it, and to prove that the difference is meaningful to users. I can’t imagine that Wolfram is loving Google right now.

Ironically, one of the things that Google may have inadvertently proved is that this kind of question answering isn’t really that valuable to users. The queries I’ve seen posted or tried myself are a novelty, but at best they are a minimal time saver, a modest improvement on Google Calculator. As I pointed out in an earlier post about Wolfram Alpha, I think the NLP interface is wrong-headed, and that they (or anyone else trying to create more value from objective data) should be focusing on APIs that make the data easier to integrate into other applications. But they don’t seem to be headed in that direction.

In any case, Google certainly wins this round. And the blogosphere is loving it.

Who Wants To Play “Jeopardy”?

That would be IBM Research, for millions of dollars (I suspect). I’ve known about the Jeopardy project for a while from colleagues at IBM, and I’m glad I can finally talk about it publicly, now that it’s been reported in the New York Times.

It’s a great challenge, and I hope IBM can rally around it the way it did for chess. But I’d love to see information retrieval researchers consider a related problem: looking at the results for a query and trying to reverse engineer the query from that set (i.e., without cheating and looking at the query). In other words, I want search engines to do what we as humans do naturally. When I’m not sure I understand you, I repeat back what I think you said, in words I’m sure I understand and that I believe you’ll understand too. It’s a great way to clarify misunderstandings and to make sure we end up on the same page.
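
As a toy illustration of this reverse-engineering idea (my sketch, not a method anyone has published), one naive approach is to rank the terms that appear unusually often in the result set relative to a background corpus, and guess that the top-scoring terms were the query. The crude log-odds score below stands in for a real retrieval model, and all the names and sample documents are invented for illustration.

```python
import math
from collections import Counter

def infer_query(result_docs, background_docs, k=3):
    """Guess the query behind `result_docs` by ranking terms that are
    unusually frequent in the results relative to a background corpus."""
    def term_counts(docs):
        counts = Counter()
        for doc in docs:
            counts.update(doc.lower().split())
        return counts

    res = term_counts(result_docs)
    bg = term_counts(background_docs)
    total_res = sum(res.values())
    total_bg = sum(bg.values())

    def score(term):
        p_res = res[term] / total_res
        # Add-one smoothing so terms absent from the background still score.
        p_bg = (bg[term] + 1) / (total_bg + len(bg))
        return p_res * math.log(p_res / p_bg)

    ranked = sorted(((t, score(t)) for t in res), key=lambda x: -x[1])
    return [t for t, _ in ranked[:k]]

results = ["faceted search interfaces for exploration",
           "evaluating faceted search in libraries"]
corpus = ["the cat sat on the mat",
          "libraries hold many books for exploration",
          "search engines index the web"]
print(infer_query(results, corpus, k=2))  # → ['faceted', 'search']
```

A system that can play this game well could then "repeat back" its inferred query to the user, which is exactly the clarification dialogue described below.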

This clarification dialogue is a key part of the HCIR vision: establishing shared understanding between the user and the system. And it bears a striking resemblance to the game of Jeopardy. When a user receives results in response to a query, those results should feel like an easy Jeopardy “answer”, for which the “question” jumps out as being compatible with the user’s information need. If that is not the case, then something has broken down in the communication, and the system should work with the user to resolve the breakdown.

I realize that HCIR isn’t quite as sexy as question answering (or is this answer questioning?) and certainly doesn’t have its own household-name game show. Then again, I never imagined that prospect theory and the prisoner’s dilemma would get their own game shows. A researcher can hope!

Presenting at Infonortics Search Engine Meeting

If you’re attending the Infonortics Search Engine Meeting in Boston next week, please let me know! I’ll be there all day Monday and Tuesday, and I’ll be talking on Tuesday afternoon about “Enabling the Information Seeking Process”.

I’m sticking around in Boston to attend the International Association of Scientific, Technical & Medical Publishers Spring Conference, where I’ll be presenting “Exploring Semantic Means” on Thursday morning. My presentation there will be similar to the one I delivered at the New York Semantic Web Meetup, but I’m also hoping to sneak in a live demo!

Hopefully I’ll see some of you there! I also apologize in advance if my blogging is a bit thin over the next several days. I’ll post reactions to the two conferences as soon as I have the time to gather them.

Book Writing vs. Blogging

When I announced that I’d be writing a book, I promised that I would blog about the experience. It may seem odd that I’m only blogging about it now, when I’m almost done, but perhaps that gives you a sense of how absorbing the book-writing process has been for me. In any case, I’m not sure if or when I’ll write another book, so I wanted to take a moment to jot down my thoughts about the experience.

Blogging, at least for me, is about public conversation. It’s an asymmetric conversation, for sure: I’m the blogger (at least on my blog!), so I get to go first. But my best posts (and, in my opinion, the best blog posts in general) are those where a comment stream quickly takes over, making the post more of a conversation starter than a monologue. Perhaps the best example of this on my blog is “Looking for a Devil’s Advocate”, which inspired over sixty comments in a conversation that lasted for three weeks.

A major factor in this dynamic is immediacy. A reflective blog post might take me an hour to write (fortunately most don’t take quite that long!), but that is still close to instant gratification, particularly when I’m blogging about a timely topic. Comments may take only seconds to write (Jeremy’s being a notable exception), and I don’t moderate comments, precisely because I want to preserve the momentum of real-time conversation.

Writing a book, needless to say, is quite different. I spent over two months putting together these hundred pages about faceted search. While faceted search is a current industry topic (albeit not as topical as Microsoft’s latest earnings report), I found myself drawing on materials from Aristotle, Linnaeus, and Ranganathan, hardly the usual fare for blogging. Even my more recent material includes research from the 1990s that barely has a web presence, let alone a presence in the blogosphere. It’s odd to think of such recent work as history, and yet it felt that way to me.

Perhaps that’s because writing a book is neither immediate nor conversational. It is a lonely endeavor, even for just a couple of months (though that might just reflect that I am a pathological extrovert). I had to resist the temptation to write the book as a wiki, encouraging the world to make edits and suggestions throughout the writing process (I’m sure my publisher would shudder to know that I did entertain this possibility). Only now, in the final stages of writing, have I been receiving feedback, and it is a refreshing change! Indeed, I’m moved by the number of people who have stepped up to volunteer their time and effort, including people whom I have never met face-to-face. To anyone who questions whether an online social network is real, I stand as a case study in benefiting from that reality.

In short, it’s been a great learning experience, and I hope the artifact proves valuable to a broad audience. Still, it’s nice to be able to spend more time blogging again. I am a sucker for instant gratification.

Slight Change to the HCIR ’09 CFP

I hope you all are gearing up for HCIR 2009! Those who have not yet read the call for participation or looked at the web site can safely ignore this message, which announces what we hope is a minor change for participants.

After receiving feedback on the CFP, we (Bill, Ryen, and I) decided to request more substantive position papers (up to 4 pages), and to select four to six of these for presentation in a workshop panel. We will still have a morning “poster boaster” session for all other participants, and we strongly encourage all attendees (including those on the panel) to present posters.

I hope the change does not cause any confusion or inconvenience; in any case, the submission date is still four months away! We made the change after being convinced that having a few peer-reviewed presentations would help continue the tradition of people bringing their best work to the workshop, as they have done in previous years. But I do want to reinforce that, as we announced in the CFP, we will adjust the schedule to allow for more interaction time relative to previous years. We do recognize that the discussions are the best part of the workshop.

In any case, if you have any questions or concerns, please let me know! This is still a young workshop, and we’d like to make sure we are listening to the community we serve.

Too Connected, Or Not Connected Enough?

A bit off my usual selection of topics, but an article by Bruce Perens about a cyber-attack on Morgan Hill, a small city in northern California, caught my attention:

Just after midnight on Thursday, April 9, unidentified attackers climbed down four manholes serving the Northern California city of Morgan Hill and cut eight fiber cables in what appears to have been an organized attack on the electronic infrastructure of an American city. Its implications, though startling, have gone almost un-reported.

That attack demonstrated a severe fault in American infrastructure: its centralization. The city of Morgan Hill and parts of three counties lost 911 service, cellular mobile telephone communications, land-line telephone, DSL internet and private networks, central station fire and burglar alarms, ATMs, credit card terminals, and monitoring of critical utilities. In addition, resources that should not have failed, like the local hospital’s internal computer network, proved to be dependent on external resources, leaving the hospital with a “paper system” for the day.

Read the full article for details. What struck me was the following question: is the vulnerability a sign of our being too connected, or not connected enough?

Perens notes how the attack demonstrated unnecessary dependence on connectivity, e.g., in the hospital’s internal network. But in an era of cloud computing, such dependencies on external services are becoming more common. It’s certainly easy to read a lesson in this experience that our systems should perform better in disconnected mode.

But the other lesson may be that it was too easy to disconnect the city. Should cutting eight cables be enough to disconnect over 50,000 people (not just in Morgan Hill, but also in nearby counties)? Should we instead be trying to achieve the fault tolerance of a mesh network? I’m no networking expert, so I don’t know whether, aside from the fixed costs associated with overhauling network infrastructure, mesh networking is efficient enough to replace our current architecture.
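
To make the redundancy argument concrete, here is a toy sketch (entirely illustrative; the node names and topologies are invented and are not a model of Morgan Hill’s actual network). A hub-and-spoke network dies when one link to the hub is cut, while even a simple ring, the most minimal mesh, survives any single cut because every node has two independent paths.

```python
from collections import defaultdict, deque

def is_connected(edges, nodes):
    """BFS reachability check: is every node reachable from the first?"""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        n = queue.popleft()
        for m in adj[n] - seen:
            seen.add(m)
            queue.append(m)
    return seen == set(nodes)

nodes = ["hub", "a", "b", "c", "d"]
# Hub-and-spoke: every town connects only through "hub".
star = [("hub", n) for n in nodes[1:]]
# Ring (a minimal mesh): each node has two independent paths to the rest.
ring = [(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))]

cut = ("hub", "a")  # sever a single link
print(is_connected([e for e in star if e != cut], nodes))  # False: "a" is cut off
print(is_connected([e for e in ring if e != ("a", "b")], nodes))  # True: traffic reroutes
```

In graph terms, the attackers only needed to find a small edge cut; a mesh raises the number of simultaneous cuts required to isolate anyone.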

In any case, it was a sobering article. I’d like to believe it would be much harder to perpetrate a similar attack on my somewhat larger home town. But, more importantly, I’d like to think we are building a more reliable network of dependencies that exploits the extensive research on the subject.

SIGIR ’09 Accepted Papers

Thanks to Jeff Dalton for alerting me that SIGIR 2009 has announced the lists of accepted papers and posters. As Jon Elsas points out, the authorship looks quite different this year from previous years, with industry showing an especially strong presence:

  • 38% of the papers have at least one author from Microsoft (21 papers), Yahoo! (7 papers), or Google (3 papers)
  • No papers from current UMass researchers (though a number from alumni, and decent representation in the posters), and the only CMU papers accepted were based on work done during internships.

I’m not sure how to interpret this sudden change. Tighter university budgets? More openness on the part of industry? Regardless, I am excited about the papers. Here are a few (well, ten) paper titles that caught my eye:

  • A Comparison of Query and Term Suggestion Features for Interactive Searching
  • A Statistical Comparison of Tag and Query Logs
  • Building Enriched Document Representations using Aggregated Anchor Text
  • Dynamicity vs. Effectiveness: Studying Online Clustering for Scatter/Gather
  • Effective Query Expansion for Federated Search
  • Enhancing Cluster Labeling Using Wikipedia
  • Formulating Effective Queries: An Empirical Study on Effectiveness and Effort
  • Telling Experts from Spammers: Expertise Ranking in Folksonomies
  • When More Is Less: The Paradox of Choice in Search Engine Use
  • Where to Stop Reading a Ranked List? Threshold Optimization using Truncated Score Distributions

The posters look great too! I’m especially curious about these ten:

  • A Case for Improved Evaluation of Query Difficulty Prediction
  • A Relevance Model Based Filter for removing Bad Ads
  • An Evaluation of Entity and Frequency Based Query Completion Methods
  • Analysing query diversity
  • Cluster-based query expansion
  • Evaluating Web Search Using Task Completion Time
  • Has Adhoc Retrieval Improved Since 1994?
  • Is This Urgent? Exploring Time-Sensitive Information Needs in Community Question Answering
  • Relevance Criteria for E-Commerce: A Crowdsourcing-based Experimental Analysis
  • When is Query Performance Prediction Effective?

And, of course, I’m gearing up for the Industry Track. More details will be posted soon; of course, you’ll be the first to know.

Google Profiles: Nice Idea, Meh Execution

The top story on Techmeme is the new push Google is making around profiles. You can get a sense of the reactions from links to the official Google post. Here is a sampling:

My reaction, like the last one, is that this isn’t a big deal, at least not yet. Google profiles do have one really nice feature: verified names. At least in theory, that could give them an edge over other services, which do get their share of fakesters. And of course Google is in a position to give their own profiles more prominent exposure than anyone else’s, though they haven’t taken this tactic with Knol (remember Knol?), and I don’t think they will unless the profiles get a lot better.

As I’ve said before, Google doesn’t seem to have the knack for community. The profiles are spartan: they don’t have the professional information that makes LinkedIn so useful, or the personality that attracts people to Facebook. They don’t tie in to conversation like Twitter.

Google could, of course, improve its game. But it doesn’t have a great track record in this department. And they may be timid because success will bring them charges of monopolistic abuse. I am curious to see where they will go with the profiles, but not expecting much.