Categories
General

SIGIR 2009: Day 2, Morning Sessions (Anchor Text, Vertical Search)

Sorry for the delay in postings. Not only was I super-busy the past week, but I had some connectivity challenges (both at SIGIR and at the apartment where I was staying) and mostly restricted my online activity to occasional tweets during talks. I meant to catch up on my blogging yesterday, but instead spent the day wine tasting in Long Island. But enough apologizing, I’m refreshed and ready to blog up a storm!

The second day of SIGIR (Tuesday) started straight off with research talks. I went to the web retrieval session, which consisted of two talks about anchor text and one about privacy-preserving link analysis.

“Building Enriched Document Representations using Aggregated Anchor Text”, by Don Metzler and colleagues at Yahoo Labs. They address the challenge of anchor text sparsity (the distribution of in-links for web pages follows a power law) by enriching document representations through aggregation of anchor text along the web graph. Their technique is intuitive, and the authors demonstrate statistically significant improvements in retrieval effectiveness. Unfortunately, their results are not repeatable, since they used a proprietary test collection to obtain them.
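The intuition is simple: a page with few in-links can borrow anchor text from its neighbors in the web graph. Here's a minimal sketch of that idea--my own simplification for illustration, not the paper's actual weighting scheme:

```python
from collections import Counter

def aggregate_anchor_text(anchors, graph, damping=0.5):
    """Enrich each page's anchor-text model by mixing in anchor text
    from its graph neighbors. `anchors` maps page -> Counter of anchor
    terms from its in-links; `graph` maps page -> list of neighbors.
    One-step aggregation only; a toy version of the general idea."""
    enriched = {}
    for page, own in anchors.items():
        combined = Counter()
        for term, count in own.items():
            combined[term] += count
        # borrow (down-weighted) anchor text from neighboring pages
        for neighbor in graph.get(page, []):
            for term, count in anchors.get(neighbor, Counter()).items():
                combined[term] += damping * count
        enriched[page] = combined
    return enriched
```

A page with a single sparse anchor ("sports") that links to a well-anchored page ("news") picks up a damped share of the neighbor's terms, giving the retrieval model more to match against.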

The second talk of the session, “Using Anchor Texts with Their Hyperlink Structure for Web Search”, was by a group of authors from Microsoft Research Asia. They address the opposite problem of the previous paper: how to handle too much, rather than too little, anchor text. Specifically, they model dependence among multiple anchor texts associated with the same target document. Like the Yahoo folks, they demonstrate statistically significant results on a proprietary test collection.

The third talk, “Link Analysis for Private Weighted Graphs” (ACM DL subscribers only) by Jun Sakuma (University of Tsukuba) and Shigenobu Kobayashi (Tokyo Institute of Technology), was a bit of an outlier, if one can call a paper in a three-paper session an outlier. The authors offer privacy-preserving versions of PageRank and HITS, the best-known link analysis methods associated with relevance and authority in web search. I’ve noticed an increasing number of papers like these that mix cryptography with information retrieval or database concerns. One of my frustrations in reading such papers is that I always suspect that people are re-inventing wheels, because so few people are able to keep up with research in multiple disciplines.
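As background for readers who haven't implemented it: the non-private baseline the authors build on is just power iteration over the link graph. A minimal PageRank sketch:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of
    out-links. Dangling nodes distribute their rank mass uniformly,
    so total rank mass stays at 1.0 every iteration."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # teleportation term: (1 - d) / n to every node
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, outlinks in graph.items():
            if outlinks:
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            else:
                # dangling node: spread its rank over all nodes
                share = damping * rank[node] / n
                for target in nodes:
                    new[target] += share
        rank = new
    return rank
```

The privacy-preserving versions in the paper compute essentially this fixed point without any party revealing its local edge weights; the sketch above is only the plaintext computation.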

Then I had the coffee break to solve my own research problem: how to fill the 11:30 slot in the Wednesday Industry Track, since a speaker called in sick that morning. When I walked by the Bing table, I saw Jan Pedersen (Chief Scientist for Core Search at Microsoft), and I begged him to help me out. I must have been a persuasive supplicant, because he procured me Nick Craswell, an applied researcher who works on Bing. Out of gratitude for this 11th-hour favor, I wore a Bing t-shirt all day yesterday as I went wine-tasting. Bing drinking, not binge drinking!

Anyway, that urgent problem resolved, I went back to enjoying the conference. For the second morning session, I went to the vertical search session.

As it turns out, that session kicked off with the SIGIR Best Paper winner: “Sources of Evidence for Vertical Selection” by Jaime Arguello (CMU), Fernando Diaz (Yahoo), Jamie Callan (CMU), and Jean-François Crespo (Yahoo). The authors do a lot of things I like: they apply query clarity as a performance predictor, and they bootstrap on an external collection (specifically Wikipedia). The test collection they use for evaluation is proprietary, but that seems to be the price (at least today) of doing this kind of work.
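For those unfamiliar with query clarity: it's the KL divergence between a language model estimated from the top-ranked results and the collection-wide model, so an unambiguous query yields a result set whose vocabulary diverges sharply from the collection. A simplified sketch using maximum-likelihood models (not the paper's exact estimation setup):

```python
import math
from collections import Counter

def clarity_score(result_docs, collection_docs, smoothing=1e-9):
    """Query clarity: KL divergence (in bits) between a unigram model
    of the top-ranked results and the collection model. Docs are lists
    of tokens; both models are simple maximum-likelihood estimates."""
    query_model = Counter(w for doc in result_docs for w in doc)
    coll_model = Counter(w for doc in collection_docs for w in doc)
    q_total = sum(query_model.values())
    c_total = sum(coll_model.values())
    score = 0.0
    for term, count in query_model.items():
        p_q = count / q_total
        p_c = coll_model.get(term, 0) / c_total + smoothing
        score += p_q * math.log2(p_q / p_c)
    return score
```

A focused result set scores high; a result set that looks just like the collection scores near zero, which is what makes clarity useful as a pre-retrieval signal for deciding whether a vertical is likely to have relevant content.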

The second talk of the session was by a subset of the previous paper’s authors: “Adaptation of Offline Vertical Selection Predictions in the Presence of User Feedback” by Fernando Diaz and Jaime Arguello. The authors creatively used simulation to evaluate their approach. They did a nice job, but I have to admit I’m skeptical of results about feedback that aren’t based on user studies.

Unfortunately, I missed the third talk of the session because I had to play organizer. But I must have earned some good karma, because I got to enjoy a delightful lunch with Marti Hearst and David Grossman.

Stay tuned for more posts about the interactive search session, the keynote by Albert-László Barabási, the banquet at the JFK Presidential Library and Museum, and of course the Industry Track.

Categories
General

SIGIR 2009: Day 1

SIGIR ’09 is in full swing!

I arrived on Sunday evening, and the reception was like Cheers (“where everyone knows your name”)–only that, at least in my case, I was meeting many people face-to-face for the first time in years, and in some cases for the first time, period! I reconnected with some of the SIGIR regulars whom I’d missed last year (Singapore was a bit far for me), finally met my editor, Diane Cerra from Morgan & Claypool, and even ran into someone who is evaluating my company’s technology. And that was just Day 0.

Day 1 started bright and early with the 7:00 am newcomer’s breakfast, which brings together newcomers and “old hands”. I believe my role as organizer qualified me as an “old hand”, even though this is only my third SIGIR. That might explain why Justin Zobel, a real old hand (and one of this year’s program chairs), joined my table. Of course, he hadn’t read my post about his recent SIGIR Forum essay, so we chatted a bit about recall. Not surprisingly, we mostly agreed, and I have to give the essay credit for provoking that and other good discussions today.

Then the conference started in earnest, with Liz Liddy bestowing the Salton Award on Susan Dumais. In the tradition of the award, Sue delivered a keynote recounting her personal journey through the space of information retrieval. I was thrilled that her recognition called out her work to bring together information retrieval and human-computer interaction. Of course, some of us were ahead of the curve, having recruited her as the keynote speaker for HCIR ’08. 🙂 I asked her why transparency–which she cited as a reason that users in her Stuff I’ve Seen work preferred to explicitly sort results by date rather than accept the system’s best-first relevance ranking–was so absent in web search. Her answer was interesting: she feels that transparency is most useful for re-finding, and least useful for discovery. I’m not sure I agree with that explanation, but I’ll at least think about it a bit before I commit to disagreeing with it.

Some coffee, and then off to the first session of research papers. The presentation that stood out for me in this session was “Refined Experts”, presented by Paul Bennett. The paper offers a nice technique for improving hierarchical classification (by addressing the problems of error propagation through the hierarchy and the inherent non-linearity of hierarchies), and Paul is an outstanding presenter.

Then Diane Cerra and Gary Marchionini took the Morgan & Claypool authors (and a few authors-to-be) to lunch at Brasserie Jo. Great food, and even better company. My only regret is that I missed one of the talks in the first session after lunch, “A Statistical Comparison of Tag and Query Logs”. I did like David Carmel’s talk on “Enhancing Cluster Labeling Using Wikipedia” in that same session, though I’ll need to do some homework to figure out what distinguishes it from other work in this area, such as an ICDE 2008 paper by Wisam Dakka and Panos Ipeirotis on “Automatic Extraction of Useful Facet Hierarchies from Text Databases”.

In the following session, I attended a couple of the efficiency talks. The talks were well presented, but in both cases I wondered if they were addressing the right problems. I’ve felt this way before at SIGIR efficiency talks, so perhaps my tastes are just idiosyncratic.

Then came the poster / demo reception. Even with three hours, there was far too much to take in–and of course that session is as much about networking as it is about the posters and demos. I enjoyed the three hours, but I’ll have to go back to the proceedings to learn more about what I saw–and what I missed.

Finally, I wrapped up by leading a crew to Tapeo for dinner–apparently a popular choice for attendees, since another table of 6 arrived shortly afterward. It was a nice cap to a fantastic but exhausting day.

I can’t promise I’ll keep this up daily, but I will blog about the rest of the conference when I have the chance. Meanwhile, here are some other folks blogging about SIGIR ’09:

And follow along on Twitter. The preferred hashtag is #sigir09, but I follow sigir OR sigir09 to be safe.

Categories
Uncategorized

Heading to SIGIR

Hope to see lots of you at SIGIR! Sounds like there are already great tutorials underway. I’ll get there tonight for the reception, where they will announce the triennial Gerard Salton Award winner (who will deliver tomorrow’s opening keynote). I’m looking forward to the paper, poster, and demo presentations, and of course to the Industry Track on Wednesday. Unfortunately, I have to return to my day job on Thursday, so I won’t be able to attend any of the workshops.

If you’re attending, I hope you’ll find me and say hi–after over a year of blogging, there are far too many people I’ve gotten to know but never met face to face! If you’re not attending, then I encourage you to follow the coverage on Twitter. Since there seems to be some confusion about which hashtag to use, I suggest you follow sigir OR sigir09 OR sigir2009 (yes, there is sometimes value to favoring recall). I promise to blog about it when I get back, but I hope you’ll forgive me if The Noisy Channel is a bit quiet over the next few days.

Categories
General

In Defense Of Recall

It’s not every day that you see an essay in SIGIR Forum entitled “Against Recall”. Well, to be fair, the full title is “Against Recall: Is it Persistence, Cardinality, Density, Coverage, or Totality?” In it, Justin Zobel, Alistair Moffat, and Laurence Park, all researchers at the University of Melbourne, conclude that “the use of recall as a measure of the effectiveness of ranked querying is indefensible.”

It’s a well-written and well-argued essay, and I think the authors at least have it half-right. I agree with their claim that, while precision dominates quantitative analysis of search effectiveness in the research literature, the expressed concerns about recall tend to be more qualitative. Part of the problem, as they note, is that recall is much harder to evaluate than precision (assuming the Cranfield perspective that the relevance of a document to a query is objective).
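For readers outside the field, the asymmetry is easy to see in code. Precision and recall are set-based measures; computing precision only requires judging the documents you retrieved, while computing recall requires knowing the full set of relevant documents, which for a large collection is effectively unknowable:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall. Precision needs judgments only
    on the retrieved set; recall needs the complete relevant set,
    which is what makes recall so hard to evaluate in practice."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

So if a system retrieves four documents, two of which are among five relevant ones, precision is 0.5 but recall is only 0.4, and we can only know the 0.4 if someone has exhaustively judged the collection.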

The authors propose a variety of alternate measures that, in their view, are more useful than recall and are actually what authors really mean when they allude to recall. The most interesting of these, in my view, is what they call “totality”. Indeed, I thought the authors were addressing me personally when they wrote:

It is usual for certain “high recall applications” to be cited to rebut suggestions that recall is of little importance. Examples that are routinely given include searching for precedents in legal cases; searching for medical research papers with results that relate to a particular question arising in clinical practice; and searching to recover a set of previously observed documents.

Yup, I’m listening. They continue:

While we agree that these are plausible search tasks, we dispute that they are ones in which recall provides an appropriate measurement scale. We argue that what distinguishes these scenarios is that the retrieval requirement is binary: the user seeks total recall, and will be (quite probably) equally dissatisfied by any approach that leaves any documents unfound. In such a situation, obtaining (say) 90% recall rather than a mere 80% recall is of no comfort to the searcher, since it is the unretrieved documents that are of greatest concern to them, rather than the retrieved ones.

Whoa there, that’s quite a leap! Like total precision, total recall is certainly an aspiration (and a great Arnie flick), but not a requirement. There are lots of information retrieval applications where false negatives matter to us a lot more than false positives–notably in medicine, intelligence, and law. But often what is binary for us is not whether we find all of the “relevant” documents for each individual query–and here I use the scare quotes to assert the subjectivity and malleability of relevance–but rather whether we ultimately resolve our overall information need.

Let me use a concrete example from my own personal experience. When my wife was pregnant, she had gestational diabetes. She treated it through diet, and up through week 36 or so things were fine (modulo the trauma of a Halloween without candy). And then one of her doctors made an off-hand allusion to the risk of shoulder dystocia. She came home and told me this, and of course we spent the next several hours online trying to learn more. We had a very specific question: should we opt for a Cesarean section?

I can tell you that no search engine I used was particularly helpful in making this decision. I was hoping there might be analysis out there comparing the risks of shoulder dystocia with the risks associated with a Cesarean, especially for women who have gestational diabetes. I couldn’t find any. But worse, I had no idea if there was helpful information out there, and I had no idea when to stop looking. Ultimately we took our chances, and everything turned out great–no shoulder dystocia, no Cesarean, and a beautiful, healthy baby and mother. But it would have been nice to feel that our decision was informed, rather than a nerve-wracking coin toss.

Let’s abstract from this concrete example and consider what I characterize as the information availability problem, where the information seeker faces uncertainty as to whether the information of interest is available at all. The natural evaluation measures associated with information availability are the correctness of the outcome (does the user correctly conclude whether the information of interest is available?); efficiency, i.e., the user’s time or labor expenditure; and the user’s confidence in the outcome.

It’s worth noting that recall is not on the list. But neither is precision. We’re trying to measure the effectiveness of information seeking at the task level, not the query level. Still, it’s pretty easy to see how precision and recall fit into this scenario: precision at the query level helps most with improving efficiency at the task level, while recall helps improve correctness of outcome. Finally, perceived recall should help inspire user confidence in the outcome.

To circle back to the essay, I said that the authors were at least half right. They criticize the usefulness of recall for measuring ranked retrieval, and I think they have a point there–ranked retrieval inherently is more about precision than recall. Recall is much more useful as a set retrieval measure. The authors also note that “the idea that a single search will be used to find all relevant documents is simplistic.”

Indeed, I’d go beyond the authors and assert, straight from the HCIR gospel, that the idea that a single search will be used to fully address an information seeking problem is simplistic. But that assumption is the rut where most information retrieval research is stuck. The authors make legitimate points about the problems of recall as a measure, but I think they are missing the big picture. They do cite Tefko Saracevic; perhaps they should look more at his communication-based framework for thinking about relevance in the information seeking process.

Categories
General

LinkedIn Rolling Out Faceted Search!

I’m glad I have a Twitter alert for “faceted search”, since it alerted me (via @getzsch) to a post in TechCrunch announcing that LinkedIn now has a People Search beta that offers faceted search. I can disclose now that I’ve known about this project for a while–they’d reached out to me after I offered a lukewarm review of their search–but I was asked to be discreet about that knowledge.

In any case, I wish I’d known about the beta launch earlier today, when I was looking for Boston-area colleagues to help me publicize the SIGIR Industry Track! The current interface is much more supportive of exploration.

It’s a nice implementation. The interface lets you refine the text search results by location, relationship (1st degree, 2nd degree, group, and other), industry, current / past company, and school. For a facet with a large number of values, like company, the interface only displays the top 10 values, and then lets you use type-ahead to refine by other companies. Unfortunately, the type-ahead was a bit buggy for me–but hey, it is a beta.
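Under the hood, an interface like this boils down to two operations: counting facet values over the current result set (truncated to the top values, as LinkedIn does) and intersecting a chosen refinement with the results. A toy sketch of that logic–nothing like LinkedIn's actual implementation, of course:

```python
from collections import Counter

def facet_counts(results, facet, top_n=10):
    """Count the values of one facet (e.g. 'company') across the
    current result set, returning only the top_n values--mirroring
    an interface that truncates long value lists."""
    counts = Counter(doc[facet] for doc in results if facet in doc)
    return counts.most_common(top_n)

def refine(results, facet, value):
    """Narrow the result set to documents matching a facet value."""
    return [doc for doc in results if doc.get(facet) == value]
```

Each refinement re-runs the counts over the narrowed set, which is why the displayed value counts always stay consistent with what's on screen–and why doing this responsively over millions of profiles is a real engineering feat.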

The application is fairly responsive, even for my search for “software”, which returns 2.4M results, 120K of which are 2nd-degree connections. Other than at Endeca, I haven’t seen anyone else mix faceted search with social networks, and LinkedIn has done a nice job of it.

So, if anyone from LinkedIn is reading this, congratulations and welcome to the wonderful world of faceted search. Count me a delighted customer. I hope my enthusiasm today makes up for my past criticism.

Categories
General

SIGIR: Meet the Who’s Who of Search and Information Retrieval

Matt Cutts. danah boyd. Bruce Croft. Marti Hearst. What do these people have in common? If you’re thinking that they are some of the biggest names in the research and practice of search and information retrieval, then you have at least part of the answer. For full credit, the answer is that they are some of the people who will be presenting during the SIGIR Industry Track on Wednesday, July 22nd at the Sheraton Boston Hotel.

There have been some changes from the original program. As noted above, Bruce Croft and Marti Hearst are now participating. They will offer research counterpoints to the panels of industry practitioners (vendors and analysts). Autonomy bowed out of the vendor panel; instead, we’re including Raul Valdes-Perez, the executive chairman (and founder) of Vivisimo. We also received regrets from the Open Calais folks at Thomson Reuters; instead, we’ll hear from Evan Sandhaus, semantic technologist at the New York Times.

Here is the full list of participants, in order of appearance:

If you’re registered for the full SIGIR conference, then you are entitled to attend any or all of the Industry Track at no additional charge.

Otherwise, there’s also a one-day registration option for people only interested in attending the Industry Track. The cost for that one-day option is $350 (half of that for students). I should be able to get you the early registration rate of $300 if you contact me as soon as possible, and I can try to negotiate a group rate if a company wants to send at least four people.

If you live in the Boston area and are interested in search and information retrieval (or you know people who are), this is an incredible opportunity to see the worlds of research and practice come together. I’m excited to be organizing it, and looking forward to attending it. If you have not yet registered but are interested in attending, please let me know ASAP, and I’ll see what I can do.

Categories
General

Design For Interaction: My SIGMOD Slides

These are the slides I presented at SIGMOD a couple of weeks ago. The animation on a few of the slides doesn’t come through on SlideShare, but you can always download the PowerPoint if you are so inclined. The other talks in the invited session on Human-Computer Interaction with Information, by Ed Chi and Jeff Heer, were fantastic, as was the joint keynote on visualization by Fernanda Viégas and Martin Wattenberg.

Categories
Uncategorized

Faceted Search Book Is Shipping

Amazon and Barnes & Noble are both shipping the faceted search book, so hopefully all of our pre-orders are finally leaving the warehouses. My apologies for the delays, and my thanks to everyone who has been so patient. Apparently a few people had trouble using the publisher’s site; at this point I suggest using Amazon or BN, since both offer competitive prices.

Categories
General

Catching Up On Last Week’s News

I hope everyone had a great week! It looks like I missed some interesting / controversial stories in the tech news / blogosphere, the most notable being:

Quick reactions:

Regarding the anti-SQL movement, I would have thought the main complaint would be that SQL is too arcane a language for ordinary users to ever use it directly. Instead, the article discusses developers’ complaints about databases, and these are mostly about price, speed, and scale. Evidently even free, open-source databases like MySQL are losing favor relative to tools like Hadoop and Hypertable that don’t offer support for SQL. Of course, this picture comes from a meetup of 150 people that might not be entirely representative of information technology workers.

I know first-hand from my experience at Endeca that, to quote Michael Stonebraker, the “one size fits all” approach to databases is an idea whose time has come and gone. At Endeca, we have built our own special-purpose database to address information needs ill-served by the available OLTP and OLAP technologies. Still, I think it’s premature to declare the death of SQL or of relational databases. But why let that stand in the way of a good story?

On to the open-source search engine comparison. I won’t rehash the critique of the study, which you can find in the 80+ comments from folks like Jeff Dalton, Bob Carpenter, and Otis Gospodnetic. Perhaps the most salient point is that it’s not clear how much sense it makes to perform “out of the box” evaluations. In any case, my impression is that Lucene is by far the dominant player in the open-source search space; the study, if it has any effect at all, will only reinforce that dominance.

And finally, the big news from the big G: a Google Operating System. Even my mom (who couldn’t name an existing operating system) was asking me about it, so clearly this one has made it into the mainstream media. And yet I don’t see why this is such a big deal. We have netbooks, and we even have Linux-based netbooks. As far as I’ve heard, the latter are popular with geeks and cheapskates, but that’s about it–most people are willing to pony up the few extra dollars for Windows XP. Will Google launching a netbook-oriented OS significantly affect this market? I suspect the only route to success is if they meet non-technical users’ needs (browsing, email, media, light document editing) while minimizing their overhead (maintenance, security, compatibility). Will they be a better Ubuntu? Perhaps, much in the way that Chrome is trying to be a better Firefox. Why Google chooses to build its own free, open-source products rather than contribute to mature open-source projects is a mystery to me, but it’s their money and time to spend.

I think that covers the week’s big stories–or at least those that matter most to Noisy Channel readers. Somehow I didn’t manage to come up with an IR / HCIR angle on the Michael Jackson story, or perhaps it’s just that Danny Sullivan beat me to it.

Anyway, I’m back in the saddle, and should soon be back to my normal posting volume. Thank you all for being patient.

Categories
Uncategorized

Taking Time Off

I’ll be offline for about a week, returning on July 13th. No, I’m not going to Argentina or even hiking the Appalachian trail, but I am going off the grid to spend quality time with my wife and daughter. See you all soon!