Categories
General

SIGIR 2009: Day 3, Industry Track: Analyst Panel

The morning sessions of the SIGIR 2009 Industry Track consisted of five individual presentations; the afternoon consisted of two panels. The requirement to synchronize with the research talks led to the allocation of 90 minutes for each panel–which was a bit more than I’d originally planned on (and this change, like many, occurred in the two weeks before the conference). James Allan, one of the SIGIR 2009 co-chairs, suggested that we add an academic responder to each of the panels to account for the additional time, and we went with that approach.

The first of the two afternoon panels consisted of industry analysts: Whit Andrews (Gartner), Sue Feldman (IDC), and Theresa Regli (CMS Watch). I moderated the panel–or, more accurately, attempted to moderate it. Marti Hearst served as the academic responder.

The panel opened with each of the three panelists making an opening statement, sharing their perspectives about the key business concerns and trends in the search industry. I asked them to talk about enterprise search in the broadest sense of the term (search applications that companies buy or build) rather than in the narrow sense of no-frills intranet search.

It became immediately clear from their opening statements that the panelists had wildly different perspectives and styles. While I think the term “food fight” that I heard bandied around afterward is a bit of an exaggeration, they certainly engaged in a heated debate.

One topic that attracted particular controversy was how enterprise search applications should assign relevance to search results. Whit suggested that, in an enterprise setting, the main objective function is the profitability of the enterprise, and that relevance should essentially be money driven. Sue and Theresa disagreed sharply, mainly arguing that relevance should be user-controlled.

I’m probably oversimplifying their arguments, and in any case shouldn’t take sides in a debate among analysts! Still, Whit probably won’t be surprised that my sympathies generally lie with the users. That said, Whit is right that enterprise search companies sell to enterprises, not directly to customers, and those enterprises (like web search companies that sell to advertisers) may have interests that aren’t always aligned with those of users. At Endeca, we advise our customers on how to configure and communicate a relevance ranking strategy, but ultimately our customers make their own decisions. After all, it’s their site and their money.

And that leads to the other topic that caught my attention and came up during Marti’s responder session: the question of how analyst firms make money. All three of the panelists were open about how their employers make money, whether from enterprise buyers, vendors, or some combination thereof. My personal preference would be that analysts make money primarily from enterprise buyers–but of course I work for a frugal vendor. I’ve heard from a variety of sources that Endeca “doesn’t spend enough” on analyst services–or on corporate marketing in general. Since neither vendors nor analyst firms open up their books, I can only speculate. Fortunately, it’s clear the analysts on the panel not only have integrity, but also have strong enough views that they probably couldn’t be swayed by money or pressure.

As organizer of the track, I intended for the panel to offer an audience of mostly academic types a chance to see people whose opinions influence tens (if not hundreds) of millions of dollars in purchasing decisions. I hope I accomplished that. I’m especially grateful to these highly billable analysts for freely sharing their time and ideas. Neither SIGIR nor I could possibly have afforded their market rates!

For other perspectives, take a look at Theresa’s blog post, “Know Your Relevance”, or Mary McKenna’s summary post.

Categories
General

SIGIR 2009: Day 3, Industry Track: Nick Craswell

One of the things I didn’t consider when I signed on to organize the SIGIR 2009 Industry Track was that I’d have to replace speakers and panelists on less than two weeks’ notice. But what I couldn’t even have imagined was replacing a speaker on less than 24 hours’ notice!

Tuesday morning, the second day of the conference and the day before the Industry Track, I woke up to an email from Tip House of the OCLC, whom I’d planned to have speak about his experiences developing Worldcat.org, the world’s largest bibliographic database. Unfortunately, he had fallen ill and would not be able to make it to the conference.

I was determined not to have a hole in the program. I immediately sent an email to the Director of Search at LinkedIn, whom I had just met at the poster session the previous evening, hoping he might have a presentation tucked away about LinkedIn’s recent launch of faceted people search. I also turned to Twitter–which actually earned me a plausible suggestion.

But it was during the morning coffee break that serendipity struck. As I walked by the Bing exhibitor table, I saw Jan Pedersen, Chief Scientist of Core Search at Microsoft, chatting with Peter Bailey, an applied researcher on the Bing team. I turned to them and, in my most charming voice, asked if they might be interested in having someone on their team talk about Bing the next day. They took a few minutes to think it over, and then replied in the affirmative, producing Nick Craswell, also an applied researcher. Problem solved, and I can proudly say that I Binged for it!

Nick talked about query modeling, focusing on issues like query ambiguity, session context, and temporal query dynamics (particularly seasonality). He talked a bit about a technique that involved random walks on click logs–a technique I remember striking me when I first heard him talk about it at ECIR 2008.
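
For readers who haven't seen the idea, here is a minimal sketch of a random walk over a click log. It is a generic illustration rather than Nick's method: the click counts, the example URLs, and the function names are all made up for the example. Queries and URLs form a bipartite graph, and a walk that starts at a query can reach URLs that were never clicked for that query.

```python
from collections import defaultdict

# Hypothetical click log: (query, url, click count). Invented for illustration.
CLICKS = [
    ("jaguar", "cars.example/jaguar", 8),
    ("jaguar", "zoo.example/big-cats", 2),
    ("jaguar cat", "zoo.example/big-cats", 5),
    ("jaguar cat", "wiki.example/panthera-onca", 5),
]

def build_graph(clicks):
    """Bipartite graph: query nodes on one side, URL nodes on the other,
    with click counts as symmetric edge weights."""
    graph = defaultdict(dict)
    for query, url, count in clicks:
        graph[("q", query)][("u", url)] = count
        graph[("u", url)][("q", query)] = count
    return graph

def walk(graph, start, steps):
    """Exact probability distribution of a random walk after `steps` steps.
    Each step follows an edge with probability proportional to its clicks."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in dist.items():
            total = sum(graph[node].values())
            for neighbor, weight in graph[node].items():
                nxt[neighbor] += p * weight / total
        dist = dict(nxt)
    return dist

# Three steps from the query "jaguar": query -> URL -> query -> URL.
dist = walk(build_graph(CLICKS), ("q", "jaguar"), 3)
for node, p in sorted(dist.items(), key=lambda item: -item[1]):
    print(node, round(p, 4))
```

In this toy log nobody clicked on the wiki page for "jaguar", yet the walk gives it some probability because it shares a clicked URL with "jaguar cat".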

The talk was a bit raw–understandably so given the short notice. But it was great to see a major web search practitioner connecting information retrieval research to actual product. Yes, there were the standard caveats about not revealing secret sauce, but the talk was open and substantive. Indeed, I hope Nick will be able to share the slides!

UPDATE: Nick emailed me the slides and gave me permission to post them here.

Categories
Uncategorized

An Apology to Vijay Gill

I don’t know if Google’s Vijay Gill reads this blog. But a post of his just caught my attention, and I feel I owe him an apology.

A little over a month ago, I wrote a post entitled “Even Google Should Beware Of Hubris”. I stand by much of that post. But I specifically said:

And, just a few days ago, Google’s senior manager of engineering and architecture punctuated a panel discussion at the Structure 09 conference–where he was sharing a stage with a counterpart from Microsoft–with the punchline “If you Bing for it, you can find it.”

Apparently I shouldn’t believe everything I read in The Register. Vijay Gill, the manager quoted above, wrote a post on his blog that appeared shortly after The Register article (and after my post), entitled “Google Does Not Mock Bing”. Here’s the most relevant paragraph:

I wasn’t mocking Bing when I said “Bing for it, you can find it.” I meant that seriously, in the spirit of giving props to a competitor, and a good one at that. Najam and I have been friends since before Google had a business plan, and I have the greatest respect for him and for Microsoft as a company. The Microsoft approach has some good points, which work for their business plan. I was speaking of one particular approach, among several others, which can solve the same problem. There was no undercutting anything, there are two approaches and thats that.

Vijay, if you’re reading this, I’m sorry for taking so long to notice your clarifying post. I hope most of your fellow Googlers are as respectful of your competitors.

Categories
General

SIGIR 2009: Day 3, Industry Track: Evan Sandhaus

Back to our regularly scheduled blogging about the SIGIR 2009 Industry Track. For those who haven’t been reading along, we covered the first three talks:

  • Matt Cutts (Google)
  • danah boyd (Microsoft Research)
  • Vanja Josifovski (Yahoo!)

As you can see, that covered the three major web search engine companies, at least at the time of the conference. Sure, danah’s talk wasn’t exactly what people might have expected, but I had some creative license as an organizer, and the audience loved her talk. Besides, as we’ll get to in the next post, Microsoft had other opportunities to present representatives of its more conventional information retrieval divisions.

The next speaker, according to the original plan, was to be Tom Tague, who leads the Open Calais project at Thomson Reuters. Unfortunately, a week before SIGIR, I found out that he would be unable to make it. One of his colleagues offered to present in his stead less than 24 hours after his cancellation, but by then I’d already found a replacement on my own: Evan Sandhaus, Semantic Technologist in the New York Times Research and Development Labs, agreed to talk about “Corpus Linguistics and Semantic Technology at the New York Times”.

You can get a good idea of his talk from these slides–or, better yet, from this video. Both are from the closing keynote that he and his colleague Rob Larson delivered at the 2009 Semantic Technology Conference.

I won’t try to recapture Evan’s fascinating narrative about the history of information storage and retrieval at the New York Times. Rather, I’ll skip to the parts that should matter most to information retrieval researchers and practitioners: the availability of the New York Times Annotated Corpus through the Linguistic Data Consortium (LDC), and the New York Times’s intention to contribute to the Linked Data Cloud.

For me personally, the annotated corpus is the bigger deal. It represents 1.8 million articles written over 20 years. It is annotated both with manually-supplied summaries and tags–the latter drawn from a controlled vocabulary of people, organizations, locations and topic descriptors–and with algorithmically-supplied tags that are manually verified. My colleagues and I at Endeca have been working with the annotated corpus, and it is a delight. I hope that the information retrieval community will make heavy use of this wonderful new resource.

Categories
General

Are Academic Conferences Broken? Can We Fix Them?

I’d hoped to get through all of the SIGIR 2009 Industry Track before blogging about anything else (such as Yahoo! search going bada-Bing), but clearly I’m taking too long. So I’m following Daniel Lemire’s suggestion that I post a recent comment on Lance Fortnow’s blog (actually a response to his CACM column entitled “Time for Computer Science to Grow Up”) here at The Noisy Channel.

It’s nice to see this piece joining a growing chorus questioning the way we conflate the distinct concerns of disseminating knowledge, establishing professional reputation, and building community. This problem is not unique to computer science, but we are certainly in a position to lead by example in addressing it.

In an age where distribution is nearly free, I agree that we should move the filtering role from content publishers to content consumers. There’s no economic reason today why scholarship (or purported scholarship) shouldn’t be published online. Of course, the ability to publish digital content for free (or close to free) does not imply anyone will (or should) read what you write. The blogosphere offers an instructive example: the overwhelming majority of blogs attract few (if any) readers. I suspect that the same holds true for arXiv.org. Peer-reviewed content may not fare that much better, particularly given the proliferation of peer-reviewed venues. Regardless, it makes no sense for publishers to act as filters in an age of nearly-free digital distribution.

That brings us to the question of how researchers should establish their professional reputation–and, in the case of academics, obtain tenure and promotion. Today, they have to publish in peer-reviewed journals and conferences. Even if we accept the weaknesses of the current peer-review regime, we should be able to separate content assessment from distribution. The peer-review process (and review processes in general) should serve to endorse content–and ideally even to improve it–rather than to filter it.

Finally, conferences should primarily serve to build community. I find the main value of conferences and workshops to be face-to-face interaction, and I’ve heard many people express similar sentiments. Part of the problem is that so few presenters at conferences invest in (or have the skills for) delivering strong presentations. But more fundamentally it’s not even clear that the presentations are the point of a conference–after all, an author’s main motive for submitting an article to a conference seems to be getting it into the proceedings.

Here are some questions I’d like to suggest we consider as a community:

What if presentation at a conference were optional, and an author’s decision to present had no effect on inclusion in the proceedings? Would there be significantly fewer presentations? Would those fewer presentations be of higher quality?

What if the process of peer-reviewing conference submissions required the submission of presentation materials rather than (or in addition to) a paper? Would the accepted presentations be of higher quality? Would researchers invest more in presentation skills? What would happen to strong researchers without such skills?

Can we update the traditional conference format to foster more productive interaction among researchers? For example, should we have more poster sessions and fewer paper presentations?

I’d love to see the computer science community take the lead in evolving what increasingly feel like dated procedures for disseminating knowledge, establishing professional reputation, and building community. I’ve tried to do my small part, co-organizing workshops on Human-Computer Interaction and Information Retrieval (HCIR) that emphasize face-to-face interaction and organizing the SIGIR 2009 Industry Track as a series of invited talks and panels from strong presenters. But I’m encouraged to see “establishment” types like Moshe Vardi and Lance Fortnow leading the charge to question the status quo.

Categories
General

SIGIR 2009: Day 3, Industry Track: Vanja Josifovski

After the conference banquet at JFK Library and Museum, a few of us went to Bukowski for beers. At one point in the conversation, a friend of mine railed against computational advertising as a research topic. I didn’t quite have the nerve to reply that it was one of the topics I’d picked for the SIGIR 2009 Industry Track that would take place the following day.

Finding a speaker for this subject was relatively straightforward. I hadn’t yet recruited anyone from Yahoo!, and I knew that Yahoo! was the place to look for computational advertising experts. So I emailed Prabhakar Raghavan, and he suggested Vanja Josifovski. I’d never met Vanja or heard him speak, but a quick look at his publications and experience was more than enough to convince me. I was delighted when Vanja agreed to participate, presenting “Ad Retrieval – A New Frontier of Information Retrieval”.

I was even more delighted by the actual presentation, which you can download here. Perhaps more than any of the other speakers, Vanja embodied the spirit of the Industry Track, which was to bring together the worlds of research and practice in information retrieval.

He started by making the case for textual advertising as an area worthy of study. He pointed out that, while advertising supports much of our access to search engines and online content, most users perceive ads as less relevant than the other content they access. In other words, there is a significant opportunity for those in the advertising business to broadly improve the online user experience while making money.

He then proceeded to explain the anatomy of a textual ad. If you’re not familiar with the details, I encourage you to look at his presentation. But I’ll reproduce what I feel was his most important slide here, slide #15, titled “Ads as Information”:

  • Treat the ads as documents in IR
    • [Ribeiro-Neto et al. SIGIR 2005] [Broder et al. SIGIR 2007] [Broder et al. CIKM 2008]
  • Retrieve the ads by evaluating the query over the ad corpus
  • Use multiple features of the query and the ad
  • How does Ad retrieval relate to Web search?
    • Web search:
      • Large corpus
      • Reorder the pages that contain all the query terms
    • Ad retrieval:
      • Smaller corpus
      • Similarity search rather than conjunction of the query terms: recall in the first phase important

There’s a lot more to the talk, but hopefully that slide conveys how well Vanja posed ad retrieval as a distinctive information retrieval problem worthy of researchers’ attention.
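
The contrast on that slide between conjunctive matching and similarity search can be sketched in a few lines of Python. This is a toy illustration, not anything shown in the talk: the ad texts, the bag-of-words cosine scoring, and the function names are all assumptions made for the example, and real systems use many more features of the query and the ad.

```python
import math
from collections import Counter

# Hypothetical toy ad corpus, invented for illustration.
ADS = {
    "ad1": "cheap flights to boston",
    "ad2": "boston hotel deals",
    "ad3": "discount running shoes",
}

def tokens(text):
    return text.lower().split()

def conjunctive_match(query, docs):
    """Web-search style first phase: keep only docs containing ALL query terms."""
    wanted = set(tokens(query))
    return [doc_id for doc_id, text in docs.items() if wanted <= set(tokens(text))]

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_search(query, docs, k=2):
    """Ad-retrieval style first phase: rank by similarity, so partial
    matches still surface (recall matters when the corpus is small)."""
    q = Counter(tokens(query))
    scored = [(cosine(q, Counter(tokens(text))), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

query = "boston flights hotel"
print(conjunctive_match(query, ADS))   # no ad contains all three terms
print(similarity_search(query, ADS))   # partial matches are still retrieved
```

With a corpus this small, requiring every query term returns nothing, while the similarity ranking still finds the two Boston travel ads.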

Ironically, I’m not a big fan of advertising, and I see the dominance of the ad-supported model as a bug, rather than a feature, of our current online ecosystem. But I’m realistic enough to know that this dominance is a fact of life for the foreseeable future, and I appreciate that better targeted advertising is a win/win for both advertisers and their audiences.

More importantly, I expect that efforts to improve advertising will result in advances in information retrieval that have broader applications. Vanja’s presentation advertised those benefits brilliantly.

Categories
General

SIGIR 2009: Day 3, Industry Track: danah boyd

After I secured Matt Cutts as a speaker for the SIGIR 2009 Industry Track, I suppose I became a bit cocky. I decided that I wanted another speaker who would not only be interesting, but also would have the star power to put the event on the map. One of the topics on my list was social media. So I decided, why not, I’ll try to get danah boyd.

This turned out to be no easy task. At the time, danah was on an email sabbatical (she was just wrapping up her dissertation at Berkeley). I’d actually been in touch with her several years ago, when Friendster was the only social network in town, and danah and Jeff Heer (whom I recently met at SIGMOD 2009) were working on visualizing it. I have my own history in network visualization, and I’d hoped to get access to their data. But no such luck. Still, discovering danah led me to read her master’s thesis (at the MIT Media Lab) on “Faceted Id/entity: Managing representation in a digital world”.

I had actually given up on reaching her after a few weeks–reluctantly, I fell back to plan B. Fortuitously, however, my plan B fell through, and I decided to try again. At long last I did reach her, only to discover I had to accomplish what I worried would be an even harder job: convincing her that her ethnographic research would be a good fit for an information retrieval audience. So I sent her this pitch:

I’d love to hear you talk about how the evolution of social media has changed the context of search. Going back to your master’s thesis, the collapsing of situational context in searchable archives has not only wreaked havoc on personal identity, but also made it difficult for searchers to meaningfully navigate those archives. And, since publication and search are flip sides of the same coin, we need to understand this dynamic better, especially as there is an increasing trend towards conducting public conversation.

It worked! The next thing I knew, she was on board to give a talk about “The Searchable Nature of Acts in Networked Publics”. As I expected, she was a fantastic speaker, and she had no problem engaging the audience. Rather than try to summarize her talk in detail, I refer you to a recent blog post of hers that covers very similar material. You can also read summaries of the actual talk by Mary McKenna at SemanticHacker and Daniele Quercia at MobBlog.

I saw the most salient theme of her talk as the need to interpret people’s behavior on online social networks (which, as she points out, come in many different flavors) in terms of their intentions. For example, teens on MySpace lie about their age, but they don’t see this behavior as deceptive–after all, their online friends (who are the people for whom they publish their profiles) all know how old they are. In general, all of the information we provide online needs to be viewed through an appropriate contextual lens. Unfortunately (or perhaps fortunately for those who still are clinging to hopes of privacy), our data mining practices are a bit behind the curve.

The proliferation of user-generated content and public conversation–both of which amount to a democratization of publishing–is changing the ways we need to approach information retrieval in practice. danah’s talk left me with more questions than answers, but I appreciate her insightful snapshot of how people are using social networks today. And that snapshot only reinforced my certainty that I want to devote my life (or at least the next several years) to understanding and improving how people interact with information.

Categories
General

SIGIR 2009: Day 3, Industry Track: Matt Cutts

At last we arrive at the SIGIR 2009 Industry Track. Since I organized this track (which mainly involved coming up with a program and then actually producing the speakers), I’m not exactly an impartial observer. But hopefully the organizers of future industry tracks will benefit from my perspective as an organizer.

Last December (New Year’s Eve, to be precise), I started recruiting speakers. I started with a list of topics I wanted to see covered, and one of those topics was spam / adversarial information retrieval. My top two choices were Matt Cutts and Amit Singhal, both members of the Search Quality group at Google. I’d heard Amit speak before: he delivered one of the keynotes at ECIR 2008 (and inspired one of my first blog posts!). So I decided to aim for Matt Cutts, despite having no way to contact him (the head of Google’s Webspam team is understandably a bit protective of his personal email address). And, just two weeks later, I had Matt locked in to the program.

Matt was an incredible speaker, and he had the unenviable task of opening the Industry Track at 8:30 AM, the morning after the banquet. His title, “WebSpam and Adversarial IR: The Road Ahead”, gave him a fair amount of maneuvering room, and he used his 45 minutes to give the audience a peek into his world.

He opened the talk by inducing the audience to try to think like a spammer. He then gave examples of social engineering attacks, to put us in a “black hat” mindset. He also pointed out the danger of punishing sites with spammy inlinks: people and companies would use this knowledge against their competitors / enemies (the practice has been called “Google bowling”).

He then moved on to examples of spam techniques. He showed examples of pages whose spamminess is only detectable by parsing JavaScript, something I wasn’t aware that Google could do (though apparently this has been public knowledge for a while). The theoretical computer scientist in me wonders about using random self-reducibility as obfuscation on steroids, but hopefully spammers aren’t quite that sophisticated yet!

He offered a common-sense framework for fighting spam: reduce the return on investment. Unfortunately, he sees a trend in spam where spammers are aiming for faster, higher payoffs by hacking sites and installing malware. Indeed, the democratizing effect of social media means that a lot more people have pages that can serve spam, including their Twitter and Facebook pages. He invited the information retrieval community to invest effort in learning how to automatically detect that a page or server has been hacked.

My only quibble with the talk is that Matt did not discuss the inherent subjectivity of spam. Sure, there are many cases that are black and white, but ultimately spam (like relevance) is in the eye of the user. I’d love to see more use of techniques like attention bond mechanisms that accommodate a subjective definition of spam, e.g., “any email that you would rather have not received.”

But I quibble. Matt delivered an excellent talk to a packed audience, and it was a real privilege to have him kick off the Industry Track.

ps. You can also read Jeff Dalton’s notes on Matt’s presentation.

Categories
General

SIGIR 2009: Day 2, Albert-László Barabási’s Keynote

Albert-László Barabási is one of the biggest names in network theory, up there with Jon Kleinberg and Duncan Watts. Since he was the only one of those three whom I hadn’t met, I was thrilled to discover that he was giving a keynote at SIGIR 2009, entitled “From Networks to Human Behavior”.

Much of the keynote was a stump speech about the failure of Erdős–Rényi networks to explain real-world network phenomena and the surprising prevalence of scale-free networks that exhibit power law degree distributions (these are heavy-tailed distributions, unfortunately known to many as “long tail”). But it was an extremely compelling stump speech, and I found myself as mesmerized as those who were hearing this material for the first time. Barabási also pulled out some examples that I’d never seen before–in particular, that of a bot passing a sort of Turing test by emulating the bursty pattern of human-generated traffic. He also tried to explain competitive market dynamics (specifically Google vs. competitors) in terms of scale-free networks–an explanation that I thought was a bit of a stretch (the model included a fitness function vague enough for me to wonder if it was falsifiable), but nonetheless interesting. All in all, it was an excellent talk, especially given that it was a physicist talking about network theory to an audience of information retrieval researchers.

But what impressed me just as much was how Barabási handled questions. First, whatever information retrieval system his brain uses has incredible recall–he seemed to have a reference at his fingertips for any topic a questioner brought up. Second, he was incredibly warm, fielding questions that must have struck him as basic without showing even a hint of condescension.

Indeed, I asked a question that had plagued me for years. As Barabási explained, scale-free networks arise from preferential attachment–basically a pattern of network growth in which the rich (nodes) get richer (i.e., more edges). As a simple example, think of citation networks, like the World Wide Web. You’re more likely to cite (link to) pages that already have a lot of links, and hence there is a positive feedback loop. Yes, we’ll get to my question in a moment. Barabási further explained how scale-free networks are at once robust and vulnerable. Their connectedness is robust in the face of random failure: since the vast majority of nodes are those with small degree, a random failure is unlikely to take out a high-degree hub. But they are vulnerable to calculated attack, since removing the few hubs can have devastating consequences–an observation not lost on guerrillas or terrorists.
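
The growth rule and the robust-yet-vulnerable claim are easy to try out in code. What follows is a minimal sketch, not anything from the keynote: the function names, the graph size (2,000 nodes, 2 links per new node), the random seed, and the choice to remove 10% of the nodes are all assumptions picked for illustration.

```python
import random
from collections import deque

def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes. Each new node links to m distinct existing
    nodes, chosen with probability proportional to their current degree."""
    adj = {u: set() for u in range(n)}
    # Seed with a small clique of m + 1 nodes so every node has degree > 0.
    for u in range(m + 1):
        for v in range(u):
            adj[u].add(v)
            adj[v].add(u)
    # Each node appears here once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    endpoints = [u for u in range(m + 1) for _ in adj[u]]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for old in chosen:
            adj[new].add(old)
            adj[old].add(new)
            endpoints += [new, old]
    return adj

def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best

rng = random.Random(0)
graph = preferential_attachment(2000, 2, rng)
k = len(graph) // 10  # knock out 10% of the nodes

by_degree = sorted(graph, key=lambda u: len(graph[u]), reverse=True)
hubs = by_degree[:k]                  # calculated attack: highest-degree nodes
unlucky = rng.sample(sorted(graph), k)  # random failure: uniformly chosen nodes

after_failure = largest_component(graph, unlucky)
after_attack = largest_component(graph, hubs)
print("max degree:", len(graph[by_degree[0]]))
print("largest component after random failure:", after_failure)
print("largest component after hub attack:", after_attack)
```

In runs of this sketch, the graph should hold together far better after random failures than after losing its hubs, which is the behavior Barabási described.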

Yes, my question. I asked Barabási if the ubiquity of scale-free networks suggested that they had evolved because robustness in the face of random failure was a real-world fitness (in the Darwinian sense), and whether inter-species (or intra-species) competition would lead to the disappearance of those networks because of their vulnerability to calculated attack. In other words, were scale-free networks a transitional phase in our evolution as a global ecosystem?

Barabási’s answer surprised me: apparently one of his colleagues tried for two years to produce a compelling evolutionary argument for the ubiquity of scale-free networks, and failed to do so. Indeed, the best he could suggest is that scale-free networks, despite their obvious weaknesses, represent a good-enough solution from nature. Evolution satisfices.

I was stunned–both by the answer and by Barabási’s candor. I’d briefly worked on this problem myself as an amateur a few years ago (I wanted to reuse my graph drawing code!), but I assumed that the professionals had figured it all out. It’s humbling to know that nature and its tangled web of connections still hold deep mysteries that defy the intuition of their most astute observers.

ps. You can also read Jeff Dalton’s notes on the keynote.

Categories
General

SIGIR 2009: Day 2, Interactive Search Session

At the two previous SIGIR conferences that I attended, the interactive search sessions were the most interesting, and this one was no exception. Ironically, even though many of us (myself included) feel that interaction is marginalized within the SIGIR conference and even the information retrieval research community, the few interaction talks at SIGIR consistently draw large audiences and lots of questions. Of course, it couldn’t hurt this time that Sue Dumais received the Salton Award for her contributions to HCIR.

This particular session consisted of three talks.

Like Jeff Dalton, I really liked Peter Bailey’s talk on “Predicting User Interests from Contextual Information”, work done in collaboration with Liwei Chen and my HCIR co-conspirator Ryen White. They analyze the predictive performance of contextual information sources (interaction, task, collection, social, historic) for different temporal durations (short, medium, long). Like Jeff, I’m a bit surprised that they used the Open Directory Project for their evaluation, but I do find their results compelling–if not entirely surprising. And here is irony for you: despite my at-best ambivalence toward advertising, I’d love to see their analysis applied to ad targeting, which strikes me as the best way to test their approach. You can also read a more complete summary from Max Van Kleek at the Haystack blog.

Diane Kelly presented the second paper, “A Comparison of Query and Term Suggestion Features for Interactive Searching” (unfortunately not available online yet), work done with UNC co-authors Karl Gyllstrom and Earl Bailey. David Karger and Gene Golovchinsky have already blogged about this talk, so rather than summarize I’ll add my personal reaction: I hope that their future work will consider query previews as a way of increasing the value of suggestion features. Her UNC colleague Gary Marchionini has made extensive use of such previews in the RAVE project, and I think they should join forces. I’m also hoping to present some of what we at Endeca have been doing in this area at HCIR ’09.

Robert Villa presented the third paper, “An Aspectual Interface for Supporting Complex Search Tasks”, a team effort with University of Glasgow co-authors Iván Cantador, Hideo Joho, and Joemon Jose. Again, David and Gene have already blogged great summaries. My micro-summary: nice research, but unfortunately the results are inconclusive.

Given the evident interest in HCIR among SIGIR attendees, I hope the SIGIR community will make an effort to solicit and accept more papers in this area.