Categories
General

Micro vs. Macro Information Retrieval

The Probably Irrelevant blog has been quiet for a while, but I was happy to see a new post there by Miles Efron about “micro-IR”. He characterizes micro-IR, as distinct from macro or general IR, as follows:

  1. In ad hoc (text) IR a principal intellectual challenge lies in modeling ‘aboutness.’  In micro-IR settings, the creativity comes into play in posing a useful (and tractable) question to answer.  The engineering comes easily after that.
  2. The constrained nature of micro-IR applications leads to a lightweight articulation of information need.  There is a tight coupling here between task, query, and the unit of retrieval, a dynamic that I think is compelling.  Pushing this a bit farther, we might consider the simple act of choosing to use a particular application from those apps on a user’s palette as part of the information need expression.
  3. The tight coupling of task to data to ‘query’ enables a strong contextual element to inform the interaction.  Context constitutes the foreground of the micro-IR interaction.

He then asks: “is micro-IR something at all?  Is it actually related to IR?” Fernando Diaz answers that “the only difference between micro and macro IR is text.” Jinyoung Kim adds that in micro-IR “the context (searcher goal) is known, with domain-specific notion of relevance (goodness) and similarity measures.”

I hadn’t thought of making this particular distinction, but I like it. While I prefer to think about distinguishing the needs of information seekers–rather than the characteristics of search applications–I would be the first to argue that a well-designed search application caters to particular user needs. Indeed, I think the definition of a good micro-IR application implies that it addresses a highly constrained space of information needs. Just as importantly, micro-IR applications can often assume that their users are highly familiar with the information space the applications address, and thus that those users need less of the basic orienteering support that can be critical for success using macro-IR systems. That said, micro-IR users have (or should have) higher expectations of support for more sophisticated information seeking.

The other day, I speculated about why Google holds back on faceted search. I feel that the distinction between macro- and micro-IR is in the same vein: micro-IR settings (e.g., site search, enterprise search, vertical search) drive the need for richer interfaces and support for interaction, while macro-IR application developers (e.g., general web search) worry mostly about producing a reasonable answer for the query–and often lead users to micro-IR destinations that offer their own support for information seeking within their constrained domains.

In short, it’s a nice way to think about the IR application space, and it’s increasingly relevant (no pun intended!) as we see a proliferation of micro-IR applications. And it’s great to see activity on the Probably Irrelevant blog after all these months of radio silence!

Categories
Uncategorized

Yahoo on Key Scientific Challenges in Search and Machine Learning

Like many folks, I’ve assumed that Yahoo’s partnership with Bing–assuming it is approved–offers the best chance of validating CEO Carol Bartz’s claim that Yahoo has “never been a search company”. She may not be able to change the past, but she certainly is making up for lost time. To be clear, I agree with her 100% that Yahoo should have accepted Microsoft’s $40B acquisition offer last year–in her words, “Sure, do you think I’m stupid?” But I’m still struggling to understand the rationale behind the deal Yahoo did accept.

In any case, Yahoo researchers haven’t stopped thinking about search. As Jeff Dalton reports, Yahoo recently issued a press release about its Key Scientific Challenges Summit. Jeff was kind enough to post Henry Feild’s notes about the presentations by Andrew Tomkins on search and by Sathiya Keerthi Selvaraj on machine learning. I’d love to hear more detail about how they perceive (and hope to address) the search challenges of optimizing task-aware relevance and measuring / predicting user engagement.

Regardless of Yahoo’s fate, I’m certainly glad that there are still people at Yahoo working on these big problems. I hope they find a way to develop solutions and bring those solutions to the users who need them.

Categories
Uncategorized

The Ethics of Blogging

A few people have commented that the events I advertise here tend to be expensive–or, worse, require a lot of work to get into! So I’m glad to announce a freebie that I hope will be as much fun for me as for attendees.

I’ve been invited to participate in a webinar on the ethics of blogging that will take place Thursday, September 24th at 1 PM EST. It’s free to attend; just register online here.

Maggie Fox, founder and CEO of Social Media Group, will moderate. My two co-panelists are Augie Ray, who blogs at Experience: The Blog and is Managing Director of Experiential Marketing at interactive and social media agency Fullhouse, and John Jantsch, who blogs at Duct Tape Marketing and is a marketing and digital technology coach.

Among the topics to be discussed:

  • Transparency: How and when should a blogger reveal revenue sources?
  • Pay for play: Blog posts, tweets, and more as marketing tools
  • Online privacy
  • Astroturfing: Organizations creating artificial “grassroots” campaigns
  • Compliance and Legal: What should a corporate blog policy look like? What are a blogger’s legal obligations?

I hope some of you will be able to attend! Regardless, please use the comment thread here to make suggestions about topics you’d like me to cover or concerns you’d like to see me address. I know that a lot of you have thought hard about these issues, and I’d like to ethically exploit your collective wisdom.

Categories
Uncategorized

CIKM 2009 Accepted Papers

The two biggest academic conferences for information retrieval are SIGIR and CIKM (a site which, sadly, is still hacked). Hopefully some of you enjoyed my coverage of SIGIR 2009–or, better yet, attended and experienced it for yourselves.

Anyway, thanks to Jeff Dalton for alerting me that the CIKM 2009 accepted papers list is now available. I don’t plan to make it to Hong Kong this November, but I hope that those who do are kind enough to blog about it!

Also, I see mention of an industry track, but not of an Industry Event like the widely acclaimed one held at CIKM 2008–which inspired my own organization of the SIGIR 2009 Industry Track. I’m curious whether such industry events will prove to be one-time phenomena or will become a staple of these conferences. I hope for the latter, but I am admittedly biased, given my industry-centric perspective.

Categories
General

Not All Google Critics Are Bigots

Jeff Jarvis wrote a post today entitled “Google bigotry”, in which he asserts that:

Google has an image problem – not a PR problem (that is, not with the public) but a press problem (with whining old media people).

He then goes on to launch a tirade against a Le Monde journalist whose offense was to say she was writing “an article about Google facing a rising tide of discontent concerning privacy and monopoly.” He proceeds to stereotype the French as suffering from a “national insanity” of Google bigotry. I’ll leave the analysis of irony as an exercise for the reader.

But the true irony is that Jarvis has a point. While I haven’t done a rigorous analysis, my impression is that there has been a sensationalist press overreaction against Google, singling it out for behavior for which all other companies get a pass. As even one of the most vocal Google critics admits, “Google’s [privacy] policies are essentially no different than the policies of Microsoft, Yahoo, Alexa and Amazon.” Moreover, some of the newspapers criticizing Google as parasitic are the same ones that once turned–and still turn–to Google with open arms as a source of traffic, when they could easily cut Google off by configuring robots.txt. Granted, the newspapers are now locked into a prisoner’s dilemma, but they should at least take some responsibility for putting themselves in that position.
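Lest that sound hypothetical: the robots exclusion protocol makes opting out a two-line affair. A newspaper that truly wanted to shut Google out could serve a robots.txt like this (standard robots-exclusion syntax, shown purely for illustration):

```
User-agent: Googlebot
Disallow: /
```

The first line addresses Google’s crawler specifically; the second bars it from the entire site. That no newspaper does this tells you something about where the traffic really flows.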

That said, there are lots of legitimate reasons to criticize Google, specifically concerning privacy and monopoly. While Google may not have engaged in any illegal or unethical practices to get there, it now holds a position as the primary gatekeeper to the internet for a substantial majority of Americans, as well as much of the western world. On the content creation side, site owners don’t ask “What Would Google Do?“–rather they ask how Google will index their sites. Meanwhile, on the consumption side, the broadening scope of Google’s role in ordinary people’s lives is legitimate cause for concern about privacy. It’s not insane or bigoted to raise these issues.

Moreover, Google claims to hold itself to a higher standard than other companies, so it’s not that surprising that people actually do hold it to that standard and criticize it when it falls short. Still, that’s no excuse for exaggeration or outright hallucination.

As I commented on Jarvis’s blog, I don’t think he’s the most credible judge of Google’s critics. He responded in kind. Touché. I accept that exchanging personal attacks doesn’t advance the argument. Perhaps more detached voices can chime in.

Categories
Uncategorized

John Battelle: “I don’t know what to ask about”

John Battelle has a pair of posts on BingTweets (yes, I know, horrible name) entitled “Decisions Are Never Easy – So Far”. In his second post, he sums up the problem with conventional search engines in a nutshell: “I don’t know what to ask about”. His framing of the need for a “decision engine” is a bit too obvious a nod to his sponsor, but he is nonetheless right to call for information seeking support tools based on HCIR.

Categories
General

HCIR: Better Than Magic!

I’m a big fan of using machine learning and automated information extraction to improve search performance and generally support information seeking. I’ve had some very good experiences with both supervised (e.g., classification) and unsupervised (e.g., terminology extraction) learning approaches, and I think that anyone today who is developing an application to help people access text documents should at least give serious consideration to both kinds of algorithmic approaches. Sometimes automatic techniques work like magic!

But sometimes they don’t. Netbase‘s recent experience with HealthBase is, unfortunately, a case study in why you shouldn’t have too much faith in magic. As Jeff Dalton noted, the “semantic search” is hit-or-miss. The hits are great, but it’s the misses that generate headlines like this one in TechCrunch: “Netbase Thinks You Can Get Rid Of Jews With Alcohol And Salt”. Ouch.

It seems unfair to single out Netbase for a problem endemic to fully automated approaches, but they did invite the publicity. It would be easy to dig up a host of other purely automated approaches that are just as embarrassing, if less publicized.

Dave Kellogg put it well (if a bit melodramatically) when he characterized this experience as a “tragicomedy” that reveals the perils of magic. His argument, in a nutshell, is that you don’t want to be completely dependent on an approach for which 80% accuracy is considered good enough. As he says, the problem with magic is that it can fail in truly spectacular ways.

Granted, there’s a lot more nuance to using automated content enrichment approaches. Some techniques (or implementations of general techniques) optimize for precision (i.e., minimizing false positives), while others optimize for recall (i.e., minimizing false negatives). Supervised techniques are generally more conservative than unsupervised ones: you might incorrectly assert that a document is about disease, but that’s less dramatic a failure than adding the word “Jews” to an automatically extracted medical vocabulary. In general, the more human input into the process, the more opportunity to improve effectiveness and avoid embarrassing mistakes.
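To make the precision/recall trade-off concrete, here’s a minimal sketch (the function and the toy term sets are my own illustration, not Netbase’s actual extractor):

```python
def precision_recall(predicted, relevant):
    """Precision: fraction of predicted items that are correct.
    Recall: fraction of relevant items that were actually found."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A precision-oriented extractor keeps only high-confidence terms;
# a recall-oriented one keeps more terms and accepts more false positives.
conservative = precision_recall({"insulin"},
                                {"insulin", "glucose", "metformin"})
aggressive = precision_recall({"insulin", "glucose", "salt", "alcohol"},
                              {"insulin", "glucose", "metformin"})
```

The conservative extractor never embarrasses you, but misses most of the vocabulary; the aggressive one finds more of it at the cost of howlers like “salt”.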

Of course, the whole point of automation is to reduce the need for human input. Human labor is a lot more expensive than machine labor! But there’s a big difference between the mirage of eliminating human labor and the realistic aspiration to make its use more efficient and effective. That’s what human-computer information retrieval (HCIR) is all about, and all of the evidence I’ve encountered confirms that it’s the right way to crack this nut. Look for yourselves at the proceedings of HCIR ’07 and ’08. Having just read through all of the submissions to HCIR ’09, I can tell you that the state of the art keeps getting better.

Interestingly, even Google CEO Eric Schmidt may be getting around to drinking the kool-aid. In an interview published today in TechCrunch, he says: “We have to get from the sort of casual use of asking, querying…to ‘what did you mean?’.” Unfortunately, he then goes into science-fiction-AI land and seems to end up suggesting a natural language question-answering approach like Wolfram Alpha. Still, at least his heart is in the right place.

Anyway, as they say, experience is the best teacher. Hopefully Netbase can recover from what could generously be called a public relations hiccup. But, as the aphorism continues, it is only the fool that can learn from no other. Let’s not be fools–and instead take away the moral of this story: instead of trying to automate everything, optimize the division of labor between human and machine. HCIR.

Categories
Uncategorized

Another Project to Measure Twitter Influence

Just noticed that the Web Ecology Project has published “The Influentials: New Approaches for Analyzing Influence on Twitter”. The blog post includes a link to their full report.

Their approach strikes me as a generalization of measuring retweets, but perhaps I’m giving it too cursory a read. I did compare their results to TunkRank: we at least agree that mashable is more influential than CNN–though even as simple a measure as follower count would confirm that judgment.
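For readers unfamiliar with TunkRank, the core idea is a recursive influence score over the follower graph, in the spirit of PageRank. The sketch below is my own simplification, not the canonical definition: the damping constant p, the fixed iteration count, and the toy graph are all assumptions for illustration.

```python
def influence(followers, following_counts, p=0.5, iterations=50):
    """followers[u]: the set of users who follow u.
    following_counts[f]: how many accounts f follows.
    A follower's endorsement is diluted by how many accounts
    she follows, since her attention is divided among them."""
    score = {u: 1.0 for u in followers}
    for _ in range(iterations):
        score = {u: sum((1.0 + p * score[f]) / following_counts[f]
                        for f in followers[u])
                 for u in followers}
    return score

# Toy graph: b and c follow a; c also follows b; nobody follows c.
followers = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
following_counts = {"a": 0, "b": 1, "c": 2}
scores = influence(followers, following_counts)
```

Intuitively, a follower counts for more when she follows few accounts, and her own influence propagates upstream with probability p–which is why this is more than a generalization of follower counting.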

Anyway, I am delighted to see serious researchers looking at this problem. I’m still hoping to investigate hypotheses regarding TunkRank and friend:follower ratios.

Categories
Uncategorized

Great Series of Posts on Medical Literature Search

Gene Golovchinsky at FXPAL has written a great series of posts on medical literature search, specifically looking at how MeSH (Medical Subject Headings) has been used to augment conventional text search, and whether its use improves the overall effectiveness of information seeking.

Here are the posts:

Even if you’re not specifically interested in medical literature search, I recommend checking these posts out. Much of the interesting work on information seeking is taking place in specialized domains like this one, where getting it right promises far greater returns than incremental improvements to general web search.