Author: Daniel Tunkelang

High-Class Consultant.

Yahoo on Key Scientific Challenges in Search and Machine Learning

Post author By Daniel Tunkelang
Post date September 11, 2009
2 Comments on Yahoo on Key Scientific Challenges in Search and Machine Learning

Like many folks, I’ve assumed that Yahoo’s partnership with Bing–assuming it is approved–offers the best chance of validating CEO Carol Bartz’s claim that Yahoo has “never been a search company”. She may not be able to change the past, but she certainly is making up for lost time. To be clear, I agree with her 100% that Yahoo should have accepted Microsoft’s $40B acquisition offer last year–in her words, “Sure, do you think I’m stupid?” But I’m still struggling to understand the rationale behind the deal Yahoo did accept.

In any case, Yahoo researchers haven’t stopped thinking about search. As Jeff Dalton reports, Yahoo recently issued a press release about its Key Scientific Challenges Summit. Jeff was kind enough to post Henry Feild’s notes about the presentations by Andrew Tomkins on search and by Sathiya Keerthi Selvaraj on machine learning. I’d love to hear more detail about how they perceive (and hope to address) the search challenges of optimizing task-aware relevance and of measuring, predicting, and generating user engagement.

Regardless of Yahoo’s fate, I’m certainly glad that there are still people at Yahoo working on these big problems. I hope they find a way to develop solutions and bring those solutions to the users who need them.

Uncategorized

The Ethics of Blogging

A few people have commented that the events I advertise here tend to be expensive–or, worse, require a lot of work to get into! So I’m glad to announce a freebie that I hope will be as much fun for me as for attendees.

I’ve been invited to participate in a webinar on the ethics of blogging that will take place Thursday, September 24th at 1 PM EST. It’s free to attend; just register online here.

Maggie Fox, founder and CEO of Social Media Group, will moderate. My two co-panelists are Augie Ray, who blogs at Experience: The Blog and is Managing Director of Experiential Marketing at interactive and social media agency Fullhouse, and John Jantsch, who blogs at Duct Tape Marketing and is a marketing and digital technology coach.

Among the topics to be discussed:

Transparency: How and when should a blogger reveal revenue sources?
Pay for play: Blog posts, tweets, and more as marketing tools
Online privacy
Astroturfing: Organizations creating artificial “grassroots” campaigns
Compliance and Legal: What should a corporate blog policy look like? What are a blogger’s legal obligations?

I hope some of you will be able to attend! Regardless, please use the comment thread here to suggest topics you’d like me to cover or concerns you’d like to see me address. I know that a lot of you have thought hard about these issues, and I’d like to ethically exploit your collective wisdom.

Uncategorized

CIKM 2009 Accepted Papers

The two biggest academic conferences for information retrieval are SIGIR and CIKM (a site which, sadly, is still hacked). Hopefully some of you enjoyed my coverage of SIGIR 2009–or, better yet, attended and experienced it for yourselves.

Anyway, thanks to Jeff Dalton for alerting me that the CIKM 2009 accepted papers list is now available. I don’t plan to make it to Hong Kong this November, but I hope that those who do are kind enough to blog about it!

Also, I see mention of an industry track, but not of an Industry Event like the widely acclaimed one held at CIKM 2008–which inspired my own organization of the SIGIR 2009 Industry Track. I’m curious whether such industry events will prove to be one-time phenomena or will become a staple of these conferences. I hope for the latter, but I am admittedly biased, given my industry-centric perspective.

General

Not All Google Critics Are Bigots

Post author By Daniel Tunkelang
Post date September 5, 2009
14 Comments on Not All Google Critics Are Bigots

Jeff Jarvis wrote a post today entitled “Google bigotry”, in which he asserts that:

Google has an image problem – not a PR problem (that is, not with the public) but a press problem (with whining old media people).

He then goes on to launch a tirade against a Le Monde journalist whose offense was to say she was writing “an article about Google facing a rising tide of discontent concerning privacy and monopoly.” He proceeds to stereotype the French as having a “national insanity” of Google bigotry. I’ll leave analysis of the irony as an exercise to the reader.

But the true irony is that Jarvis has a point. While I haven’t done a rigorous analysis, my impression is that there has been a sensationalist press overreaction against Google, singling out Google for behavior for which all other companies get a pass. As even one of the most vocal Google critics admits, “Google’s [privacy] policies are essentially no different than the policies of Microsoft, Yahoo, Alexa and Amazon.” Moreover, some of the newspapers criticizing Google as parasitic are the same ones who once turned–and still turn–to Google with open arms as a source of traffic–when they could easily cut Google off by configuring robots.txt. Granted, the newspapers are now locked into a prisoner’s dilemma, but they should at least take some responsibility for putting themselves in that position.
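To make the robots.txt point concrete, here is a minimal sketch of what “cutting Google off” could look like, checked with Python’s standard `urllib.robotparser`. The rules and the `news.example.com` URL are hypothetical illustrations, not any newspaper’s actual configuration, and `can_crawl` is a helper name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Google's crawler, admit everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://news.example.com/story"  # assumed URL, for illustration
    print("Googlebot allowed:", can_crawl(ROBOTS_TXT, "Googlebot", story))
    print("Other bot allowed:", can_crawl(ROBOTS_TXT, "SomeOtherBot", story))
```

Two lines of configuration are all it takes, which is why the complaint is less about capability than about the prisoner’s dilemma: any one paper that opts out simply cedes its traffic to the papers that stay in.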

That said, there are lots of legitimate reasons to criticize Google, specifically concerning privacy and monopoly. While Google may not have engaged in any illegal or unethical practices to get there, it now holds a position as the primary gatekeeper to the internet for a substantial majority of Americans, as well as much of the western world. On the content creation side, site owners don’t ask “What Would Google Do?”–rather they ask how Google will index their sites. Meanwhile, on the consumption side, the broadening scope of Google’s role in ordinary people’s lives is legitimate cause for concern about privacy. It’s not insane or bigoted to raise these issues.

Moreover, Google claims to hold itself to a higher standard than other companies, so it’s not that surprising that people actually do hold it to that standard and criticize it when it falls short. Still, that’s no excuse for exaggeration or outright hallucination.

As I commented on Jarvis’s blog, I don’t think he’s the most credible judge of Google’s critics. He responded in kind. Touché. I accept that exchanging personal attacks doesn’t advance the argument. Perhaps more detached voices can chime in.

Uncategorized

John Battelle: “I don’t know what to ask about”

Post author By Daniel Tunkelang
Post date September 5, 2009

John Battelle has a pair of posts on BingTweets (yes, I know, horrible name) entitled “Decisions Are Never Easy – So Far”. In his second post, he sums up the problem with conventional search engines in a nutshell: “I don’t know what to ask about”. His description of the need for a “decision engine” is a bit too obvious a nod to his sponsor, but he is nonetheless right in calling for information seeking support tools based on HCIR.

General

HCIR: Better Than Magic!

I’m a big fan of using machine learning and automated information extraction to improve search performance and generally support information seeking. I’ve had some very good experiences with both supervised (e.g., classification) and unsupervised (e.g., terminology extraction) learning approaches, and I think that anyone today who is developing an application to help people access text documents should at least give serious consideration to both kinds of algorithmic approaches. Sometimes automatic techniques work like magic!

But sometimes they don’t. Netbase‘s recent experience with HealthBase is, unfortunately, a case study in why you shouldn’t have too much faith in magic. As Jeff Dalton noted, the “semantic search” is hit-or-miss. The hits are great, but it’s the misses that generate headlines like this one in TechCrunch: “Netbase Thinks You Can Get Rid Of Jews With Alcohol And Salt”. Ouch.

It seems unfair to single out Netbase for a problem endemic to fully automated approaches, but they did invite the publicity. It would be easy to dig up a host of other purely automated approaches that are just as embarrassing, if less publicized.

Dave Kellogg put it well (if a bit melodramatically) when he characterized this experience as a “tragicomedy” that reveals the perils of magic. His argument, in a nutshell, is that you don’t want to be completely dependent on an approach for which 80% accuracy is considered good enough. As he says, the problem with magic is that it can fail in truly spectacular ways.

Granted, there’s a lot more nuance to using automated content enrichment approaches. Some techniques (or implementations of general techniques) optimize for precision (i.e., minimizing false positives), while others optimize for recall (i.e., minimizing false negatives). Supervised techniques are generally more conservative than unsupervised ones: you might incorrectly assert that a document is about disease, but that’s less dramatic a failure than adding the word “Jews” to an automatically extracted medical vocabulary. In general, the more human input into the process, the more opportunity to improve the effectiveness and avoid embarrassing mistakes.
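For readers who want the precision/recall trade-off spelled out, here is a small sketch. The vocabulary and the two extractor outputs are made up for illustration; they are not drawn from HealthBase or any real system.

```python
def precision_recall(extracted, relevant):
    """Compute (precision, recall) of an extracted term set against a gold set."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision: what fraction of what we extracted is actually correct?
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct terms did we manage to extract?
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical gold-standard medical vocabulary.
GOLD = {"aspirin", "ibuprofen", "insulin", "statin"}

# A conservative extractor: few terms, all of them right.
CONSERVATIVE = {"aspirin", "insulin"}

# An aggressive extractor: finds everything, plus one term that doesn't belong.
AGGRESSIVE = {"aspirin", "ibuprofen", "insulin", "statin", "table"}

if __name__ == "__main__":
    print("conservative:", precision_recall(CONSERVATIVE, GOLD))
    print("aggressive:  ", precision_recall(AGGRESSIVE, GOLD))
```

The aggressive extractor scores perfect recall and 80% precision, which sounds respectable until you remember that the remaining 20% is exactly where the headline-generating mistakes live.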

Of course, the whole point of automation is to reduce the need for human input. Human labor is a lot more expensive than machine labor! But there’s a big difference between the mirage of eliminating human labor and the realistic aspiration to make its use more efficient and effective. That’s what human-computer information retrieval (HCIR) is all about, and all of the evidence I’ve encountered confirms that it’s the right way to crack this nut. Look for yourselves at the proceedings of HCIR ’07 and ’08. Having just read through all of the submissions to HCIR ’09, I can tell you that the state of the art keeps getting better.

Interestingly, even Google CEO Eric Schmidt may be getting around to drinking the kool-aid. In an interview published today in TechCrunch, he says: “We have to get from the sort of casual use of asking, querying…to ‘what did you mean?’.” Unfortunately, he then goes into science-fiction-AI land and seems to end up suggesting a natural language question-answering approach like Wolfram Alpha. Still, at least his heart is in the right place.

Anyway, as they say, experience is the best teacher. Hopefully Netbase can recover from what could generously be called a public relations hiccup. But, as the aphorism continues, it is only the fool that can learn from no other. Let’s not be fools–and instead take away the moral of this story: instead of trying to automate everything, optimize the division of labor between human and machine. HCIR.

Uncategorized

Another Project to Measure Twitter Influence

Post author By Daniel Tunkelang
Post date September 3, 2009

Just noticed that the Web Ecology Project has published “The Influentials: New Approaches for Analyzing Influence on Twitter”. The blog post includes a link to their full report.

Their approach strikes me as a generalization of measuring retweets, but perhaps I’m giving it too cursory a read. I did compare their results to TunkRank: we at least agree that mashable is more influential than CNN–though even as simple a measure as follower count would confirm that judgment.

Anyway, I am delighted to see serious researchers looking at this problem. I’m still hoping to investigate hypotheses regarding TunkRank and friend:follower ratios.

Uncategorized

Great Series of Posts on Medical Literature Search

Post author By Daniel Tunkelang
Post date September 2, 2009
2 Comments on Great Series of Posts on Medical Literature Search

Gene Golovchinsky at FXPAL has written a great series of posts on medical literature search, specifically looking at how MeSH (Medical Subject Headings) has been used to augment conventional text search, and whether its use improves the overall effectiveness of information seeking.

Here are the posts:

Even if you’re not specifically interested in medical literature search, I recommend you check these posts out. Much of the interesting work on information seeking is taking place in specialized domains like this one, where the value of getting it right offers far more promising returns than incremental improvements to general web search.

General

Finding, Locating, Discovering

Post author By Daniel Tunkelang
Post date August 31, 2009
13 Comments on Finding, Locating, Discovering

Thanks to Tony Hollingsworth for alerting me to a post by Alex Campbell entitled “Stark realisation: I no longer depend on Google to find stuff”. The title is provocative link bait, but the take-away is very down to earth: Google is more useful for locating information than for discovering it.

Library scientists make a distinction between known-item and exploratory search. The former is about locating information: as an information seeker, you know the information exists, and you can even characterize it unambiguously; but the challenge is to convert that description into a location that allows you to retrieve the information. The latter is about discovery: you don’t know that the information you seek exists, and you may not be sure how to characterize what you are looking for–or even know exactly what you want–until you’ve learned something about what is available.

These are extreme points on the information seeking spectrum, and most real-world tasks are in the middle, or combine subtasks of both types. For example, in physical libraries (yes, I’m that old!), I remember finding a book in the stacks and then browsing the nearby books in the hopes of serendipitous discovery. These days, I’d be more likely to scan its bibliography–or to look at the books and articles citing it. A known item can be an excellent entry point for exploration. Conversely, exploration can lead you to discover the existence of information that you then simply need to retrieve.

In common use, words like searching and finding cover this entire spectrum of information seeking activity. This breadth of meaning causes a lot of confusion. I’ve blogged about this before in “What is (Not) Search?”:

At the very least, I propose that we distinguish “search” as a problem from “search” as a solution. By the former, I mean the problem of information seeking, which is traditionally the domain of library and information scientists. By the latter, I mean the approach most commonly associated with information retrieval, in which a user enters a query into the system (typically as free text) and the system returns a set of objects that match the query, perhaps with different degrees of relevancy.

Back to Campbell’s article. His main points:

Social networks have dramatically expanded our network of contacts.
Search engine optimization (SEO) experts have killed their own game.
The flow of information has changed: information now comes to us, rather than us having to go out and find it.

I like the spirit of the post, but I think he overstates his case. SEO isn’t all bad–in fact, it’s probably a key factor in Google’s effectiveness. And, while social networks enable social search in theory, and information does come to us, we are experiencing filter failure (Clay Shirky’s term) in a big way.

My conclusion: I agree with him about Google’s limitations–Google is primarily a locating tool, not a discovery tool. Unfortunately, I’m not persuaded that social networks and our theoretical ability to construct an ideal in-flow of information have actually delivered on the promise of more efficient information access. But I’m optimistic that we’ll eventually get there.

Uncategorized

Blogs I Read: Chris Dixon (cdixon.org)

Post author By Daniel Tunkelang
Post date August 30, 2009
5 Comments on Blogs I Read: Chris Dixon (cdixon.org)

I’ve started reading a few different blogs in the past months, and one that I particularly like is Chris Dixon’s, which has the simple (if uncreative) title cdixon.org.

Chris has an interesting history that includes heading R&D at a hedge fund, co-founding SiteAdvisor, investing in a number of technology companies (including Skype and Postini), and most recently co-founding Hunch (which I’ve blogged about here a few times). As a karaoke junkie, I can’t help noting that he developed the software that became MySpace Karaoke.

Not surprisingly, Chris brings the combined perspective of an investor and a technologist to his blog. Here are some examples of recent posts that illustrate his range.

Thoughts on machine learning:

Career advice for entrepreneurs:

And of course he occasionally blogs about Hunch, his current venture.

Chris has a strong personality that comes through as a blogger. I think that’s critical for making a blog both informative and entertaining, and I try to channel my own personality (which I’m told, for better or worse, is quite distinctive) through this blog.

In short, check out cdixon.org if you’re interested in the perspective of a practical (and successful) technologist-entrepreneur.