Category: Uncategorized

The Long Tail of Search

The “long tail” is one of the most abused buzzwords of recent years, and I hesitate to use it myself in respectable company.

Nonetheless, SEO veteran Dustin Woodard has a nice guest post at the Hitwise Intelligence blog entitled “Sizing Up the Long Tail of Search”. Here are some statistics he cites about the distribution of search term frequency for web search data collected by Hitwise:

Top 100 terms: 5.7% of all search traffic
Top 500 terms: 8.9% of all search traffic
Top 1,000 terms: 10.6% of all search traffic
Top 10,000 terms: 18.5% of all search traffic
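
To make the size of the tail explicit, here is a minimal Python sketch that turns the cumulative figures quoted above into per-band and tail shares. The function names (`tail_share`, `band_shares`) are my own for illustration; only the four percentages come from the Hitwise post.

```python
# Cumulative share of search traffic captured by the top-N terms,
# as quoted above (percent of all search traffic).
TOP_N_SHARE = {100: 5.7, 500: 8.9, 1_000: 10.6, 10_000: 18.5}


def tail_share(head_share_pct):
    """Percent of traffic that falls outside a head covering head_share_pct."""
    return round(100.0 - head_share_pct, 1)


def band_shares(cumulative):
    """Split cumulative top-N shares into (first_rank, last_rank, share) bands."""
    bands = []
    prev_n, prev_share = 0, 0.0
    for n, share in sorted(cumulative.items()):
        # Traffic contributed by terms ranked prev_n+1 .. n.
        bands.append((prev_n + 1, n, round(share - prev_share, 1)))
        prev_n, prev_share = n, share
    return bands


if __name__ == "__main__":
    for first, last, share in band_shares(TOP_N_SHARE):
        print(f"Terms ranked {first:,} to {last:,}: {share}% of traffic")
    print(f"Everything past the top 10,000: {tail_share(TOP_N_SHARE[10_000])}%")
```

In other words, by these figures the terms outside the top 10,000 account for the remaining 81.5% of search traffic.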

It’s nice to see concrete data to validate conventional wisdom. Of course, I’d be curious to see the corresponding distribution of ad revenue associated with terms.

IRF Symposium on Patent Retrieval

Post author By Daniel Tunkelang
Post date November 7, 2008

Thanks to Jeff for writing up notes on the annual IR Facility Symposium 2008. Related links:

Daniel Lemire on What Makes Database Indexes Work

Post author By Daniel Tunkelang
Post date November 7, 2008

Daniel Lemire has a great post today entitled “Understanding what makes database indexes work”. There’s nothing that should be surprising for folks who live and breathe this stuff, but it’s a great introduction for those who don’t. Here are his bullet points:

You expect specific queries: restructure your data!
You expect specific queries: materialize them!
You expect specific queries: redundancy is (sometimes) your friend
Use multiresolution!
Your data is not random: compress it!
In any case: optimize your code

Read his post to get the details.
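
As a rough illustration of the “your data is not random: compress it” point, here is a minimal Python sketch of delta (gap) encoding for a sorted list of row identifiers. This is not taken from Lemire’s post; it is one common technique I’m assuming fits the bullet, and the function names are hypothetical.

```python
def delta_encode(sorted_ids):
    """Replace each ID in an ascending list with its gap from the previous ID."""
    gaps = []
    prev = 0
    for value in sorted_ids:
        gaps.append(value - prev)  # small when IDs are clustered
        prev = value
    return gaps


def delta_decode(gaps):
    """Rebuild the original ascending IDs by taking a running sum of the gaps."""
    ids = []
    total = 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


if __name__ == "__main__":
    row_ids = [1000, 1003, 1004, 1010]
    encoded = delta_encode(row_ids)
    print(encoded)
    print(delta_decode(encoded) == row_ids)
```

Because the IDs are sorted and clustered, most gaps are small numbers, which a variable-length integer code can then store in fewer bytes than the raw IDs.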

No Correlation Between Reading Difficulty and Popularity?

Post author By Daniel Tunkelang
Post date November 6, 2008

Paul Ogilvie just started blogging at mSpoke, and his first post asks “What makes a blog post popular? Part I: Comparing popularity and reading difficulty”. Specifically, he explores “whether well-written feed items are more likely to receive attention than poorly-written ones”. At the risk of stealing his thunder, I’ll deliver the punchline: he found no correlations between surface features of reading difficulty and popularity. Fortunately, he’s not planning to give up on writing quality!

Like Paul, I find that the absence of correlation goes against common sense. I’m curious whether the problem is the measures he’s using (which he admits are crude), or other factors that confound the popularity statistics.

Via Jon Elsas.

Knowledge Management is a Process

Post author By Daniel Tunkelang
Post date November 4, 2008

Kudos to Lynda Moulton at the Enterprise Search Practice Blog for a post entitled “Apples and Orangutans: Enterprise Search and Knowledge Management”. She criticises some commentary in CIO Magazine that “search is being implemented in enterprises as the new knowledge management”. Her thesis in a nutshell is that “knowledge management (KM) is not now, nor has it ever been, a software product or even a suite of products”. If that’s not enough to get you to read the full article, here’s an excerpt:

Because I follow enterprise search for the Gilbane Group while maintaining a separate consulting practice in knowledge management, I am struggling with his conflation of the two terms or even the migration of one to the other. The search we talk about is a set of software technologies that retrieve content. I’m tired of the debate about the terminology “enterprise search” vs. “behind the firewall search.” I tell vendors and buyers that my focus is on software products supporting search executed within (or from outside looking in) the enterprise on content that originates from within the enterprise or that is collected by the enterprise. I don’t judge whether the product is for an exclusive domain, content type or audience, or whether it is deployed with the “intent” of finding and retrieving every last scrap of content lying around the enterprise. It never does nor will do the latter but if that is what an enterprise aspires to, theirs is a judgment call I might help them re-evaluate in consultation.

MIT Talk Now Available Online

Post author By Daniel Tunkelang
Post date November 4, 2008

I haven’t gotten to look at it from my wi-fi connection on the Amtrak Acela (which is clearly a work in progress, but nonetheless a very exciting development), but here’s a link to the video.

Note that you’ll have to install Microsoft Silverlight to view it.

Alternatively, you can watch the slideshow below, but I’m not sure how much sense it will make without the voice-over.

http://static.slideshare.net/swf/ssplayer2.swf?doc=set-retrieval-20-1225903234513602-9&stripped_title=set-retrieval-20-presentation

Google Defends Its Appliance, Part 2

Post author By Daniel Tunkelang
Post date November 4, 2008

Ron Miller at Fierce Content Management has published an interview with Google Search Appliance product manager Nitin Mangtani to “ask some questions of [his] own about [Mangtani’s] reason for publishing the Forbes piece and his rebuttal to Google Search Appliance critics.” I’m still not convinced by Mangtani’s relevance-centric pitch, but I’ll let you read it and draw your own conclusions. I’ve reached out to Ron Miller and look forward to talking with him about a more HCIR-centric vision of enterprise search.

Giving a Talk at MIT Tomorrow

Post author By Daniel Tunkelang
Post date November 4, 2008

My traffic reports tell me that a number of readers are in the Boston/Cambridge area. For those who are interested, I’m giving a talk tomorrow at MIT:

Set Retrieval 2.0

Speaker: Daniel Tunkelang, Endeca

Date: Tuesday, November 4, 2008

Time: 11:00AM to 12:00PM

Refreshments: 10:45AM

Location: Patil Conference Room (32-G449)

Host: Rob Miller, MIT CSAIL

The abstract is available here. The talk is open to the public, and is part of a seminar series designed to attract a mixed audience of academic and industry types. The talk is being recorded, and I’ll post a link to the video when it is available.

CAPTCHA Me If You Can

Post author By Daniel Tunkelang
Post date November 4, 2008

An article in ComputerWorld today notes a recent report from security vendor MessageLabs that the proliferation of CAPTCHA-breaking tools has resulted in a surge of spam. Worse:

The report paints a cascade effect, which has allowed hole in CAPTCHA to let criminals set up large numbers of fake blogs and content, which are then used to feed bogus profiles to social networking systems. Messages and requests from these domains are a simple way around reputation-based anti-spam technology because they emanate from trusted sites not as aggressively filtered by such software.

It is amazing how what once seemed like a clever idea now plays an instrumental role in the balance of power of our adversarial attention economy. Naturally, all techniques like these fuel an arms race. Many will argue that we need better CAPTCHAs. That may be a short-term fix, but I wonder whether we need to stop being quite so dependent on models that are inherently adversarial.

Business Intelligence Goes Back to the Future

Post author By Daniel Tunkelang
Post date November 2, 2008

Seth Grimes has a great piece in Intelligent Enterprise entitled “BI at 50 Turns Back to the Future”. He reminds us of the vision that Hans Peter Luhn put forward for business intelligence back in 1958, in the IBM Journal article “A Business Intelligence System”. He further notes:

Ironically, the BI implementation Luhn described far more closely resembles the reference-library setting of the 1957 Katharine Hepburn-Spencer Tracy movie comedy “Desk Set” than it does any system operating today. Luhn wrote of an information requestor who “telephones the librarian and states the information wanted. The librarian will then interpret the inquiry and will solicit sufficient background information… This query document is transmitted to the auto-encoding device in machine-readable form. An information pattern is then derived,” and so on.

As I discussed before, the goal of an information access system, regardless of whether we are talking about enterprise search, business intelligence, or any other variation of information seeking, is to emulate a reference librarian. It’s nice to see that this is what the pioneers had in mind.