Categories
Uncategorized

The Long Tail of Search

The “long tail” is one of the most abused buzzwords of recent years, and I hesitate to use it myself in respectable company.

Nonetheless, SEO veteran Dustin Woodard has a nice guest post at the Hitwise Intelligence blog entitled “Sizing Up the Long Tail of Search”. Here are some statistics he cites about the distribution of search term frequency for web search data collected by Hitwise:

 

  • Top 100 terms: 5.7% of all search traffic
  • Top 500 terms: 8.9% of all search traffic
  • Top 1,000 terms: 10.6% of all search traffic
  • Top 10,000 terms: 18.5% of all search traffic

It’s nice to see concrete data to validate conventional wisdom. Of course, I’d be curious to see the corresponding distribution of ad revenue associated with terms.
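To make the size of the tail explicit, here is a minimal sketch that turns the figures above into their complements, i.e. the share of traffic coming from terms outside each head. Only the four percentages come from the post; the names `top_n_share` and `tail_share` are mine and purely illustrative.

```python
# Cumulative share (in percent) of all search traffic captured by the
# top-N terms, as cited in the post.
top_n_share = {
    100: 5.7,
    500: 8.9,
    1_000: 10.6,
    10_000: 18.5,
}


def tail_share(head_share_pct):
    """Return the percentage of traffic from terms outside the head."""
    return round(100.0 - head_share_pct, 1)


# Share of traffic that falls outside each top-N head.
tail = {n: tail_share(pct) for n, pct in top_n_share.items()}

for n, pct in tail.items():
    print(f"Terms outside the top {n:,}: {pct}% of all search traffic")
```

Read this way, even the 10,000 most frequent terms leave more than four fifths of the traffic to the tail.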

IRF Symposium on Patent Retrieval

Thanks to Jeff for writing up notes on the annual IR Facility Symposium 2008. Related links:

Daniel Lemire on What Makes Database Indexes Work

Daniel Lemire has a great post today entitled “Understanding what makes database indexes work”. There’s nothing that should be surprising for folks who live and breathe this stuff, but it’s a great introduction for those who don’t. Here are his bullet points:

  1. You expect specific queries: restructure your data!
  2. You expect specific queries: materialize them!
  3. You expect specific queries: redundancy is (sometimes) your friend
  4. Use multiresolution!
  5. Your data is not random: compress it!
  6. In any case: optimize your code

Read his post to get the details.
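To give a flavor of two of these points (redundancy for expected queries, and compressing non-random data), here is a minimal sketch of my own. It is not taken from Lemire’s post: the toy table, `build_index`, and `run_length_encode` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical toy table: (row_id, country) pairs.
rows = [(0, "CA"), (1, "US"), (2, "US"), (3, "CA"), (4, "US")]


def build_index(rows):
    """Map each value to the list of row ids that contain it.

    The index duplicates information already in the table, but it
    answers the expected query (an equality lookup on country)
    without scanning every row.
    """
    index = defaultdict(list)
    for row_id, value in rows:
        index[value].append(row_id)
    return dict(index)


def run_length_encode(values):
    """Compress a sequence into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


index = build_index(rows)

# Sorting the column groups equal values together, so the runs are long
# and run-length encoding pays off.
sorted_column = sorted(value for _, value in rows)
encoded = run_length_encode(sorted_column)
```

The lookup `index["US"]` replaces a full scan, and the sorted column shrinks to one pair per distinct value.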

No Correlation Between Reading Difficulty and Popularity?

Paul Ogilvie just started blogging at mSpoke, and his first post asks “What makes a blog post popular? Part I: Comparing popularity and reading difficulty”. Specifically, he explores “whether well-written feed items are more likely to receive attention than poorly-written ones”. At the risk of stealing his thunder, I’ll deliver the punchline: he found no correlations between surface features of reading difficulty and popularity. Fortunately, he’s not planning to give up on writing quality!

Like Paul, I find that the absence of correlation goes against common sense. I’m curious whether the problem is the measures he’s using (which he admits are crude), or other factors that confound the popularity statistics.

Via Jon Elsas.

Knowledge Management is a Process

Kudos to Lynda Moulton at the Enterprise Search Practice Blog for a post entitled “Apples and Orangutans: Enterprise Search and Knowledge Management”. She criticises some commentary in CIO Magazine that “search is being implemented in enterprises as the new knowledge management”. Her thesis in a nutshell is that “knowledge management (KM) is not now, nor has it ever been, a software product or even a suite of products”. If that’s not enough to get you to read the full article, here’s an excerpt:

Because I follow enterprise search for the Gilbane Group while maintaining a separate consulting practice in knowledge management, I am struggling with his conflation of the two terms or even the migration of one to the other. The search we talk about is a set of software technologies that retrieve content. I’m tired of the debate about the terminology “enterprise search” vs. “behind the firewall search.” I tell vendors and buyers that my focus is on software products supporting search executed within (or from outside looking in) the enterprise on content that originates from within the enterprise or that is collected by the enterprise. I don’t judge whether the product is for an exclusive domain, content type or audience, or whether it is deployed with the “intent” of finding and retrieving every last scrap of content lying around the enterprise. It never does nor will do the latter but if that is what an enterprise aspires to, theirs is a judgment call I might help them re-evaluate in consultation.

MIT Talk Now Available Online

I haven’t gotten to look at it from my wi-fi connection on the Amtrak Acela (which is clearly a work in progress, but nonetheless a very exciting development), but here’s a link to the video.

Note that you’ll have to install Microsoft Silverlight to view it.

Alternatively, you can watch the slideshow below, but I’m not sure how much sense it will make without the voice-over.

http://static.slideshare.net/swf/ssplayer2.swf?doc=set-retrieval-20-1225903234513602-9&stripped_title=set-retrieval-20-presentation

Google Defends Its Appliance, Part 2

Ron Miller at Fierce Content Management has published an interview with Google Search Appliance product manager Nitin Mangtani to “ask some questions of [his] own about [Mangtani’s] reason for publishing the Forbes piece and his rebuttal to Google Search Appliance critics.” I’m still not convinced by Mangtani’s relevance-centric pitch, but I’ll let you read it and draw your own conclusions. I’ve reached out to Ron Miller and look forward to talking with him about a more HCIR-centric vision of enterprise search.

Giving a Talk at MIT Tomorrow

My traffic reports tell me that a number of readers are in the Boston/Cambridge area. For those who are interested, I’m giving a talk tomorrow at MIT:

Set Retrieval 2.0

Speaker: Daniel Tunkelang, Endeca
Date: Tuesday, November 4, 2008
Time: 11:00AM to 12:00PM
Refreshments: 10:45AM
Location: Patil Conference Room (32-G449)
Host: Rob Miller, MIT CSAIL

The abstract is available here. The talk is open to the public, and is part of a seminar series designed to attract a mixed audience of academic and industry types. The talk is being recorded, and I’ll post a link to the video when it is available.

CAPTCHA Me If You Can

An article in ComputerWorld today notes a recent report from security vendor MessageLabs that the proliferation of CAPTCHA-breaking tools has resulted in a surge of spam. Worse:

The report paints a cascade effect, which has allowed hole in CAPTCHA to let criminals set up large numbers of fake blogs and content, which are then used to feed bogus profiles to social networking systems. Messages and requests from these domains are a simple way around reputation-based anti-spam technology because they emanate from trusted sites not as aggressively filtered by such software.

It is amazing how what once seemed like a clever idea now plays an instrumental role in the balance of power of our adversarial attention economy. Naturally, all techniques like these fuel an arms race. Many will argue that we need better CAPTCHAs. That may be a short-term fix, but I wonder if we need to stop being quite so dependent on models that are inherently adversarial.

Business Intelligence Goes Back to the Future

Seth Grimes has a great piece in Intelligent Enterprise entitled “BI at 50 Turns Back to the Future”. He reminds us of the vision that Hans Peter Luhn put forward for business intelligence back in 1958, in the IBM Journal article “A Business Intelligence System”. He further notes:

Ironically, the BI implementation Luhn described far more closely resembles the reference-library setting of the 1957 Katharine Hepburn-Spencer Tracy movie comedy “Desk Set” than it does any system operating today. Luhn wrote of an information requestor who “telephones the librarian and states the information wanted. The librarian will then interpret the inquiry and will solicit sufficient background information… This query document is transmitted to the auto-encoding device in machine-readable form. An information pattern is then derived,” and so on. 

As I discussed before, the goal of an information access system, regardless of whether we are talking about enterprise search, business intelligence, or any other variation of information seeking, is to emulate a reference librarian. It’s nice to see that this is what the pioneers had in mind.