
Knowledge Management is a Process

Kudos to Lynda Moulton at the Enterprise Search Practice Blog for a post entitled “Apples and Orangutans: Enterprise Search and Knowledge Management“. She criticizes some commentary in CIO Magazine suggesting that “search is being implemented in enterprises as the new knowledge management”. Her thesis in a nutshell is that “knowledge management (KM) is not now, nor has it ever been, a software product or even a suite of products”. If that’s not enough to get you to read the full article, here’s an excerpt:

Because I follow enterprise search for the Gilbane Group while maintaining a separate consulting practice in knowledge management, I am struggling with his conflation of the two terms or even the migration of one to the other. The search we talk about is a set of software technologies that retrieve content. I’m tired of the debate about the terminology “enterprise search” vs. “behind the firewall search.” I tell vendors and buyers that my focus is on software products supporting search executed within (or from outside looking in) the enterprise on content that originates from within the enterprise or that is collected by the enterprise. I don’t judge whether the product is for an exclusive domain, content type or audience, or whether it is deployed with the “intent” of finding and retrieving every last scrap of content lying around the enterprise. It never does nor will do the latter but if that is what an enterprise aspires to, theirs is a judgment call I might help them re-evaluate in consultation.

MIT Talk Now Available Online

I haven’t yet had a chance to watch it over the wi-fi connection on the Amtrak Acela (a connection that is clearly a work in progress, but nonetheless a very exciting development), but here’s a link to the video.

Note that you’ll have to install Microsoft Silverlight to view it.

Alternatively, you can watch the slideshow below, but I’m not sure how much sense it will make without the voice-over.

http://static.slideshare.net/swf/ssplayer2.swf?doc=set-retrieval-20-1225903234513602-9&stripped_title=set-retrieval-20-presentation

Google Defends Its Appliance, Part 2

Ron Miller at Fierce Content Management has published an interview with Nitin Mangtani of the Google Search Appliance team to “ask some questions of [his] own about [Mangtani’s] reason for publishing the Forbes piece and his rebuttal to Google Search Appliance critics.” I’m still not convinced by Mangtani’s relevance-centric pitch, but I’ll let you read it and draw your own conclusions. I’ve reached out to Ron Miller and look forward to talking with him about a more HCIR-centric vision of enterprise search.

Giving a Talk at MIT Tomorrow

My traffic reports tell me that a number of readers are in the Boston/Cambridge area. For those who are interested, I’m giving a talk tomorrow at MIT:

Set Retrieval 2.0

Speaker: Daniel Tunkelang, Endeca
Date: Tuesday, November 4, 2008
Time: 11:00AM to 12:00PM
Refreshments: 10:45AM
Location: Patil Conference Room (32-G449)
Host: Rob Miller, MIT CSAIL

The abstract is available here. The talk is open to the public, and is part of a seminar series designed to attract a mixed audience of academic and industry types. The talk is being recorded, and I’ll post a link to the video when it is available.

CAPTCHA Me If You Can

An article in ComputerWorld today notes a recent report from security vendor MessageLabs finding that the proliferation of CAPTCHA-breaking tools has resulted in a surge of spam. Worse:

The report paints a cascade effect, in which holes in CAPTCHA have let criminals set up large numbers of fake blogs and content, which are then used to feed bogus profiles to social networking systems. Messages and requests from these domains are a simple way around reputation-based anti-spam technology because they emanate from trusted sites that are not as aggressively filtered by such software.

It is amazing how what once seemed like a clever idea now plays an instrumental role in the balance of power of our adversarial attention economy. Naturally, all techniques like these fuel an arms race. Many will argue that we need better CAPTCHAs, and perhaps that’s a short-term fix, but I wonder if we need to stop being quite so dependent on models that are inherently adversarial.

Business Intelligence Goes Back to the Future

Seth Grimes has a great piece in Intelligent Enterprise entitled “BI at 50 Turns Back to the Future“. He reminds us of the vision that Hans Peter Luhn put forward for business intelligence back in 1958, in the IBM Journal article “A Business Intelligence System“. He further notes:

Ironically, the BI implementation Luhn described far more closely resembles the reference-library setting of the 1957 Katharine Hepburn-Spencer Tracy movie comedy “Desk Set” than it does any system operating today. Luhn wrote of an information requestor who “telephones the librarian and states the information wanted. The librarian will then interpret the inquiry and will solicit sufficient background information… This query document is transmitted to the auto-encoding device in machine-readable form. An information pattern is then derived,” and so on. 

As I discussed before, the goal of an information access system, regardless of whether we are talking about enterprise search, business intelligence, or any other variation of information seeking, is to emulate a reference librarian. It’s nice to see that this is what the pioneers had in mind.

Why Does Latent Semantic Analysis Work?

Warning to regular readers: this post is a bit more theoretical than average for this blog.

Peter Turney has a nice post today about “SVD, Variance, and Sparsity“. It’s actually a follow-up to a post last year entitled “Why Does SVD Improve Similarity Measurement?” that apparently has remained popular despite its old age in blog years.

For readers unfamiliar with singular value decomposition (SVD), I suggest a brief detour to the Wikipedia entry on latent semantic analysis (also known as latent semantic indexing). In a nutshell, latent semantic analysis is an information retrieval technique that applies SVD to the term-document matrix of a corpus in order to reduce this sparse, high-dimensional matrix to a denser, lower-dimensional matrix whose dimensions correspond to the “latent” topics in the corpus.
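
To make the mechanics concrete, here is a minimal sketch of the idea (my own toy example, not from either post): factor a term-document count matrix with SVD, keep only the top k singular values, and compare documents in the reduced space. The vocabulary, counts, and choice of k = 2 are all made up for illustration.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents,
# entries are raw counts. (A real system would use TF-IDF weighting
# and a far larger corpus.)
A = np.array([
    [2, 1, 0, 0],   # "apple"
    [1, 2, 0, 0],   # "orange"
    [0, 0, 2, 1],   # "search"
    [0, 0, 1, 2],   # "retrieval"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of "latent" topic dimensions to keep
# Each column of docs_k represents a document in the k-dimensional
# latent space.
docs_k = np.diag(s[:k]) @ Vt[:k, :]

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(docs_k[:, 0], docs_k[:, 1]))  # same latent topic: high
print(cosine(docs_k[:, 0], docs_k[:, 2]))  # different topics: low
```

In this toy case the latent space pushes the similarity of documents 0 and 1 up from 0.8 (their cosine in the raw term space) to 1.0, a small illustration of the smoothing effect under discussion.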

Back to Peter’s thesis. He’s observed that document similarity is more accurate in the lower-dimensional vector space produced by SVD than in the space defined by the original term-document matrix. This isn’t immediately obvious; after all, SVD is a lossy approximation of the term-document matrix, so you might expect accuracy to decrease.

In his 2007 post, Peter offers three hypotheses for why SVD improves the similarity measure:

  1. High-order co-occurrence: Dimension reduction with SVD is sensitive to high-order co-occurrence information (indirect association) that is ignored by PMI-IR and cosine similarity measures without SVD. This high-order co-occurrence information improves the similarity measure.
  2. Latent meaning: Dimension reduction with SVD creates a (relatively) low-dimensional linear mapping between row space (words) and column space (chunks of text, such as sentences, paragraphs, or documents). This low-dimensional mapping captures the latent (hidden) meaning in the words and the chunks. Limiting the number of latent dimensions forces a greater correspondence between words and chunks. This forced correspondence between words and chunks (the simultaneous equation constraint) improves the similarity measurement.
  3. Noise reduction: Dimension reduction with SVD removes random noise from the matrix (it smooths the matrix). The raw matrix contains a mixture of signal and noise. A low-dimensional linear model captures the signal and removes the noise. (This is like fitting a messy scatterplot with a clean linear regression equation.)

In today’s follow-up post, he drills down on this third hypothesis, noting that noise can come from either variance or sparsity. He then proposes independently adjusting the sparsity-smoothing and variance-smoothing effects of SVD to split this third hypothesis into two sub-hypotheses.
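
A quick way to see the noise-reduction hypothesis in action is a toy experiment (my own sketch, not Peter’s setup): start from a low-rank “signal” matrix, add random noise, and check that the truncated SVD reconstruction sits closer to the signal than the noisy matrix does. The matrix sizes, rank, and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-5 "signal" matrix, plus dense random noise.
signal = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))
noisy = signal + 0.5 * rng.normal(size=signal.shape)

# Truncated SVD: keep only the top k singular values/vectors.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 5
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction should be closer to the signal than the
# raw noisy matrix is.
print(np.linalg.norm(noisy - signal))     # error before smoothing
print(np.linalg.norm(denoised - signal))  # noticeably smaller error
```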

I haven’t done much work with latent semantic analysis. But work that I’ve done with other statistical information retrieval techniques, such as using Kullback-Leibler divergence to measure the signal of a document set, suggests a similar benefit from preprocessing steps that reduce noise. Now I’m curious about the relative benefits of variance vs. sparsity reduction.
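
For what it’s worth, here is the flavor of measurement I have in mind, sketched with made-up inputs (the Dirichlet-style smoothing constant and the toy distributions are my assumptions, not a description of any particular system): compute the KL divergence between a document set’s term distribution and the corpus-wide background distribution, treating a larger divergence as a sign of a more focused, higher-signal set.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def term_distribution(counts, background, mu=100.0):
    """Dirichlet-smoothed term distribution for a document set.

    mu is an illustrative smoothing constant, not a tuned value."""
    counts = np.asarray(counts, dtype=float)
    return (counts + mu * background) / (counts.sum() + mu)

# Toy corpus-wide distribution over a four-word vocabulary.
background = np.array([0.4, 0.3, 0.2, 0.1])

focused = term_distribution([0, 2, 50, 48], background)    # topical set
diffuse = term_distribution([40, 30, 20, 10], background)  # corpus-like set

print(kl_divergence(focused, background))  # larger: stronger signal
print(kl_divergence(diffuse, background))  # zero here: no signal
```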

Is Search Advertising Recession-Proof?

A recent article asked, “Is Search Recession-Proof?” The author cited research claiming that the average person spends an additional 87 minutes per day online during a recession. In more detail:

Apparently surfing the Web is a form of “escapism” for consumers, with 76% of respondents citing the Internet as their primary means of escape — over such activities as reading books, watching movies, and taking walks. Furthermore, search engines were selected more than any other “types of websites visited more frequently during a recession” — nearly double the number of social networking sites.

And Google is showing an increase in paid search clicks, up 18% in Q3 over last year. That is certainly good news for web search companies and the ecosystem of ad-supported sites that depend on them.

But online ad rates aren’t immune to the broader economy. According to an article in Wired, online ad prices have recently hit the skids. Here’s a snapshot of ad rates over the past year.

While the relationship to the stock market is hardly perfect, there certainly seems to be some correlation. At the very least, it would give me pause before going around claiming that search advertising is recession-proof.