Category: General

General posts, typically analyzing HCIR issues.

The Word of the Day is…Ambient

Post author By Daniel Tunkelang
Post date November 9, 2008

No, not Ambien or ambiance. but ambient as in ambient findability.

Two items caught my attention this weekend. The first was a post by Oscar Berg at The Content Economy about ambient awareness and findability. The second was a presentation by Marianne Sweeny, posted at Ambient Insight, about SEO for Web 2.0.

An excerpt from Oscar’s post:

I am however more fond of the term “ambient awareness” and I am especially interested in how ambient awareness relates to findability which has traditionally been focused mainly on active methods of finding information such as searching and browsing.

I dare to say that humans are lazy by nature and that we are likely to use the method that requires the least effort when we look for information. We even tend to use less reliable information if it’s just easy to find and use. Instead of actively looking for information we prefer to passively monitor the flow of information in our environment. In fact, some say that actively looking for information is a relatively new phenomenon in human history. So, just being in an environment and becoming passively aware about things that happen in it is something we find very natural and convenient.

It’s an interesting point. Most of the systems we build for finding information presume an active information-seeking motive, but perhaps such systems are not optimizing for the way people are used to obtaining information. Still, I think that, until systems can passively surmise what information people need, we are stuck with requiring at least some active expression of that intent.

That leads us to the Sweeny presentation. It traces the history of search from an SEO point of view:

Human-Mediated
Human-Mediated plus Catalogs
Machine-Mediated
Human-Directed / Machine-Mediated
Human-Like Machine Mediation (aspirational)

It’s a nice presentation, and I recommend you give it a look. I’m delighted to see someone in the SEO community express a version of history and vision that is largely in line with that of the information seeking support folks.

General

Another Difference Between Enterprise Search and Web Search

Post author By Daniel Tunkelang
Post date November 6, 2008
14 Comments on Another Difference Between Enterprise Search and Web Search

As long-time readers know, one of my recurring themes is that there is a world of difference between web search and enterprise search–at least as those concepts are understood today. The other day, I had a conversation with my friend Carl Eklof, and we arrived at an aspect of that difference that I have at best understated in the past. Let me try to elaborate it now.

In web search, the immediate results for a query are pages on web sites. But these pages aren’t necessarily “documents”. In fact, the most popular web sites are portals or destinations, designed to help a user shop, research specialized information, communicate with other people, etc. When a web search takes a user to a page on such a site, the site (if it is well designed) takes on the responsibility for contextualizing the user’s experience.

In contrast, enterprise content often consists of a heterogeneous collection of content whose organization is at best implicit in its physical and logical arrangement. Departments within an enterprise may build user-centered portals, but it’s rare to see the sort of symbiosis that occurs between web search engines and the sites they index.

As a result, one of the challenges of an enterprise search application is that it must deliver a holistic user experience that compensates for the lack of effort on the part of the documents it indexes. Users still need context and guidance, but now the responsibility falls almost entirely on the search engine to deliver it.

Admittedly this picture is oversimplified. I don’t even like the term “enterprise search” because it’s often construed so narrowly. But I realize that many folks struggle with the idea that finding information within a proprietary document collection could be harder than doing so on the web. I hope this explanation helps shed some light.

General

Modista: Similarity Browsing…for Shoes!

Post author By Daniel Tunkelang
Post date November 5, 2008
1 Comment on Modista: Similarity Browsing…for Shoes!

Let me start with a disclaimer. My idea of “finding shoes” is finding the one pair of shoes I own in the closet. In general, I’m not much of a shopper, let alone a shoe shopper.

That said, I really love what Arlo Faria and AJ Shankar, two Berkeley PhD students on leave, have done with Modista. In their own words:

Modista simplifies online shopping by searching inventories across multiple retailers and displaying results in an intuitive interface. Our patent-pending technology organizes items according to their visual similarity using digital image processing and machine learning algorithms.

All that is true, but it doesn’t capture what makes Modista cool. Modista delivers what m c schraefel calls the “joy of search”. Even for someone like me who only buys classic black loafers, they’ve created a fun exploratory experience. To see what a real shoe-shopper thinks of it, check out this post at ShoeBlog.

I’ve been skeptical of both similarity browsing and visual search. I’m still skeptical about the breadth of either techinque’s applicability. But I am impressed with this application.

General

Transparency 2.0

Anyone who doubts the impact of blogging, Twitter, and other Web 2.0 technologies would do well to read yesterday’s New York Times article, “In Era of Blog Sniping, Companies Shoot First“.

While the article focuses on the more drastic aspects of corporate communication (“In the age of transparency, the layoff will be blogged”), there is a larger point here. NDAs not withstanding, employees talk–especially disgruntled employees who have lost or are about to lose their jobs. Even before Web 2.0, there were sites that encouraged anonymous tipsters to supply news of companies experiencing financial or moral difficulty. But blogs and Twitter have made the propagation of juicy information almost instantaneous.

Our notions of privacy and secrecy are changing as we no longer have privacy through difficulty. Many people–as well as governments and institutions–are reacting with alarm, trying to find ways to safeguard individual or corporate confidentiality in an age of hypercommunication. Perhaps we would do better to accept that privacy as we used to know it is lost, and come up with legal and social norms that reflect the world we live in today.

General

Why Does Latent Semantic Analysis Work?

Post author By Daniel Tunkelang
Post date November 1, 2008
2 Comments on Why Does Latent Semantic Analysis Work?

Warning to regular readers: this post is a bit more theoretical than average for this blog.

Peter Turney has a nice post today about “SVD, Variance, and Sparsity“. It’s actually a follow-up to a post last year entitled “Why Does SVD Improve Similarity Measurement?” that apparently has remained popular despite its old age in blog years.

For readers unfamiliar with singular value decomposition (SVD), I suggest a brief detour to the Wikipedia entry on latent semantic analysis (also known as latent semantic indexing). In a nutshell, latent semantic analysis is an information retrieval techinque that applies SVD to the term-document matrix of a corpus in order to reduce this sparse, high-dimensional matrix to a denser, lower-dimensional matrix whose dimensions correspond to the “latent” topics in the corpus.

Back to Peter’s thesis. He’s observed that document similarity is more accurate in the lower-dimensional vector space produced by SVD than in the space defined by the original term-document matrix. This isn’t immediately obvious; after all, SVD is a lossy approximation of the term-document matrix, so you might expect accuracy to decrease.

In his 2007 post, Peter offers three hypotheses for why SVD improves the similarity measure:

High-order co-occurrence: Dimension reduction with SVD is sensitive to high-order co-occurrence information (indirect association) that is ignored by PMI-IR and cosine similarity measures without SVD. This high-order co-occurrence information improves the similarity measure.
Latent meaning: Dimension reduction with SVD creates a (relatively) low-dimensional linear mapping between row space (words) and column space (chunks of text, such as sentences, paragraphs, or documents). This low-dimensional mapping captures the latent (hidden) meaning in the words and the chunks. Limiting the number of latent dimensions forces a greater correspondence between words and chunks. This forced correspondence between words and chunks (the simultaneous equation constraint) improves the similarity measurement.
Noise reduction: Dimension reduction with SVD removes random noise from the matrix (it smooths the matrix). The raw matrix contains a mixture of signal and noise. A low-dimensional linear model captures the signal and removes the noise. (This is like fitting a messy scatterplot with a clean linear regression equation.)

In today’s follow-up post, he drills down on this third hypothesis, noting that noise can come from either variance and sparsity. He then proposes independently adjusting the sparsity-smoothing and variance-smoothing effects of SVD to split this third hypothesis into two sub-hypotheses.

I haven’t done much work with latent semantic analysis. But work that I’ve done with other statistical information retrieval techinques, such as using Kullback-Leibler divergence to measure the signal of a document set, suggest a similar benefit from preprocessing steps that reduce noise. Now I’m curious about the relative benefits of variance vs. sparsity reduction.

General

Is Search Advertising Recession-Proof?

Post author By Daniel Tunkelang
Post date November 1, 2008

A recent article asked, “Is Search Recsssion-Proof?” The author cited research claiming that the average person spends an additional 87 minutes per day online during a recession. In more detail:

Apparently the surfing the Web is a form of “escapism” for consumers, with 76% of respondents citing the Internet as their primary means of escape — over such activities as reading books, watching movies, and taking walks. Furthermore, search engines were selected more than any other “types of websites visited more frequently during a recession” — nearly double the number of social networking sites.

And Google is showing an increase in paid search clicks–up 18% in Q3 over last year. That is cetainly good news for web search companies and the ecosystem of ad-supported sites that depend on them.

But online ad rates aren’t so oblivious to the broader economy. According to an article in Wired, online ad prices have recently hit the skids. Here’s a snap shot of ad rates over the past year.

While the relationship to the stock market is hardly perfect, there certainly seems to be some correlation. At the very least, it would give me pause before going around claiming that search advertising is recession-proof.

General

HCIR ’08: A Great Interaction!

Post author By Daniel Tunkelang
Post date October 24, 2008

I’m back from HCIR ’08 and pleased to report that it was a rousing success. We had about 40 attendees, including such HCIR luminaries as Gary Marchionini, Marti Hearst, and mc schraefel. Microsoft Research supplied us not only with space and great food, but also workshop co-chair Ryen White and keynote speaker Sue Dumais, not to mention distinguished attendees Ken Church and Ashok Chandra. With a group like that, it was clear we were in for a great workshop.

And a great workshop we had! Some highlights:

Sue Dumais’s keynote on “Thinking Outside the (Search) Box” reviewed a variety of projects she and her colleagues at MSR have pursued in personal and personalized information retrieval.
Marti Hearst discussed design issues in faceted search interfaces, as well as extensions to the faceted model.
Steven Voida discussed a novel activitiy-based approach to personal information and task management.

At the beginning of the day, program chair Bill Kules had us write up our top HCIR concerns on post-it notes. We clustered these to form the basis for four breakout groups that discussed interactivity, task / workflow integration, sharing/collaboration, and results presentation. The results of these discussions, as well as the accepted papers, will be published online soon.

The workshop room also served as a space for posters. Posters were displayed throughout the day, and attendees congregated around posters and demos during the various breaks between sessions.

We concluded the workshop by soliciting feedback on how to improve it for next year. A fair number of attendees expressed interest in making the structure less formal, reducing the time spent on presentations and increasing the time available for more informal interaction. A number of folks expressed interest in continuing the discussions online, though there was no consensus on the best forum for doing so.

All in all, I was delighted by the energy of the group, and I believe that these workshops are helping to support HCIR efforts in both academia and industry.

Finally, I was delighted to recognize a number of Noisy Channel readers among the attendees. Conversely, Raman Chandrasekar was nice enough to (inadvertently) advertise this blog by leaving this screen up for a few minutes at the end of his presentation while he took Q&A:

I’ll keep folks here posted as more materials from the workshop become available. And, of course, you’ll be among the first to know where and when HCIR ’09 will take place.

General

In Defense of Web 2.0

Dave Kellogg has a nice post entitled “Web 2.Over?” in which he eloquently reviews the various reasons that most web 2.0 startups are “in for a reality check”.

But what I liked most about the post was his defense of the spirit of web 2.0:

While a swarm of eyeball-catching, oddly-named, twenty-something-led startups may get obliterated — outside venture circles at least — that wasn’t the point of web 2.0. To me, web 2.0 was, is, and remains an important collection of concepts that will endure:

A read/write web, where we can participate, update, annotate, comment, etc.

A social web, where there is awareness of relationships that can be leveraged appropriately

User-generated content, which is here to stay and always has been (think: radio call-in shows, Kids Say the Darndest Things, or America’s Funniest Home Videos)

The use of the web for communication and entertainment. People are natural communicators. We will always adapt our tools to that fundamental need.

A personalized web, that understands what we like and how we like to get it

Amen! The good news is that there is no turning back on this vision of a more interactive online medium. Today it’s blogs and tweets; tomorrow it may be something we haven’t even imagined. But, now that an increasing number of us fancy ourselves as publishers and communicators, I don’t see us giving up that power without a fight.

General

Disincenting Spam

Post author By Daniel Tunkelang
Post date October 20, 2008

Greg called my attention today to news that Digg is shifting from popularity-based aggregation to personalized news. I can’t say I’m thrilled at the prospect of a system that “would make guesses about what [users] like based on information mined from the giant demographic veins of social networks”. I don’t suppose the results are necessarily worse than showing users stories based solely on their popularity, but at least the latter offers me some transparency.

But it was an older post Greg pointed to that caught my attention: “Combating web spam with personalization“. Here is his argument in a nutshell:

Personalized search shows different search results to different people based on their history and their interests. Not only does this increase the relevance of the search results, but also it makes the search results harder to spam.

In this 2006 post, Greg is specifically referring to the personalized search that Google was beta testing back in 2004. Google has since implemented personalized search, but without sharing much detail about how it works.

Nonetheless, Greg’s argument reminds me of one of the first posts I wrote on this blog. I was criticizing Google’s approach of keeping its relevance approach secret and particularly the argument that Amit Singhal has advanced to justify it–that the subjectivity of relevance makes it harder to develop an open approach to relevance. My response: “the subjectivity of relevance should make the adversarial problem easier rather than harder, as has been observed in the security industry”.

I suppose personalization can help fight spam even if it is not coupled with transparency to the user. But what a great opportunity to do both by providing more user control over the information seeking process.