Catching Up With Hunch

Last week, I stopped by the Hunch office to learn more about what they’re doing, as well as to contribute my own thoughts about socially enhanced decision making. I consider Hunch, like Aardvark, to be an example of social search, but I recognize that I use the term in a broad sense. Perhaps, as Jeremy suggests, it’s better to think of social search and collaborative search as different aspects of multi-person search.

In any case, Hunch is doing some interesting things. Their mission, roughly speaking, is to become a Wikipedia for decision making. They are inspired by human computation success stories like 20Q.net and presumably the ESP Game. Their general approach is to learn about people by asking them multiple-choice questions that help cluster them demographically (“Teach Hunch About You”), and then to create customized decision trees to help people find their own answers to questions. The questions themselves are crowd-sourced from users (though now they are vetted first in a “workshop”).
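
To make the decision-tree mechanics concrete, here’s a minimal sketch of the data structure involved–not Hunch’s implementation, just an illustration in which internal nodes ask a multiple-choice question and leaves hold a recommendation:

```python
# A toy decision tree in the spirit of Hunch's Q&A flow (illustrative only).
tree = {
    "question": "Do you prefer fiction or nonfiction?",
    "answers": {
        "fiction": {
            "question": "Fast-paced or literary?",
            "answers": {
                "fast-paced": {"recommendation": "a thriller"},
                "literary": {"recommendation": "a classic novel"},
            },
        },
        "nonfiction": {"recommendation": "a popular-science book"},
    },
}

def decide(node):
    """Walk the tree, asking questions until we reach a recommendation."""
    while "recommendation" not in node:
        print(node["question"])
        choice = input("> ").strip().lower()
        if choice in node["answers"]:
            node = node["answers"][choice]
        # otherwise, loop and re-ask the same question
    return node["recommendation"]
```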

They’re learning as they go along. For example, they’ve recognized that it’s important to distinguish between objective questions (e.g., concerning the price of a product) and questions of taste (e.g., what is art?). They’re also experimenting with interface tweaks, including giving users more control over what information their algorithms use to rank potential answers, and allowing users to short-circuit the decision tree at any time by skipping to the end.

Perhaps of particular interest to readers here, they’ve made an API available, which you can also play with in a widget on their blog.

As I told my friend at Hunch, I’m still skeptical about decision trees. Maybe I’m a bit too biased toward faceted search, but I don’t like having such a rigid decision-making process. Apparently they’re not wedded to decision trees, but they are understandably concerned that a richer interface might turn off or intimidate ordinary users. I can’t deny that decision trees are simple to use, and I can’t argue with their 77% success rate.

Still, the rigidity of a decision tree leaves me a bit cold. Even if it leads me to the right choice, it doesn’t give me the necessary faith in that choice. Transparency helps, and I like that you can click on “Why did Hunch pick this?” to see what in your question-specific or personal profile led Hunch to recommend that answer. But I’d like more freedom and less hand-holding.

I still have a handful of invites; let me know if you’re interested. As usual, first come, first served.

Designing for Faceted Search

While I was inundated with conferences a couple of weeks ago, I missed a nice article by Stephanie Lemieux at User Interface Engineering (a site I recommend in general) entitled “Designing for Faceted Search”. It briefly explains faceted search and offers some usability tips. It’s not quite as comprehensive as my upcoming book, but it’s also free and somewhat less than 100 pages.

Of course, I’m delighted that she uses a couple of Endeca-powered examples (NCSU Libraries, Buzzilions). She also cites the Financial Times, but links to the ft.com (which I believe is powered by FAST, a subsidiary of Microsoft) rather than the recently launched Newssift, which uses Endeca.

Just one quibble: she says that “Just 3 facets with 5 terms each can represent 243 possible combinations.” I suspect she transposed the 3 and the 5: 243 = 3^5 would describe 5 facets with 3 terms each. The right number here is 5^3 = 125, since a combination represents 3 independent selections, each from 5 possible choices.
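
A quick sanity check in Python bears this out:

```python
# One selection from each of 3 facets, 5 terms apiece: 5**3 = 125.
from itertools import product

facets = [["t1", "t2", "t3", "t4", "t5"] for _ in range(3)]
print(len(list(product(*facets))))  # 125
```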

The Twouble with Twitter Search

There has been a flurry of reports about Twitter search–whether about Twitter’s plans to improve their search functionality or about alternative ways to search Twitter. But Danny Sullivan makes a great point in a recent post about Google:

Ironically, Google gets a taste of its own medicine with Twitter. It still can’t access the “firehose” of information that Twitter has, in order to build a decent real-time search service. If it can’t strike a deal, expect to hear the company start pushing on how “real-time data” should be open.

Of course, that logic applies not only to Google, but also to anyone with aspirations to build a better mousetrap for Twitter search. As things stand, applications can’t do much better than post-processing native Twitter search results–which makes it hard to offer any noticeable improvement on them. If Twitter offered full Boolean set retrieval (e.g., if a search for star trek returned the set of all tweets containing both words), then applications could implement lots of interesting algorithms and interfaces on top of their API. I’d love to work on exploratory search applications myself! But the trickle that Twitter returns is hardly enough.
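
For illustration, here’s a minimal sketch, on toy data, of the kind of full Boolean set retrieval I have in mind: build an inverted index over tweets and intersect postings lists.

```python
# Toy inverted index supporting Boolean AND over tweet text.
from collections import defaultdict

tweets = {
    1: "star trek opens this week",
    2: "i saw a shooting star",
    3: "star trek was great",
}

index = defaultdict(set)
for tweet_id, text in tweets.items():
    for term in text.lower().split():
        index[term].add(tweet_id)

def boolean_and(*terms):
    """Return the full set of tweet ids containing every term."""
    postings = [index[term] for term in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("star", "trek"))  # {1, 3}
```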

I believe this limitation is by design–that Twitter knows the value of such access and isn’t about to give it away. I just hope Twitter will figure out a way to provide this access for a price, and that an ecology of information access providers develops around it. Of course, if Google or Microsoft buys Twitter first, that probably won’t happen.

What Should I Say About Social Search?

I’ll be at the Enterprise Search Summit in New York next week, participating on a panel Tuesday morning to discuss “Emergent Social Search Experience”. Our game plan as a panel is to discuss what social search is, why it matters, and how to implement it.

Obviously these are broad questions, but here are my rough notes:

WHAT: Social search means many things, but all of them share one common thread: improving information seeking through the knowledge and efforts of other people. Back in the mid-90s, researchers distinguished between semantic and social navigation: exploring information based on its objective, semantic structure, versus choosing a perspective based on the activity of another person or group of people. Perhaps the earliest instance of social search was collaborative filtering, still popular today as a driver for product recommendations on sites like Amazon. But social search is much more than collaborative filtering. Building on the 90s vision of social navigation, we can give users full control over a social lens through which to view information, e.g., show me the local restaurants where women in my mom’s demographic like to eat brunch. Social search also includes explicit and implicit collaborative approaches, such as finding an expert to help you with a search, or building shared knowledge management artifacts that increase the collective efficiency of information seeking.
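
To make the social-lens idea concrete, here’s a toy sketch; the data, field names, and demographic slice are all invented for illustration.

```python
# A toy "social lens": average ratings restricted to a demographic slice.
ratings = [
    {"place": "Cafe Aurora", "meal": "brunch", "gender": "F", "age": 62, "stars": 5},
    {"place": "Cafe Aurora", "meal": "brunch", "gender": "M", "age": 30, "stars": 2},
    {"place": "Grill 99", "meal": "brunch", "gender": "F", "age": 58, "stars": 3},
]

def social_lens(ratings, meal, gender, min_age):
    """Average stars per place, seen only through one demographic's activity."""
    totals = {}
    for r in ratings:
        if r["meal"] == meal and r["gender"] == gender and r["age"] >= min_age:
            stars, count = totals.get(r["place"], (0, 0))
            totals[r["place"]] = (stars + r["stars"], count + 1)
    return {place: stars / count for place, (stars, count) in totals.items()}

print(social_lens(ratings, "brunch", "F", 55))
# {'Cafe Aurora': 5.0, 'Grill 99': 3.0}
```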

WHY: The “why” of social search depends on the specific aspect of social search that we’re discussing. But the common theme is this: we all know that, for a large swath of information needs, we prefer to turn to a person rather than to ask a machine. Sometimes that’s appropriate, and it’s a question of finding the right person to ask. But often we have no need to bother anyone; we just want to borrow someone else’s perspective—or to assemble a composite perspective. There’s an efficiency gain in not reinventing the wheel, as well as an upside of discovering people (or information by way of those people) who may be valuable to you in ways you didn’t anticipate.

HOW: Again, it depends on the aspect of social search. We need rich knowledge representations that treat both information and people as first-class objects, and interfaces that let people seamlessly use both. Endeca does this by supporting record relationship navigation for multiple entity types (e.g., documents, people), as do interfaces like David Huynh’s Freebase Parallax. To facilitate collective knowledge management, we need to make contribution both easy and rewarding: the reason people don’t contribute to such systems today is that the systems are onerous and don’t work. Some of the work Endeca has done with folksonomies is encouraging: we found that we can productively recycle folksonomies (or even search logs) in combination with automatic text mining techniques. Finally, we need to rethink our attitudes toward privacy, anonymity, and reputation. Consumer social networks like Facebook and Twitter have shown us that users are willing to forgo privacy in order to gain social benefits. Wikipedia has shown us that a group of strangers can assemble a valuable collective knowledge store. But Wikipedia, product reviews, blog comments, etc. have shown us that the default of anonymity can undermine the trust we have in these socially constructed artifacts. As we evolve these tools—and as we work to apply them within the enterprise—we need to simultaneously work to evolve our social norms.
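
As a crude illustration of treating documents and people as first-class, linked objects, consider the following sketch; the entities and fields are invented, and this is not Endeca’s data model.

```python
# Toy record relationship navigation across two entity types.
documents = {
    "doc1": {"title": "Faceted Search Patterns", "authors": ["alice"]},
    "doc2": {"title": "Social Navigation", "authors": ["alice", "bob"]},
}

def coauthors(person):
    """Navigate person -> documents -> people to surface collaborators."""
    found = set()
    for doc in documents.values():
        if person in doc["authors"]:
            found.update(a for a in doc["authors"] if a != person)
    return found

print(coauthors("alice"))  # {'bob'}
```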

Those are my thoughts. But, in the spirit of social search, I’d love to reach out to experts here for ideas. If you were attending a panel about social search, specifically in the context of an event targeted at enterprise search practitioners, what would you want to hear about? For that matter, if you were participating on such a panel, what would you talk about? Bear in mind that the audience will consist of practitioners, not researchers, and I’ll only have one third of a 45-minute session–some of that reserved for Q&A.

More Thoughts on Image Retrieval

After my recent posts about Google’s similarity browsing for images, a colleague reached out to me to educate me about some of the recent advances in image retrieval. This colleague is involved with an image retrieval startup and felt uncomfortable posting comments publicly, so we agreed that I would paraphrase them in a post under my own name. I thus accept accountability for the post, but cannot take credit for expertise or originality.

Some of the discussion in the comment threads mentioned scale-invariant feature transform (SIFT), an algorithm to detect and describe local features in images. What I don’t believe anyone mentioned is that this approach is patented–certainly a concern for people with commercial interest in image retrieval.
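
For readers who want to experiment, here’s a minimal sketch of SIFT keypoint matching. It assumes a build of OpenCV in which SIFT is available, and the image file names are placeholders.

```python
# Detect SIFT keypoints in two images and match them with Lowe's ratio test.
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("candidate.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Keep only matches that are clearly better than the runner-up.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches")
```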

There’s also the matter of scaling in a different sense–that is, handling large sets of images. People interested in this problem may want to look at “Scalable Recognition with a Vocabulary Tree” by David Nistér and Henrik Stewénius. They map image features to “visual words” using a hierarchical k-means approach. While mapping image retrieval to text retrieval approaches is not new, their large-vocabulary approach was novel and significantly improved scalability, while remaining robust to occlusion and to changes in viewpoint and lighting. The paper has been highly cited.
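
Here’s a rough sketch of the vocabulary-tree idea: recursively cluster descriptors with k-means, and let the path to a leaf identify the visual word. The clustering uses scikit-learn, and the branching factor, depth, and random descriptors are illustrative, not the paper’s.

```python
# Hierarchical k-means over local descriptors (a la vocabulary trees).
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branching=10, depth=3):
    """Recursively cluster descriptors into a tree of k-means models."""
    if depth == 0 or len(descriptors) < branching:
        return None  # leaf: the path to this node is the visual word
    km = KMeans(n_clusters=branching, n_init=10).fit(descriptors)
    children = [
        build_tree(descriptors[km.labels_ == i], branching, depth - 1)
        for i in range(branching)
    ]
    return {"km": km, "children": children}

def quantize(tree, descriptor, path=()):
    """Descend the tree greedily; return the visual-word path."""
    if tree is None:
        return path
    i = int(tree["km"].predict(descriptor.reshape(1, -1))[0])
    return quantize(tree["children"][i], descriptor, path + (i,))

descs = np.random.rand(2000, 128).astype(np.float32)  # stand-ins for SIFT vectors
tree = build_tree(descs)
print(quantize(tree, descs[0]))  # e.g., (3, 7, 1)
```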

But there are problems with this approach in practice. For example, images from cell phone cameras are low-quality and blurry, and Nistér and Stewénius’s approach is unfortunately not resilient to blur. Accuracy and latency are also challenges.

In general, some of the vision literature’s conclusions about which features work best don’t seem to hold up outside the lab. The reason may be that the images used for such experiments in the literature are of much higher quality than those in the field–particularly cell phone images.

An alternative to SIFT is “gist”, an approach based on global descriptors. This approach is not resilient to occlusion or rotation, but it scales much better than SIFT, and it may serve well for near-duplicate detection–a problem that, in my view, is a deal-breaker for applications like similarity browsing, and one that certainly afflicts Google’s current approach.
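
To make the global-descriptor pipeline concrete, here’s a crude near-duplicate detector. Real gist descriptors are built from banks of filter responses; the tiny normalized thumbnail below is just a stand-in, and the threshold is arbitrary.

```python
# Flag near-duplicate images by comparing crude global descriptors.
import numpy as np
from PIL import Image

def global_descriptor(path, size=(16, 16)):
    """A tiny normalized grayscale thumbnail as a stand-in for gist."""
    img = Image.open(path).convert("L").resize(size)
    vec = np.asarray(img, dtype=np.float32).ravel()
    return (vec - vec.mean()) / (vec.std() + 1e-6)

def near_duplicates(paths, threshold=4.0):
    descs = {p: global_descriptor(p) for p in paths}
    return [(a, b)
            for i, a in enumerate(paths)
            for b in paths[i + 1:]
            if np.linalg.norm(descs[a] - descs[b]) < threshold]
```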

In short, image retrieval is still a highly active area, and different approaches are optimized for different problems. I was delighted to have a recent guest post from AJ Shankar of Modista about their approach, and I encourage others to contribute their thoughts.

Playing With Wolfram Alpha

Woo hoo, I have preview access to Wolfram Alpha! I’ve only had a short time to play with it, but I can already report that my experience confirms my previously expressed expectations: the NLP is very brittle, but there’s great potential for structured queries on quantitative data. Here is an example use case that, in my view, shows Wolfram Alpha’s strengths:

[Screenshot: a Wolfram Alpha query comparing revenue, employees, and revenue per employee for Microsoft, Google, and Yahoo]

This bit of analysis tells a great story: Microsoft has almost three times as much revenue as Google, but Google has about 50% higher revenue per employee. Meanwhile, Yahoo is in third place on revenue, number of employees, and revenue per employee. Ouch.
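
A back-of-the-envelope check on those ratios, using round, illustrative numbers rather than the actual Wolfram Alpha figures:

```python
# Illustrative (not actual) figures, chosen to match the ratios above.
revenue = {"Microsoft": 60e9, "Google": 21e9, "Yahoo": 7e9}  # dollars
employees = {"Microsoft": 90_000, "Google": 21_000, "Yahoo": 13_000}

for company in revenue:
    print(f"{company}: ${revenue[company] / employees[company]:,.0f} per employee")
# Microsoft ~$667k, Google ~$1.0M (about 50% higher), Yahoo ~$538k.
```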

As I said, this query shows Wolfram Alpha favorably. What you don’t see are the false starts it took me to get this query to work. The NLP interface, in my view, is a really bad idea. Instead, Wolfram Alpha should be helping users generate good structured queries–and, better yet, helping other businesses build such queries through APIs. Wolfram Alpha could deliver an excellent plug-in for Excel, if they can expose a workable query API. I have no idea whether the company is able or willing to go down this path, but I hope someone there is listening to this free advice.
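
To be concrete about what I mean by a structured query, here’s the kind of request I’d want to issue programmatically. The API below is entirely hypothetical–an illustration, not anything Wolfram Alpha actually exposes.

```python
# A hypothetical structured-query payload: no NLP guessing required.
query = {
    "entities": ["Microsoft", "Google", "Yahoo"],
    "measures": ["revenue", "employees", "revenue / employees"],
    "format": "table",
}
# response = wolfram_alpha.query(**query)  # hypothetical client call
```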

I can’t share my account, but I’m willing to take suggestions for queries through the comment thread, and I’ll try my best to share what I learn.

Got Hate Tweets?

I had the novel experience today of discovering that someone set up a Twitter account for the sole purpose of harassing me personally. I’m not sure what exactly I did to deserve this honor, but I’m amused by the personal attention, since I’m hardly a Twitter celebrity. Perhaps it’s someone I know, conducting a social experiment to see how I’ll react. Ah, the wonder of online anonymity.

Wolfman vs. Googzilla

What’s not to love about a good fight? Check out David Talbot’s “Wolfram Alpha and Google Face Off” in Technology Review. I don’t come away with a sense that I’ll regularly use either Wolfram Alpha or Google Public Data, but it’s nice to start seeing people take them off-road and compare the results. The Wolfram Alpha launch is supposed to be this month, so presumably we’ll all be able to do that in a matter of weeks, if not days. Google Public Data is available already, integrated into web search.

Sadly, neither of these guys seems interested in providing a non-NLP interface. In my view, that would be far more useful. But I suppose it’s not what sells papers.

YouTube vs. Unauthorized Advertising

I was intrigued to see a flurry of posts today about how YouTube is cracking down on unauthorized advertising. Naturally, since YouTube’s raison d’être is to make videos available for free, they’d like a cut of any advertising revenue associated with the content they serve–particularly since they’re bleeding money to pay for bandwidth. But some uploaders work around Google’s revenue model for YouTube by embedding ads in the videos–in violation of YouTube’s terms of service.

Am I the only person who finds this situation comical, or at least a bit ironic? There’s been a lot of discussion about how newspapers–and publishers in general–are losing revenue because Google is monetizing their audiences through its own ads. But now the tables are turned, and it’s content publishers (though probably not the mainstream media) who are obtaining ad revenue at Google’s expense.

Google is certainly acting within its legal rights; the terms of service make it quite clear that YouTube prohibits unauthorized commercial use, with the noted exception of “uploading an original video to YouTube, or maintaining an original channel on YouTube, to promote your business or artistic enterprise”. In other words, you can promote yourself, but you can’t sell ads. Still, as Lawrence Lessig says, “code is law”, and Google will have an uphill battle to prevent unauthorized advertising without a lot of collateral damage.

I do hope that Google is experiencing a moment of empathy. As Google defends against this threat to YouTube’s business model, perhaps it will better understand what the newspaper industry is going through.

A Topology of Search Concepts

Vegard Sandvold has an interesting post entitled “Help Me Design a Topology of Search Concepts” in which he visualizes assorted search approaches in a two-dimensional space, the two dimensions being the degree of information accessibility and whether the approach is algorithm-powered or user-powered.

His four quadrants:

  • Low information accessibility + algorithm-powered = simple search (e.g., keyword search)
  • Low information accessibility + user-powered = superficial search (e.g., collaborative filtering)
  • High information accessibility + algorithm-powered = ingenious search (e.g., question answering)
  • High information accessibility + user-powered = diligent search (e.g., faceted search)
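
For concreteness, here’s a toy encoding of the two axes and four quadrants; the boolean/string coordinates are just illustrative.

```python
# Map (information accessibility, power source) to Sandvold's quadrant names.
QUADRANTS = {
    (False, "algorithm"): "simple search",    # e.g., keyword search
    (False, "user"): "superficial search",    # e.g., collaborative filtering
    (True, "algorithm"): "ingenious search",  # e.g., question answering
    (True, "user"): "diligent search",        # e.g., faceted search
}

def classify(high_accessibility: bool, powered_by: str) -> str:
    return QUADRANTS[(high_accessibility, powered_by)]

print(classify(True, "user"))  # diligent search
```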

I’m not sure how I feel about the quadrant names (though I like how my employer and I are champions of diligence!), but I do like this attempt to lay out different approaches to supporting information seeking, and I like his choice of axes.

More importantly, I hope this analysis helps advance our ability as technologists to match solutions to information seeking problems. Many of us have an intuitive sense of how to do so, but I rarely see principled arguments–particularly from vendors who may be reluctant to forgo any use case that could translate into revenue.

Of course, it would be nice to quantify these axes, or at least to formalize them a bit more rigorously. For example, how do we measure the amount of user input into the process–particularly for applications that may involve human input at both indexing and query time? Or how do we measure information accessibility in a corpus that might include junk (e.g., spam)?

Still, this is a nice start as a framework, and I’d be delighted to see it evolve into a tool that helps people make technology decisions.