
Holding Back the Rise of the Machines?

Amazon’s Mechanical Turk is one of my favorite examples of leveraging the internet for innovation:

Amazon Mechanical Turk is a marketplace for work that requires human intelligence. The Mechanical Turk web service enables companies to programmatically access this marketplace and a diverse, on-demand workforce. Developers can leverage this service to build human intelligence directly into their applications.

But, in my view, Mechanical Turk does not take its vision far enough. In the conditions of use, Amazon makes it clear that only humans need apply: “you will not use robots, scripts or other automated methods to complete the Services”.

On one hand, I can understand that Amazon’s vision for Mechanical Turk, like Luis von Ahn’s “games with a purpose”, explicitly aims to apply human intelligence to tasks where automated methods seem inadequate. On the other hand, what are automated methods but encapsulations of human methods? It seems odd for Amazon to be so particular about the human / machine distinction, especially given that the terms of service impose practically no other constraints on execution (beyond the obvious legal ones). Moreover, Mechanical Turk offers developers a variety of ways to assure quality (redundancy, qualification tests, etc.).

Granted, there are some important concerns that would have to be addressed if Amazon were to relax the “humans-only” constraint. For example, a developer today can reasonably assume that two different human “Providers” execute tasks independently. With automated participation, there’s a far greater risk of dependence–e.g., from multiple programmers applying the same algorithms. This possibility would have to be taken into account in quality assurance.

Still, the benefits of allowing automated participants would seem to far outweigh the risks. At pennies a task, Mechanical Turk has limited appeal to the human labor force–indeed, research by Panos Ipeirotis suggests that Amazon’s revenue from the service may be so low that it doesn’t even cover the cost of a single dedicated developer!

In contrast, there’s evidence that programmers would take an interest in participating, were it an option. Marketplaces like TopCoder and competitions like the Netflix Prize suggest that computer scientists enjoy proving their mettle on many of the kinds of tasks for which organizations already use Mechanical Turk.

So, why not give algorithms a chance? Surely we’re not that afraid of Skynet or the “technological singularity“. Let’s give machines–and their programmers–a chance to show off the best of both worlds!


Guest Demo: Eric Iverson’s Itty Bitty Search

I’m back from vacation, and still digging my way out of everything that’s piled up while I’ve been offline.

While I catch up, I thought I’d share with you a demo that Eric Iverson was gracious enough to share with me. It uses Yahoo! BOSS to support an exploratory search experience on top of a general web search engine.

When you perform a query, the application retrieves a set of related term candidates using Yahoo’s key terms API. It then scores each term by dividing its occurrence count within the result set by its global occurrence count–a relevance measure similar to one my former colleagues and I used at Endeca in enterprise contexts.
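
If it helps to see that scoring idea spelled out, here is a minimal Python sketch of the local-over-global ratio. The function and its inputs are my own illustration, not Eric’s code or the actual Yahoo! API:

```python
from collections import Counter

def score_refinement_terms(result_key_terms, global_counts):
    """Score candidate refinement terms for a result set.

    result_key_terms: a list of key-term lists, one per result document
                      (stand-ins for whatever the demo retrieves from the
                      key terms service).
    global_counts:    a dict mapping each term to its corpus-wide
                      occurrence count.
    """
    local_counts = Counter(t for terms in result_key_terms for t in terms)
    scores = {}
    for term, local in local_counts.items():
        global_count = global_counts.get(term, 0)
        if global_count > 0:
            # Terms common in the result set but rare overall score highest:
            # a simple local-over-global relevance ratio.
            scores[term] = local / global_count
    # Return terms ranked by score, best first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example:
# score_refinement_terms(
#     [["information retrieval", "clustering"], ["information retrieval"]],
#     {"information retrieval": 10, "clustering": 1000},
# )
# -> [('information retrieval', 0.2), ('clustering', 0.001)]
```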

You can try out the demo yourself at http://www.ittybittysearch.com/. While it has rough edges, it produces nice results–especially considering the simplicity of the approach.

Here’s an example of how I used the application to explore and learn something new. I started with [“information retrieval”]. I noticed “interactive information retrieval” as a top term, so I used it to refine. Most of the refinement suggestions looked familiar to me–but an unfamiliar name caught my attention: “Anton Leuski”. Following my curiosity, I refined again. Looking at the results, I immediately saw that Leuski had done work on evaluating document clustering for interactive information retrieval. Further exploration made it clear this is someone whose work I should get to know–check out his home page!

I can’t promise that you’ll have as productive an experience as I did, but I encourage you to try Eric’s demo. It’s simple examples like these that remind me of the value of pursuing HCIR for the open web.

Speaking of which, HCIR 2010 is in the works. We’ll flesh out the details over the next few weeks, and of course I’ll share them here.


Vacation

Just letting readers know that I’ll be on vacation for the next week. If you are starved for reading materials, check out some of the blogs I read.


WSDM 2010: Day 3

Note: this post is cross-posted at BLOG@CACM.

Today is the last day of WSDM 2010, and I unfortunately spent it at home drinking chicken soup. But I’ve been following the conference via the proceedings and tweets.

The day started with a short session on temporal interaction. Topics included clustering social media documents (e.g., Flickr photos) based on their association with events, statistical tests for early identification of popular social media content, and analysis of answers sites (like Yahoo! Answers) as evolving two-sided economic markets.

The next session focused on advertising. Two papers addressed click prediction: one proposing a Bayesian inference model to better predict click-throughs in the tail of the ad distribution; the other presenting a framework for personalized click models. Another paper addressed the closely related problem of predicting ad relevance. The remaining papers discussed other aspects of search advertising: one on estimating the value per click for channels like Google AdSense, where ad inventory is supplied by a third party; the other proposing an algorithmic approach to automate online ad campaigns based on landing page content.

The following session was on systems and efficiency, a popular topic given the immense data and traffic associated with web search. Two papers proposed approaches to help short-circuit ranking computations: one by optimizing the organization of inverted index entries to consider both the static ranks of documents and the upper bounds of term scores for all terms contained in each document; the other using early-exit strategies to optimize ensemble-based machine learning algorithms. Another used machine learning to mine rules for de-duplicating web pages based on URL string patterns. Another focused on compression, showing that web content is at least an order of magnitude more compressible than what gzip achieves. The last paper proposed a method to perform efficient distance queries on graphs (e.g., web graphs or social graphs) by pre-computing a collection of node-centered subgraphs.

The last session of the conference discussed various topics in web mining. One presented a system for identifying distributed search bot attacks. Another proposed an image search method using a combination of entity information and visual similarity. The final paper showed that shallow text features can be used for low-cost detection of boilerplate text in web documents.

All in all, WSDM 2010 was an excellent conference, and I’m sad not to have been able to attend more of it in person. I’m delighted to see an even mix of academic and industry representatives sharing ideas and working to make the web a better place for information access.


WSDM 2010: Day 2

Note: this post is cross-posted at BLOG@CACM.

Unfortunately, I woke up this morning rather under the weather, so I’m having to resort to remotely reporting on the second day of the WSDM 2010 conference, based on the published proceedings and the tweet stream.

The day started with a keynote from Harvard economist Susan Athey. Her research focuses on the design of auction-based markets, a topic at the core of the search business, which relies largely on auction-based advertising models (cf. Google AdWords). Then came a session on learning and optimization. One paper proposed a method to learn ranking functions and query categorization simultaneously, reflecting that different categories of queries lead users to have different expectations about ranking. Another combined traditional list-based ranking with pair-wise comparisons between results to separate the results into tiers reflecting grades of relevance. An intriguing approach to query recommendation treated it as an optimization problem, perturbing users’ query-reformulation paths to maximize the expected value of a utility function over the search session. Another paper looked not at ranking per se, but rather at improving the quality of the training data used for machine-learned ranking. The final paper of the session, which earned a best-paper nomination, modeled document relevance based not on click-through behavior, but rather on post-click user behavior.

The next session was about users and measurement. It opened with another best-paper nominee: an analysis of over a hundred million users to understand how they re-find web content. Another offered a rigorous analysis of the often sloppily presented “long-tail” hypothesis: it found that light users disproportionately prefer content at the head of the distribution, while heavy users disproportionately prefer the tail. Another log-analysis paper analyzed search logs using a partially observable Markov model (a variant of the hidden Markov model in which not all of the hidden state transitions emit observable events) and compared the latent variables with eye-tracking studies. An intriguing study demonstrated that user behavior models are more predictive of goal success than models based on document relevance. The final paper of the session proposed methods for quantifying the reusability of the test collections that lie at the heart of information retrieval evaluation.

The last session of the day focused on social aspects of search. Two of the papers were concerned with modeling authority and influence in social networks, a problem in which I take a deep personal interest. Another inferred attributes of social network users based on those of other users in their communities (cf. MIT’s Project Gaydar). Another analyzed Flickr and Last.fm user logs to show that users’ semantic similarity based on their tagging behavior is predictive of social links. The final paper tackled the sparsity of social media tags by inferring latent topics from shared tags and spatial information.

Not surprisingly, a disproportionate number of contributors to the conference work at major web search companies, who have both the motivation to improve results and the access to data that is needed for such research. One of the ongoing research challenges for the field is to find ways to make this data available to others while respecting the business concerns of search engine companies and the privacy concerns of their users.


WSDM 2010: Day 1

Note: this post is cross-posted at BLOG@CACM.

Today was the first day of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), held at the Polytechnic Institute of NYU in Brooklyn, NY. WSDM is a young conference that has already become a top-tier publication venue for research in these areas. In contrast to some of the larger conferences, WSDM is single-track and feels more intimate and coherent–even with over 200 attendees.

The day started with an ambitious keynote by Soumen Chakrabarti (IIT Bombay): “Bridging the Structured Un-Structured Gap”. He described a soup-to-nuts architecture to annotate web documents and perform complex reasoning on them using a structured query language. But perhaps this ambitious approach is a practical one: it uses the web we have–as opposed to waiting for the semantic web to emerge–and there is a prototype using half a billion documents.

The first paper session focused on web search. Of the five papers, two emphasized temporal aspects of content, one considered social media recommendation, and one focused on identifying concepts in multi-word queries. The last paper of the session proposed using anchor text as a more widely available input than query logs to support the query reformulation process. It also attracted the most audience attention–while interaction is often a niche topic at information retrieval conferences, it always elicits strong interest and opinions.

The following session focused on tags and recommendations. Some take-aways: users produce tags similar to the topics designed by experts; individual “personomies” can be translated into aggregated folksonomies; matrix factorization methods can produce interpretable recommendations.

The last session of the day covered information extraction. One of the papers used pattern-based information extraction approaches, demonstrating how far we’ve come since Marti Hearst’s seminal work on the subject. Another offered a SQL-like system for typed-entity search, complete with a live, publicly accessible prototype. The final paper addressed an issue that came up repeatedly at the SSM workshop: the problem of distilling the truth from a collection of inconsistent sources.

After a full day of talks, we headed to The Park for an excellent banquet. I’m looking forward to another two days of great sessions.

Report on the Third Workshop on Search and Social Media (SSM 2010)

Note: this post is cross-posted at BLOG@CACM.

It is my pleasure to report on the 3rd Annual Workshop on Search in Social Media (SSM 2010), a gathering of information retrieval and social media researchers and practitioners in an area that has captured the interest of computer scientists, social scientists, and even the broader public. The one-day workshop took place at the Polytechnic Institute of NYU in Brooklyn, NY, co-located with the ACM Conference on Web Search and Data Mining (WSDM 2010). The quality of the presenters, the overbooked registration, and the hundreds of live tweets with the #ssm2010 hashtag all attest to the success of this event.

The workshop opened with a warm welcome from Ian Soboroff (NIST), immediately followed by a keynote from Jan Pedersen, Chief Scientist of Bing Search. Jan established a clear business case for search in social media: the opportunity to deliver content that is fresh, local, and under-served by general web search. He drilled into particular types of content where social media search is most useful: expert opinions, breaking news, and tail content. The benefits of social media search include trust and personal interaction (as compared to web content that is often soulless and of uncertain provenance), low latency (though perhaps at the cost of accuracy), and access to niche or ephemeral information that web search rarely surfaces. But delivering social media results to searchers creates its own variety of challenges, such as weighing freshness against accuracy and relevance, coping with loss of social content’s conversational context, managing low update latency when search engines have not been optimized for it, and fighting new kinds of spam. Despite these challenges, it is clear that the major web search engines have embraced the brave new world of real-time social content.

Eugene Agichtein (Emory University) then moderated a panel representing the world’s leading search engines: Jeremy Hylton (Google), Matthew Hurst (Microsoft), Sihem Amer-Yahia (Yahoo!), and William Chang (Baidu). Jeremy justified the universal interface approach, pointing out that users don’t want to have to figure out what kind of search site to use for their queries, and that they expect a familiar interface. He also noted that Google has made great strides on update latency: it can index the Twitter firehose in the same amount of time as it takes to serve a query. Matthew offered various analyses of the social search problem, based on whether the information signal resides in content (e.g., web) or attention (e.g., Twitter), and on whether the information need is expressed in an explicit search query or inferred from the user’s context. Sihem offered a counter-point to Jeremy, arguing that social media search queries often represent broad or vague information needs, and thus call for a more browsing-oriented interface than web search, which is optimized for highly specific needs. William noted that the biggest competitive threat he sees for web search engines comes from social media players–and he credits much of Baidu’s success to its surfacing of social media content.

Then came a flurry of questions, perhaps the most interesting of which was how to address identity management. William argued that people prefer interacting with real-named (or pseudonymous) people to whom they are directly connected. Sihem offered the counter-example of obtaining recommendations through community aggregation. Matthew noted the incongruity of there being no economic relationship between the social network companies that maintain proprietary social graphs and the people whose identities and relationships those graphs represent. Jeremy pointed out that users benefit if the data is as open as possible.

Given the almost even split between academic and industry participation in the workshop, the panelists were also asked to present research challenges to academia. Jeremy posed the problem of determining when social media results are actually true. Matthew wants to see more interdisciplinary work between computer scientists and social scientists. Sihem offered two challenge problems: scalable community discovery and evaluation of collaborative recommendation systems. William wants to see a rigorous axiomatization of social media search behavior.

After lunch, Jeremy Pickens (FXPAL) moderated a panel representing social media / networking companies: Hilary Mason (bit.ly), Igor Perisic (LinkedIn), and David Hendi (MySpace). Hilary noted that, while bit.ly does not have access to an explicit social graph, it captures implicit connections from user behavior that may not be represented in the graph. Jeremy asked the panelists how much a person’s extended network matters; David and Igor pointed out research indicating correlations of mood and even medical conditions between people and their third-degree connections. Again, the audience was full of questions, especially for Igor. As a fan of faceted search, I was glad to see him touting LinkedIn’s success in making faceted search the primary means of performing people search on the site. For an in-depth view, I recommend “LinkedIn Search: A Look Beneath the Hood“.

The afternoon continued with a poster / demo session emphasizing work in progress: tools, interfaces, research studies, and position papers. I particularly enjoyed listening to the stream of interaction between academic researchers and industry practitioners.

The final panel session assembled academic researchers to discuss their views of the challenges in social media. Gene Golovchinsky (FXPAL) moderated a panel composed of Meena Nagarajan (Wright State University), Liangjie Hong (Lehigh University), Richard McCreadie (University of Glasgow), Jonathan Elsas (CMU), and Mor Naaman (Rutgers University). Meena highlighted the need to build up meta-data to describe the context around social utterances. Liangjie took a position similar to William Chang’s, calling for a framework to model the tasks and behavior of users who interact with social media. Richard focused on the intersection of social media and news search, and noted that some of the most useful information is private and proprietary (e.g., search and chat logs). Jonathan offered a variety of challenges: determining the right retrieval granularity, managing multiple axes of organization, aggregating author behavior, and multidimensional indexing of social media content. Finally, Mor noted that we’re moving from a world of email to a “social awareness stream”, in which we direct content at a group and have lower expectations of readership than with email. As with all of the panels, there were countless questions from the moderator and audience, particularly about determining the truthfulness of social media content and delivering social content in an effective user interface.

The final session of the workshop was a full-group discussion that dived into the various topics addressed throughout the day. But Gene Golovchinsky provided the “one more thing” at the end, showing us a glimpse of a faceted search interface for exploring a Twitter stream. It was an elegant finish to a day filled with informative and engaging discussion, and I look forward to seeing many of the participants at the WSDM conference over the next few days.


Blogging SSM 2010 and WSDM 2010

I’m delighted to report that I’ll be blogging about the Search and Social Media Workshop (SSM 2010) and the Web Search and Data Mining Conference (WSDM 2010) for Communications of the ACM.

Of course, I’ll cross-post here. I also encourage folks to follow the live tweet streams at #ssm2010 and #wsdm2010, as well as Gene and Jeremy’s posts at the FXPAL blog.

To those attending: see you all tomorrow through Saturday! To everyone else: I will try my best to communicate the substance and spirit of the conference.


Blogs I Read: Search Facets

A couple of years ago, I started The Noisy Channel as a personal blog. Since my then-employer Endeca didn’t have a corporate blog, I became the company’s ambassador to the blogosphere, despite my protests that this was not a corporate blog.

But I’m pleased to report that Endeca now has its own blog, aptly entitled Search Facets. I’m not usually a fan of corporate blogs, but I like the approach Endeca is taking with this one. The folks who have posted so far are Adam Ferrari (CTO), Vladimir Zelevinsky (Research Scientist), and Pete Bell (Co-Founder)–an indication that the blog will contain substance, rather than warmed-over press releases.

Indeed, the posts so far are nice and meaty. I particularly like Adam’s post about “Vertical stores for vertical web search?”–it’s nice to read intelligent analysis from someone who understands the strengths of both MapReduce and column-oriented relational databases.

Anyway, I’m delighted that my former co-workers have taken to the blogosphere, and I look forward to reading what they have to say!


LinkedIn Search: A Look Beneath the Hood

Last week, I had the good fortune to attend a presentation by John Wang, search architect at LinkedIn. You may have read my earlier posts about LinkedIn introducing faceted search and celebrating the interface from a user perspective. John’s presentation at the SDForum took a developer’s perspective, discussing the challenges of combining faceted search and social networking at scale.

John was kind enough to publish his slides, and I’ve embedded them above. Unfortunately, there’s no recording of the extensive Q&A (which included various attempts to get John to reveal the precise details of LinkedIn’s data volume), but the slides are quite meaty.

Personally, I learned two surprising things from the talk.

First, I was surprised that LinkedIn dismisses index/cache warming as “cheating”, instead computing almost everything in real time. Specifically, I would have expected LinkedIn to cache information like a user’s set of degree-two connections: these are expensive to compute at query time, especially when the social graph is distributed and sharded by user. I did ask John whether LinkedIn recomputes a user’s degree-two network during a session, and he admitted that LinkedIn is sensible enough to “cheat” and not perform this expensive but almost useless re-computation.
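
To make that cost concrete, here is a rough sketch of what computing a degree-two network involves. The function names are hypothetical and this is my own illustration, not LinkedIn’s code:

```python
def degree_two_network(user, get_connections):
    """Compute a user's second-degree connections.

    get_connections(u) returns u's direct connections. In a social graph
    sharded by user, each call may be a remote lookup, which is what makes
    doing this at query time expensive. Toy illustration only.
    """
    first_degree = set(get_connections(user))
    second_degree = set()
    for connection in first_degree:
        # One fan-out per first-degree connection: the cost grows with the
        # size of the user's network, and each hop may cross shards.
        second_degree.update(get_connections(connection))
    # Exclude the user and their direct connections from the result.
    second_degree -= first_degree
    second_degree.discard(user)
    return second_degree

# Example with an in-memory graph:
# graph = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
# degree_two_network("alice", lambda u: graph.get(u, []))  # -> {"carol"}
```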

Second, I learned about reference search, a feature I may have missed because it is only available for premium LinkedIn accounts. It’s a nice feature, allowing you to search against company + date range pairs. People who are familiar with implementing faceted search may recognize the preservation of such associations between facet values as a gnarly implementation challenge.
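
To see why preserving that pairing matters, here is a toy sketch (my own names and structures, not LinkedIn’s) in which the company and date range stay attached to the same position, rather than being flattened into independent facets:

```python
from dataclasses import dataclass
from typing import List, Tuple

# If profiles were flattened into separate "company" and "years" facets,
# we would lose which years go with which company; keeping them paired
# lets us answer "worked at X during Y" correctly.

@dataclass
class Position:
    company: str
    start_year: int
    end_year: int

@dataclass
class Profile:
    name: str
    positions: List[Position]

def worked_at_during(profiles: List[Profile], company: str,
                     years: Tuple[int, int]) -> List[Profile]:
    """Return profiles with a position at `company` overlapping `years`."""
    lo, hi = years
    return [
        p for p in profiles
        if any(pos.company == company
               and pos.start_year <= hi and pos.end_year >= lo
               for pos in p.positions)
    ]

# profiles = [Profile("Ada", [Position("Endeca", 2005, 2008)])]
# worked_at_during(profiles, "Endeca", (2007, 2009))  # -> matches Ada
```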

All in all, it was a treat to get this look under the hood, as well as to finally meet John in person. I also ran into Gene Golovchinsky there–so much for my spending a few days on the west coast incognito!

In any case, I’m looking forward to seeing Gene, some of John’s colleagues, and many more interesting people at the Search and Social Media Workshop (SSM 2010) on Wednesday. My apologies to those who aren’t able to attend this oversubscribed event. I promise to blog about it!