The Noisy Channel

 

Google and Transparency

March 7th, 2010 by Daniel Tunkelang

Let me preface this post with a clear disclaimer: I work at Google, but the views I express on this blog are my own personal views.

Last week, Google head of webspam Matt Cutts posted a full-throated defense of Google’s transparency on Google’s European Policy Blog in response to complaints that a few companies raised to the European Commission. Long-time readers of my blog know that I’m a big fan of search engine transparency and have made my own calls on this blog for Google to be more transparent. The fact that I work at Google now doesn’t change my values. But being on the inside has informed my perspective.

In particular, as Matt elaborates in his post, Google deserves more credit for transparency than it often gets from its critics. For example, Google has published:

He goes on to describe the various webmaster tools and social media resources that Google has made available. The popularity of these tools is a testament to their utility.

Still, as Matt points out:

we don’t think it’s unreasonable for any business to have some trade secrets, not least because we don’t want to help spammers and crackers game our system. If people who are trying to game search rankings knew every single detail about how we rank sites, it would be easier for them to ’spam’ our results with pages that are not relevant and are frustrating to users — including porn and malware sites.

As I blogged back in 2008, I still hope that someday we won’t have to rely on a relevance analog of security through obscurity in order to deter spam and abusive SEO practices. But I recognize that we haven’t yet developed a transparent alternative, and hence that complete transparency today for web search ranking algorithms would have a far greater downside than upside for ordinary users.

I suspect that a prerequisite for complete transparency in search is moving from a ranking-based retrieval approach to a set-based approach. For many web search information needs (e.g., navigational queries), it’s hard to see how users would benefit from such a radical change. For queries that represent more exploratory information needs, a set-based approach would be (at least in my view) far preferable to one based on ranking. But there’s a lot of work to do on the content side before such exploratory interfaces for the web are usable.

In summary, I’m happy to see Matt taking a public stand in Google’s defense. I don’t always agree with my employer’s decisions, but I do believe that my colleagues act in good faith and with good intentions. I understand how many people–especially site owners–fixate on whatever Google keeps secret. In a world where so many people compete for attention, information is power. Google tries to provide maximum quality to users while keeping the playing field level for site owners. As Google Fellow Amit Singhal points out, “this stuff is tough”.


Not All Queries Are Created Equal

March 7th, 2010 by Daniel Tunkelang

A topic with which I developed an obsession in my last few years at Endeca is understanding how to predict query difficulty and performance–performance in the information retrieval sense meaning results quality, not computational efficiency. If only we knew how well a search engine would do–or did–in meeting the user’s information need, we might adapt the user experience to reflect our degree of confidence.

I was particularly interested in work related to the query clarity score initially proposed by Steve Cronen-Townsend, Yun Zhou, and Bruce Croft in a 2002 paper entitled “Predicting Query Performance”. But there is a wide variety of work in this area, including methods to predict performance either before or after results retrieval.
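For readers unfamiliar with the clarity score, here is a minimal sketch of the idea: it measures the KL divergence between a language model estimated from a query's top results and the language model of the whole collection. The function names and the toy collection are hypothetical, and the sketch makes simplifying assumptions: it pools the top documents with uniform weights and uses unsmoothed relative frequencies, whereas the paper weights documents by their query likelihood and smooths the estimates.

```python
import math
from collections import Counter


def language_model(docs):
    """Unsmoothed unigram model: relative term frequency over pooled docs."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def clarity_score(top_docs, collection_docs):
    """KL divergence (in bits) between the top-results model and the
    collection model. Assumes top_docs are drawn from collection_docs,
    so every term in the query model also has a collection probability."""
    query_model = language_model(top_docs)
    collection_model = language_model(collection_docs)
    return sum(
        p * math.log2(p / collection_model[term])
        for term, p in query_model.items()
    )


# Toy collection (hypothetical): two car documents, two animal documents.
collection = [
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine", "repair"],
    ["cat", "jungle", "river"],
]

# A focused query whose top results are all about cars.
focused = clarity_score([collection[0], collection[2]], collection)

# An unfocused query whose results look just like the collection as a whole.
unfocused = clarity_score(collection, collection)
```

A higher score suggests that the results use language that is distinctive relative to the collection; a score near zero suggests the results are indistinguishable from a random sample, which is one signal of an ambiguous or difficult query.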

Happily, Claudia Hauff just published a dissertation on this topic, entitled “Predicting the Effectiveness of Queries and Retrieval Systems”. It is very well written, and I recommend it to anyone interested in learning more about this subject. She presents not only her own original research, but also a comprehensive analysis of others’ efforts.

Here is an excerpt from the abstract:

In this thesis we consider users’ attempts to express their information needs through queries, or search requests and try to predict whether those requests will be of high or low quality. Intuitively, a query’s quality is determined by the outcome of the query, that is, whether the retrieved search results meet the user’s expectations. The second type of prediction methods under investigation are those which attempt to predict the quality of search systems themselves. Given a number of search systems to consider, these methods estimate how well or how poorly the systems will perform in comparison to each other.

I look forward to seeing researchers continue to build on these results, and I am excited for the day when search engines are more reflective about their own strengths and weaknesses.


HCIR 2010: A Pre-Announcement

March 7th, 2010 by Daniel Tunkelang

We’re gearing up to officially announce the HCIR 2010 workshop, but I wanted to give folks here a heads up, as well as to put out a call for a volunteer.

The Fourth Workshop on Human-Computer Interaction and Information Retrieval will take place on August 22nd at Rutgers University in New Brunswick, NJ. It will be an independent workshop as in previous years, but this year we are co-locating it with the Third Information Interaction in Context conference (IIiX 2010).

We’ve already lined up Google’s Dan Russell as a keynote speaker and are close to circulating a call for participation. We’re also planning to introduce something new to the workshop this year: an HCIR challenge! Participants will build applications around a specific data set that demonstrate the use of HCIR techniques. We’ll announce the data set and the challenge details as soon as we’ve confirmed the licensing details.

Meanwhile, we’re looking for a volunteer to help us build a baseline index for the challenge data set. Participants will be allowed–but not required–to use this index as a starting point for their entries. The volunteer should be comfortable using the open-source packages Lucene or Solr. If you are interested in being that volunteer, please let me know, and I’ll be happy to share more details.


You Can’t Hurry Relevance

February 28th, 2010 by Daniel Tunkelang

Lately, I’ve been musing about the Herb Simon quote that launched–or at least popularized–the concepts of information overload and attention economics:
in an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it (Simon, 1971)

I hope everyone agrees that attention is a scarce good. But I’m curious how people measure it. After all, if we’re going to talk about an economic good being scarce, we ought to quantify it!

One approach is to measure attention at a specific moment in time, measuring how much of our instantaneous cognitive capacity we devote to a task. This approach is useful for evaluating a user interface–in particular, for determining how users allocate their attention among the various interface elements. Another approach is to measure attention in units of time, e.g., how many of our waking hours do we devote to a particular activity. This latter strikes me as more of what Herb Simon had in mind.

We can interpret the two definitions as equivalent–after all, cumulative attention devoted to a task is simply the sum (or integral) of instantaneous attention over time. But thinking this way misses a key consideration: we pay a significant price for context switching.

A familiar example is email. The total time we spend reading email is a productivity concern, but the larger concern for many of us is the frequency with which email causes us to interrupt our workflow. Knowing this, I made a brief attempt in 2008 to check email only once a day. Unfortunately, this approach would have violated too many of my peers’ expectations. I returned to status quo, reading my email (or at least scanning headers) as it arrives. Other messaging tools, such as instant messaging and Twitter, only add to the challenge of managing our personal communication flow.

Of course, what I really want is for my messaging tools to distinguish urgent messages from non-urgent ones, and to only interrupt my workflow for the former. I know that no system, whether based on manual filtering or algorithmic analysis, can make this subjective classification with 100% accuracy, but I’d certainly accept a handful of false positives in exchange for far fewer interruptions. I suspect I’m not alone.

Moreover, this approach extends beyond personal communications to more public ones, such as social media platforms and even web search. On one hand, the passing of time offers an opportunity to accumulate reliable content analysis; on the other hand, we don’t want to miss time-sensitive content just because the system waited too long to determine the content’s relevance to our information needs. Still, the low signal-to-noise ratio on social media platforms suggests to me that many information consumers would be amenable to a different tradeoff than the one we experience today.

What I’d really like to see is systems take advantage of the differences in users’ personal senses of urgency. Some examples:

  • A widely broadcast email isn’t delivered all at once, but first goes to users with higher urgency settings. If those users mark it as spam, the email arrives already marked as spam for users with lower urgency settings. Conversely, if enough high-urgency users mark it as important, then it may be sent to lower-urgency users sooner.
  • High-urgency users frequently check news sites and blogs. If an article attracts a threshold level of engagement from high-urgency users, then low-urgency users are notified. This approach could apply to general news or to news in a specific topic that the user follows.
  • Same as above, but applied to activity feeds and based on engagement within your social network. But again, high-urgency users lead the way, seeing updates sooner but at the price of experiencing a noisier stream.
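The first example above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Message` class, the `route_to_low_urgency` function, the outcome labels, and both threshold values are hypothetical, and a real system would need to handle timing, trust, and abuse.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """A broadcast message plus the feedback gathered from high-urgency users."""
    text: str
    spam_votes: int = 0
    important_votes: int = 0


def route_to_low_urgency(message, high_urgency_recipients,
                         spam_threshold=0.5, important_threshold=0.3):
    """Decide how low-urgency users receive a message, based on the
    fraction of high-urgency recipients who flagged it. The thresholds
    are arbitrary illustrative values."""
    if high_urgency_recipients == 0:
        return "hold"  # no feedback yet, so keep waiting
    spam_rate = message.spam_votes / high_urgency_recipients
    important_rate = message.important_votes / high_urgency_recipients
    if spam_rate >= spam_threshold:
        return "deliver_as_spam"
    if important_rate >= important_threshold:
        return "deliver_now"
    return "deliver_in_digest"


newsletter = Message("Weekly deals!", spam_votes=6)
outage = Message("Site outage in progress", important_votes=4)
routine = Message("Minutes from Tuesday's meeting", important_votes=1)

decisions = [route_to_low_urgency(m, 10) for m in (newsletter, outage, routine)]
```

The design choice worth noting is that high-urgency users act as the filter: they see everything first and pay for it with a noisier stream, while low-urgency users trade latency for fewer interruptions.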

To some extent, our existing systems already approximate this approach. Mechanisms like favoriting and re-tweeting propagate signal from information scouts to their followers, as do algorithms that rank real-time information based on engagement. Still, as an information consumer, I’d appreciate an interface that explicitly and transparently adapts to my priorities, and that manages interruption of my workflow accordingly.

What do folks here think? Is information delayed tantamount to information denied? Or is time on our side, potentially offering us a better tradeoff than the one we experience today?


Holding Back the Rise of the Machines?

February 20th, 2010 by Daniel Tunkelang

Amazon’s Mechanical Turk is one of my favorite examples of leveraging the internet for innovation:

Amazon Mechanical Turk is a marketplace for work that requires human intelligence. The Mechanical Turk web service enables companies to programmatically access this marketplace and a diverse, on-demand workforce. Developers can leverage this service to build human intelligence directly into their applications.

But, in my view, Mechanical Turk does not take its vision far enough. In the conditions of use, Amazon makes it clear that only human participation need apply: “you will not use robots, scripts or other automated methods to complete the Services”.

On one hand, I can understand that Amazon’s vision for Mechanical Turk, like Luis von Ahn’s “games with a purpose”, explicitly aims to apply human intelligence to tasks where automated methods seem inadequate. On the other hand, what are automated methods but encapsulations of human methods? It seems odd for Amazon to be so particular about the human / machine distinction, especially given that the terms of service impose practically no other constraints on execution (beyond the obvious legal ones). Moreover, Mechanical Turk offers developers a variety of ways to assure quality (redundancy, qualification tests, etc.).

Granted, there are some important concerns that would have to be addressed if Amazon were to relax the “humans-only” constraint. For example, a developer today can reasonably assume that two different human “Providers” execute tasks independently. With automated participation, there’s a far greater risk of dependence–e.g., from multiple programmers applying the same algorithms. This possibility would have to be taken into account in quality assurance.

Still, the benefits of allowing automated participants would seem to far outweigh the risks. At pennies a task, Mechanical Turk has a limited appeal to the human labor force–indeed, research by Panos Ipeirotis suggests that Amazon’s revenue from the service may be so low that it doesn’t even cover the costs of a single dedicated developer!

In contrast, there’s evidence that programmers would take an interest in participation, were it an option. Marketplaces like TopCoder and competitions like the Netflix Prize suggest that computer scientists take an interest in proving their mettle in many of the kinds of tasks for which organizations already use Mechanical Turk.

So, why not give algorithms a chance? Surely we’re not that afraid of Skynet or the “technological singularity”. Let’s give machines–and their programmers–a chance to show off the best of both worlds!


Guest Demo: Eric Iverson’s Itty Bitty Search

February 16th, 2010 by Daniel Tunkelang

I’m back from vacation, and still digging my way out of everything that’s piled up while I’ve been offline.

While I catch up, I thought I’d share with you a demo that Eric Iverson was gracious enough to share with me. It uses Yahoo! BOSS to support an exploratory search experience on top of a general web search engine.

When you perform a query, the application retrieves a set of related term candidates using Yahoo’s key terms API. It then scores each term by dividing its occurrence count within the result set by its global occurrence count–a relevance measure similar to one my former colleagues and I used at Endeca in enterprise contexts.
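The scoring step described above is simple enough to sketch. The code below is not Eric's implementation and does not call Yahoo! BOSS; the function name, the toy data, and the choice to count each term once per result document are all assumptions made for illustration.

```python
from collections import Counter


def score_terms(result_docs, global_counts, min_global=1):
    """Score each candidate term as
    (number of result documents containing it) / (its global count).
    Terms with a global count below min_global are skipped, which also
    avoids dividing by zero."""
    local_counts = Counter()
    for terms in result_docs:
        # Count each term at most once per result document (an assumption).
        local_counts.update(set(terms))

    scores = {}
    for term, local in local_counts.items():
        total = global_counts.get(term, 0)
        if total >= min_global:
            scores[term] = local / total
    # Highest-scoring terms first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical key terms extracted from three results for one query.
result_docs = [
    ["information retrieval", "interactive information retrieval"],
    ["information retrieval", "search engine"],
    ["interactive information retrieval", "search engine"],
]

# Hypothetical global occurrence counts for the same terms.
global_counts = {
    "information retrieval": 1000,
    "interactive information retrieval": 20,
    "search engine": 5000,
}

ranked = score_terms(result_docs, global_counts)
```

In this toy data all three terms appear in two results, so the ranking is driven entirely by the global counts: the rarest term globally comes out on top, which is the behavior that makes the measure useful for surfacing specific refinements rather than generic ones.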

You can try out the demo yourself at http://www.ittybittysearch.com/. While it has rough edges, it produces nice results–especially considering the simplicity of the approach.

Here’s an example of how I used the application to explore and learn something new. I started with ["information retrieval"]. I noticed “interactive information retrieval” as a top term, so I used it to refine. Most of the refinement suggestions looked familiar to me–but an unfamiliar name caught my attention: “Anton Leuski”. Following my curiosity, I refined again. Looking at the results, I immediately saw that Leuski had done work on evaluating document clustering for interactive information retrieval. Further exploration made it clear this is someone whose work I should get to know–check out his home page!

I can’t promise that you’ll have as productive an experience as I did, but I encourage you to try Eric’s demo. It’s simple examples like these that remind me of the value of pursuing HCIR for the open web.

Speaking of which, HCIR 2010 is in the works. We’ll flesh out the details over the next few weeks, and of course I’ll share them here.


Vacation

February 6th, 2010 by Daniel Tunkelang

Just letting readers know that I’ll be on vacation for the next week. If you are starved for reading materials, check out some of the blogs I read.


WSDM 2010: Day 3

February 6th, 2010 by Daniel Tunkelang

Note: this post is cross-posted at BLOG@CACM.

Today is the last day of WSDM 2010, and I unfortunately spent it at home drinking chicken soup. But I’ve been following the conference via the proceedings and tweets.

The day started with a short session on temporal interaction. Topics included clustering social media documents (e.g., Flickr photos) based on their association with events, statistical tests for early identification of popular social media content, and analysis of answers sites (like Yahoo! Answers) as evolving two-sided economic markets.

The next session focused on advertising. Two papers focused on click prediction: one proposing a Bayesian inference model to better predict click-throughs in the tail of the ad distribution; the other presenting a framework for personalized click models. Another paper addressed the closely related problem of predicting ad relevance. The remaining papers discussed other aspects of search advertising: one on estimating the value per click for channels like Google AdSense, where ad inventory is supplied by a third party; the other proposing an algorithmic approach to automate online ad campaigns based on landing page content.

The following session was on systems and efficiency, a popular topic given the immense data and traffic associated with web search. Two papers proposed approaches to help short-circuit ranking computations: one by optimizing the organization of inverted index entries to consider both the static ranks of documents and the upper bounds of term scores for all terms contained in each document; the other using early-exit strategies to optimize ensemble-based machine learning algorithms. Another used machine learning to mine rules for de-duplicating web pages based on URL string patterns. Another focused on compression, showing that web content is at least an order of magnitude more compressible than what can be achieved by gzip. The last paper proposed a method to perform efficient distance queries on graphs (e.g., web graphs or social graphs) by pre-computing a collection of node-centered subgraphs.

The last session of the conference discussed various topics in web mining. One presented a system for identifying distributed search bot attacks. Another proposed an image search method using a combination of entity information and visual similarity. The final paper showed that shallow text features can be used for low-cost detection of boilerplate text in web documents.

All in all, WSDM 2010 was an excellent conference, and I’m sad not to have been able to attend more of it in person. I’m delighted to see an even mix of academic and industry representatives sharing ideas and working to make the web a better place for information access.


WSDM 2010: Day 2

February 6th, 2010 by Daniel Tunkelang

Note: this post is cross-posted at BLOG@CACM.

Unfortunately, I woke up this morning rather under the weather, so I’m having to resort to remotely reporting on the second day of the WSDM 2010 conference, based on the published proceedings and the tweet stream.

The day started with a keynote from Harvard economist Susan Athey. Her research focuses on the design of auction-based markets, a topic core to the business of search, which largely relies on auction-based advertising models (cf. Google AdWords). Then came a session focused on learning and optimization. One paper proposed a method to learn ranking functions and query categorization simultaneously, reflecting that different categories of queries lead users to have different expectations about ranking. Another combined traditional list-based ranking with pair-wise comparisons between results to separate the results into tiers reflecting grades of relevance. An intriguing approach to query recommendation treated it as an optimization problem, perturbing users’ query-reformulation path to maximize the expected value of a utility function over the search session. Another paper looked not at ranking per se, but rather at improving the quality of training data for machine-learned ranking. The final paper of the session, which earned a best-paper nomination, modeled document relevance based not on click-through behavior, but rather on post-click user behavior.

The next session was about users and measurement. It opened with another best-paper nominee: an analysis of over a hundred million users to understand how they re-find web content. Another offered a rigorous analysis of the often sloppily presented “long-tail” hypothesis: it found that light users disproportionately prefer content at the head of the distribution while heavy users disproportionately prefer the tail. Another log-analysis paper analyzed search logs using a partially observable Markov model (a variant of the hidden Markov model in which not all of the hidden state transitions emit observable events) and compared the latent variables with eye-tracking studies. An intriguing study demonstrated that user behavior models are more predictive of goal success than models based on document relevance. The final paper of the session proposed methods for quantifying the reusability of the test collections that lie at the heart of information retrieval evaluation.

The last session of the day focused on social aspects of search. Two of the papers were concerned with modeling authority and influence in social networks, a problem in which I take a deep personal interest. Another inferred attributes of social network users based on those of other users in their communities (cf. MIT’s Project Gaydar). Another analyzed Flickr and Last.fm user logs to show that users’ semantic similarity based on their tagging behavior is predictive of social links. The final paper tackled the sparsity of social media tags by inferring latent topics from shared tags and spatial information.

Not surprisingly, a disproportionate number of contributors to the conference work at major web search companies, who have both the motivation to improve results and the access to data that is needed for such research. One of the ongoing research challenges for the field is to find ways to make this data available to others while respecting the business concerns of search engine companies and the privacy concerns of their users.


WSDM 2010: Day 1

February 5th, 2010 by Daniel Tunkelang

Note: this post is cross-posted at BLOG@CACM.

Today was the first day of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), held at the Polytechnic Institute of NYU in Brooklyn, NY. WSDM is a young conference that has already become a top-tier publication venue for research in these areas. In contrast to some of the larger conferences, WSDM is single-track and feels more intimate and coherent–even with over 200 attendees.

The day started with an ambitious keynote by Soumen Chakrabarti (IIT Bombay): “Bridging the Structured Un-Structured Gap”. He described a soup-to-nuts architecture to annotate web documents and perform complex reasoning on them using a structured query language. But perhaps this ambitious approach is a practical one: it uses the web we have–as opposed to waiting for the semantic web to emerge–and there is a prototype using half a billion documents.

The first paper session focused on web search. Of the five papers, two emphasized temporal aspects of content, one considered social media recommendation, and one focused on identifying concepts in multi-word queries. The last paper of the session proposed using anchor text as a more widely available input than query logs to support the query reformulation process. It also attracted the most audience attention–while interaction is often a niche at information retrieval conferences, it always elicits strong interest and opinions.

The following session focused on tags and recommendations. Some take-aways: users produce tags similar to the topics designed by experts; individual “personomies” can be translated into aggregated folksonomies; matrix factorization methods can produce interpretable recommendations.

The last session of the day covered information extraction. One of the papers used pattern-based information extraction approaches, demonstrating how far we’ve come since Marti Hearst’s seminal work on the subject. Another offered a SQL-like system for typed-entity search, complete with a live, publicly accessible prototype. The final paper addressed an issue that came up repeatedly at the SSM workshop: the problem of distilling the truth from a collection of inconsistent sources.

After a full day of talks, we headed to The Park for an excellent banquet. I’m looking forward to another two days of great sessions.

