
You Can’t Hurry Relevance

Lately, I’ve been musing about the Herb Simon quote that launched–or at least popularized–the concepts of information overload and attention economics:
in an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it (Simon, 1971)

I hope everyone agrees that attention is a scarce good. But I’m curious how people measure it. After all, if we’re going to talk about an economic good being scarce, we ought to quantify it!

One approach is to measure attention at a specific moment in time: how much of our instantaneous cognitive capacity do we devote to a task? This approach is useful for evaluating a user interface–in particular, for determining how users allocate their attention among the various interface elements. Another approach is to measure attention in units of time: how many of our waking hours do we devote to a particular activity? The latter strikes me as closer to what Herb Simon had in mind.

We can interpret the two definitions as equivalent–after all, cumulative attention devoted to a task is simply the sum (or integral) of instantaneous attention over time. But thinking this way misses a key consideration: we pay a significant price for context switching.
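
To make that concrete, here is one way to write it down (my notation, not Simon's): let a(t) be the instantaneous attention we devote to a task, and let c_k be the cost of the k-th interruption.

```latex
A_{\text{naive}} \;=\; \int_{t_0}^{t_1} a(t)\,dt
\qquad\text{vs.}\qquad
A_{\text{actual}} \;=\; \int_{t_0}^{t_1} a(t)\,dt \;+\; \sum_{k=1}^{n} c_k
```

The first accounting treats the two views as interchangeable; the switching-cost term is why they diverge in practice.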

A familiar example is email. The total time we spend reading email is a productivity concern, but the larger concern for many of us is the frequency with which email causes us to interrupt our workflow. Knowing this, I made a brief attempt in 2008 to check email only once a day. Unfortunately, this approach would have violated too many of my peers’ expectations. I returned to the status quo, reading my email (or at least scanning headers) as it arrives. Other messaging tools, such as instant messaging and Twitter, only add to the challenge of managing our personal communication flow.

Of course, what I really want is for my messaging tools to distinguish urgent messages from non-urgent ones, and to only interrupt my workflow for the former. I know that no system, whether based on manual filtering or algorithmic analysis, can make this subjective classification with 100% accuracy, but I’d certainly accept a handful of false positives in exchange for far fewer interruptions. I suspect I’m not alone.

Moreover, this approach extends beyond personal communications to more public ones, such as social media platforms and even web search. On one hand, the passing of time offers an opportunity to accumulate reliable content analysis; on the other hand, we don’t want to miss time-sensitive content just because the system waited too long to determine the content’s relevance to our information needs. Still, the low signal-to-noise ratio on social media platforms suggests to me that many information consumers would be amenable to a different tradeoff than the one we experience today.

What I’d really like is to see systems take advantage of the differences in users’ personal senses of urgency. Some examples (a rough sketch in code follows the list):

  • A widely broadcast email isn’t delivered all at once, but first goes to users with higher urgency settings. If those users mark it as spam, then it’s already flagged as spam by the time it reaches users with lower urgency settings. Conversely, if enough high-urgency users mark it as important, then it may be sent to lower-urgency users sooner.
  • High-urgency users frequently check news sites and blogs. If an article attracts a threshold level of engagement from high-urgency users, then low-urgency users are notified. This approach could apply to general news or to news on a specific topic that the user follows.
  • Same as above, but applied to activity feeds and based on engagement within your social network. Again, high-urgency users lead the way, seeing updates sooner but at the price of experiencing a noisier stream.
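
Here’s a minimal sketch of the broadcast-email example in code. The simplifications (one vote per high-urgency recipient, fixed thresholds) are mine, not a description of any existing system:

```python
from dataclasses import dataclass, field

@dataclass
class Recipient:
    name: str
    urgency: str  # "high" recipients see broadcasts first; "low" recipients wait

@dataclass
class Broadcast:
    subject: str
    spam_votes: int = 0
    important_votes: int = 0
    delivered_to: list = field(default_factory=list)

# Hypothetical thresholds; a real system would tune these per user.
SPAM_THRESHOLD = 2
IMPORTANT_THRESHOLD = 2

def first_wave(broadcast, recipients):
    """Deliver the broadcast only to high-urgency recipients."""
    for r in recipients:
        if r.urgency == "high":
            broadcast.delivered_to.append(r.name)

def record_vote(broadcast, is_spam):
    """High-urgency recipients judge the message; their votes gate phase two."""
    if is_spam:
        broadcast.spam_votes += 1
    else:
        broadcast.important_votes += 1

def second_wave(broadcast, recipients):
    """Deliver to low-urgency recipients only if the message survived phase one."""
    if broadcast.spam_votes >= SPAM_THRESHOLD:
        return "suppressed"   # arrives pre-marked as spam for low-urgency users
    if broadcast.important_votes >= IMPORTANT_THRESHOLD:
        for r in recipients:
            if r.urgency == "low":
                broadcast.delivered_to.append(r.name)
        return "promoted"     # enough high-urgency users flagged it as important
    return "pending"          # keep waiting for signal from the first wave

people = [Recipient("alice", "high"), Recipient("bob", "high"), Recipient("carol", "low")]
msg = Broadcast("Widely broadcast announcement")
first_wave(msg, people)
record_vote(msg, is_spam=False)
record_vote(msg, is_spam=False)
print(second_wave(msg, people), msg.delivered_to)  # promoted ['alice', 'bob', 'carol']
```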

To some extent, our existing systems already approximate this approach. Mechanisms like favoriting and re-tweeting propagate signal from information scouts to their followers, as do algorithms that rank real-time information based on engagement. Still, as an information consumer, I’d appreciate an interface that explicitly and transparently adapts to my priorities, and that manages interruption of my workflow accordingly.

What do folks here think? Is information delayed tantamount to information denied? Or is time on our side, potentially offering us a better tradeoff than the one we experience today?


Holding Back the Rise of the Machines?

Amazon’s Mechanical Turk is one of my favorite examples of leveraging the internet for innovation:

Amazon Mechanical Turk is a marketplace for work that requires human intelligence. The Mechanical Turk web service enables companies to programmatically access this marketplace and a diverse, on-demand workforce. Developers can leverage this service to build human intelligence directly into their applications.

But, in my view, Mechanical Turk does not take its vision far enough. In the conditions of use, Amazon makes it clear that only humans need apply: “you will not use robots, scripts or other automated methods to complete the Services”.

On one hand, I can understand that Amazon’s vision for Mechanical Turk, like Luis von Ahn’s “games with a purpose”, explicitly aims to apply human intelligence to tasks where automated methods seem inadequate. On the other hand, what are automated methods but encapsulations of human methods? It seems odd for Amazon to be so particular about the human / machine distinction, especially given that the terms of service impose practically no other constraints on execution (beyond the obvious legal ones). Moreover, Mechanical Turk offers developers a variety of ways to assure quality (redundancy, qualification tests, etc.).

Granted, there are some important concerns that would have to be addressed if Amazon were to relax the “humans-only” constraint. For example, a developer today can reasonably assume that two different human “Providers” execute tasks independently. With automated participation, there’s a far greater risk of dependence–e.g., from multiple programmers applying the same algorithms. This possibility would have to be taken into account in quality assurance.
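
For what it’s worth, the standard quality-assurance move, assigning the same task redundantly and taking a majority vote, leans precisely on that independence assumption. A minimal sketch (mine, not Amazon’s API):

```python
from collections import Counter

def majority_answer(answers, min_agreement=0.6):
    """Aggregate redundant answers to a single task by majority vote.

    answers: list of (provider_id, answer) pairs for the same task.
    Returns the winning answer, or None if agreement is too weak.
    Note: if several "providers" are really scripts running the same
    algorithm, their errors are correlated and agreement overstates
    confidence, which is exactly the risk described above.
    """
    votes = Counter(answer for _, answer in answers)
    top_answer, count = votes.most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return top_answer
    return None

print(majority_answer([("p1", "cat"), ("p2", "cat"), ("p3", "dog")]))  # cat
```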

Still, the benefits of allowing automated participants would seem to far outweigh the risks. At pennies a task, Mechanical Turk has limited appeal to the human labor force–indeed, research by Panos Ipeirotis suggests that Amazon’s revenue from the service may be so low that it doesn’t even cover the costs of a single dedicated developer!

In contrast, there’s evidence that programmers would take an interest in participating, were it an option. Marketplaces like TopCoder and competitions like the Netflix Prize suggest that computer scientists are eager to prove their mettle on many of the kinds of tasks for which organizations already use Mechanical Turk.

So, why not give algorithms a chance? Surely we’re not that afraid of Skynet or the “technological singularity”. Let’s give machines–and their programmers–a chance to show off the best of both worlds!


Guest Demo: Eric Iverson’s Itty Bitty Search

I’m back from vacation, and still digging my way out of everything that’s piled up while I’ve been offline.

While I catch up, I thought I’d share with you a demo that Eric Iverson was gracious enough to share with me. It uses Yahoo! BOSS to support an exploratory search experience on top of a general web search engine.

When you perform a query, the application retrieves a set of related term candidates using Yahoo’s key terms API. It then scores each term by dividing its occurrence count within the result set by its global occurrence count–a relevance measure similar to one my former colleagues and I used at Endeca in enterprise contexts.
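
If I’ve read the approach correctly, the scoring step boils down to a local-to-global frequency ratio; a back-of-the-envelope sketch (the names are mine, not Eric’s):

```python
def score_terms(result_set_counts, global_counts, min_global=5):
    """Rank candidate refinement terms by local vs. global prevalence.

    result_set_counts: {term: occurrences within the current result set}
    global_counts:     {term: occurrences across the whole index}
    Terms common in the results but rare overall score highest, which is
    what makes them interesting as refinements.
    """
    scores = {}
    for term, local in result_set_counts.items():
        overall = global_counts.get(term, 0)
        if overall >= min_global:  # skip terms too rare to estimate reliably
            scores[term] = local / overall
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score_terms({"retrieval": 40, "the": 90}, {"retrieval": 200, "the": 1000000}))
```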

You can try out the demo yourself at http://www.ittybittysearch.com/. While it has rough edges, it produces nice results–especially considering the simplicity of the approach.

Here’s an example of how I used the application to explore and learn something new. I started with [“information retrieval”]. I noticed “interactive information retrieval” as a top term, so I used it to refine. Most of the refinement suggestions looked familiar to me–but an unfamiliar name caught my attention: “Anton Leuski”. Following my curiosity, I refined again. Looking at the results, I immediately saw that Leuski had done work on evaluating document clustering for interactive information retrieval. Further exploration made it clear this is someone whose work I should get to know–check out his home page!

I can’t promise that you’ll have as productive an experience as I did, but I encourage you to try Eric’s demo. It’s simple examples like these that remind me of the value of pursuing HCIR for the open web.

Speaking of which, HCIR 2010 is in the works. We’ll flesh out the details over the next few weeks, and of course I’ll share them here.


Vacation

Just letting readers know that I’ll be on vacation for the next week. If you are starved for reading materials, check out some of the blogs I read.


WSDM 2010: Day 3

Note: this post is cross-posted at BLOG@CACM.

Today is the last day of WSDM 2010, and I unfortunately spent it at home drinking chicken soup. But I’ve been following the conference via the proceedings and tweets.

The day started with a short session on temporal interaction. Topics included clustering social media documents (e.g., Flickr photos) based on their association with events, statistical tests for early identification of popular social media content, and analysis of answers sites (like Yahoo! Answers) as evolving two-sided economic markets.

The next session focused on advertising. Two papers dealt with click prediction: one proposing a Bayesian inference model to better predict click-throughs in the tail of the ad distribution; the other presenting a framework for personalized click models. Another paper addressed the closely related problem of predicting ad relevance. The remaining papers discussed other aspects of search advertising: one on estimating the value per click for channels like Google AdSense, where ad inventory is supplied by a third party; the other proposing an algorithmic approach to automate online ad campaigns based on landing page content.
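
I can’t do the papers justice in one sentence each, but the tail click-prediction problem is easy to appreciate: with only a handful of impressions, a raw click-through rate is hopelessly noisy, so some form of prior smoothing helps. A toy Beta-Binomial sketch of the general idea (mine, not the paper’s actual model):

```python
def smoothed_ctr(clicks, impressions, prior_ctr=0.01, prior_strength=100):
    """Beta-Binomial smoothing of click-through rate for sparse (tail) ads.

    Acts like a Beta prior with mean prior_ctr and prior_strength
    pseudo-impressions: with little data the estimate stays near the prior,
    with lots of data it approaches the raw click-through rate.
    """
    alpha = prior_ctr * prior_strength          # pseudo-clicks
    beta = (1 - prior_ctr) * prior_strength     # pseudo-non-clicks
    return (clicks + alpha) / (impressions + alpha + beta)

print(smoothed_ctr(1, 3))        # ~0.019, not the implausible raw 0.33
print(smoothed_ctr(300, 10000))  # ~0.030, dominated by the observed data
```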

The following session was on systems and efficiency, a popular topic given the immense data and traffic associated with web search. Two papers proposed approaches to help short-circuit ranking computations: one by optimizing the organization of inverted index entries to account for both the static ranks of documents and the upper bounds of term scores for all terms contained in each document; the other using early-exit strategies to optimize ensemble-based machine learning algorithms. Another used machine learning to mine rules for de-duplicating web pages based on URL string patterns. Another focused on compression, showing that web content is at least an order of magnitude more compressible than what gzip achieves. The last paper proposed a method to perform efficient distance queries on graphs (e.g., web graphs or social graphs) by pre-computing a collection of node-centered subgraphs.
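
The early-exit idea, at least as I understand it, is easy to sketch: score documents with a staged ranker, keep an upper bound on what the remaining stages can still contribute, and stop as soon as a document can no longer reach the current top-k threshold. A toy version in that spirit (not any paper’s actual algorithm):

```python
def staged_score(features, stages, remaining_max, threshold):
    """Score a document stage by stage, exiting early when it can't compete.

    stages:        scoring functions applied in order (think tree ensembles)
    remaining_max: remaining_max[i] is an upper bound on the total score
                   stages i, i+1, ... can still add
    threshold:     score of the current k-th best document
    Returns the exact score, or None if the document was pruned early.
    """
    score = 0.0
    for i, stage in enumerate(stages):
        if score + remaining_max[i] < threshold:
            return None  # even the best case cannot reach the top k
        score += stage(features)
    return score

# Toy usage: two stages with assumed per-stage upper bounds of 0.3 and 0.7.
stages = [lambda f: 0.3 * f["bm25"], lambda f: 0.7 * f["clicks"]]
remaining_max = [1.0, 0.7]
print(staged_score({"bm25": 0.9, "clicks": 0.8}, stages, remaining_max, 0.5))
```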

The last session of the conference covered various topics in web mining. One paper presented a system for identifying distributed search bot attacks. Another proposed an image search method using a combination of entity information and visual similarity. The final paper showed that shallow text features can be used for low-cost detection of boilerplate text in web documents.

All in all, WSDM 2010 was an excellent conference, and I’m sad not to have been able to attend more of it in person. I’m delighted to see an even mix of academic and industry representatives sharing ideas and working to make the web a better place for information access.


WSDM 2010: Day 2

Note: this post is cross-posted at BLOG@CACM.

Unfortunately, I woke up this morning rather under the weather, so I’m having to resort to remotely reporting on the second day of the WSDM 2010 conference, based on the published proceedings and the tweet stream.

The day started with a keynote from Harvard economist Susan Athey. Her research focuses on the design of auction-based markets, a topic core to the business of search, which relies largely on auction-based advertising models (cf. Google AdWords). Then came a session focused on learning and optimization. One paper proposed a method to learn ranking functions and query categorization simultaneously, reflecting that different categories of queries lead users to have different expectations about ranking. Another combined traditional list-based ranking with pair-wise comparisons between results to separate the results into tiers reflecting grades of relevance. An intriguing approach to query recommendation treated it as an optimization problem, perturbing users’ query-reformulation paths to maximize the expected value of a utility function over the search session. Another paper looked not at ranking per se, but rather at improving the quality of the training data used for machine-learned ranking. The final paper of the session, which earned a best-paper nomination, modeled document relevance based not on click-through behavior, but rather on post-click user behavior.

The next session was about users and measurement. It opened with another best-paper nominee: an analysis of over a hundred million users to understand how they re-find web content. Another offered a rigorous analysis of the often sloppily presented “long-tail” hypothesis: it found that light users disproportionately prefer content at the head of the distribution, while heavy users disproportionately prefer the tail. Another log-analysis paper analyzed search logs using a partially observable Markov model–a variant of the hidden Markov model in which not all of the hidden state transitions emit observable events–and compared the latent variables with eye-tracking studies. An intriguing study demonstrated that user behavior models are more predictive of goal success than models based on document relevance. The final paper of the session proposed methods for quantifying the reusability of the test collections that lie at the heart of information retrieval evaluation.
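
The long-tail result is the kind of claim that is easy to make operational, even if this back-of-the-envelope framing is mine and not the paper’s methodology: define the head as the most popular items, then compare how much of light versus heavy users’ activity lands there.

```python
from collections import Counter

def head_share_by_group(events, head_fraction=0.01, light_cutoff=10):
    """Compare head-content consumption for light vs. heavy users.

    events: list of (user_id, item_id) consumption events.
    The head is the top head_fraction of items by overall popularity;
    light users have at most light_cutoff events. Returns the average
    share of head-item consumption for each group.
    """
    item_counts = Counter(item for _, item in events)
    n_head = max(1, int(len(item_counts) * head_fraction))
    head = {item for item, _ in item_counts.most_common(n_head)}

    per_user = Counter(user for user, _ in events)
    head_hits = Counter(user for user, item in events if item in head)

    def avg_share(users):
        return sum(head_hits[u] / per_user[u] for u in users) / len(users) if users else 0.0

    light = [u for u, n in per_user.items() if n <= light_cutoff]
    heavy = [u for u, n in per_user.items() if n > light_cutoff]
    return {"light": avg_share(light), "heavy": avg_share(heavy)}
```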

The last session of the day focused on social aspects of search. Two of the papers were concerned with modeling authority and influence in social networks, a problem in which I take a deep personal interest. Another inferred attributes of social network users based on those of other users in their communities (cf. MIT’s Project Gaydar). Another analyzed Flickr and Last.fm user logs to show that users’ semantic similarity based on their tagging behavior is predictive of social links. The final paper tackled the sparsity of social media tags by inferring latent topics from shared tags and spatial information.

Not surprisingly, a disproportionate number of contributors to the conference work at major web search companies, who have both the motivation to improve results and the access to data that is needed for such research. One of the ongoing research challenges for the field is to find ways to make this data available to others while respecting the business concerns of search engine companies and the privacy concerns of their users.


WSDM 2010: Day 1

Note: this post is cross-posted at BLOG@CACM.

Today was the first day of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), held at the Polytechnic Institute of NYU in Brooklyn, NY. WSDM is a young conference that has already become a top-tier publication venue for research in these areas. In contrast to some of the larger conferences, WSDM is single-track and feels more intimate and coherent–even with over 200 attendees.

The day started with an ambitious keynote by Soumen Chakrabarti (IIT Bombay): “Bridging the Structured Un-Structured Gap”. He described a soup-to-nuts architecture to annotate web documents and perform complex reasoning on them using a structured query language. But perhaps this ambitious approach is a practical one: it uses the web we have–as opposed to waiting for the semantic web to emerge–and there is a prototype using half a billion documents.

The first paper session focused on web search. Of the five papers, two emphasized temporal aspects of content, one considered social media recommendation, and one focused on identifying concepts in multi-word queries. The last paper of the session proposed using anchor text as a more widely available input than query logs to support the query reformulation process. It also attracted the most audience attention–while interaction is often a niche topic at information retrieval conferences, it always elicits strong interest and opinions.

The following session focused on tags and recommendations. Some take-aways: users produce tags similar to the topics designed by experts; individual “personomies” can be translated into aggregated folksonomies; matrix factorization methods can produce interpretable recommendations.

The last session of the day covered information extraction. One of the papers used pattern-based information extraction approaches, demonstrating how far we’ve come since Marti Hearst’s seminal work on the subject. Another offered a SQL-like system for typed-entity search, complete with a live, publicly accessible prototype. The final paper addressed an issue that came up repeatedly at the SSM workshop: the problem of distilling the truth from a collection of inconsistent sources.

After a full day of talks, we headed to The Park for an excellent banquet. I’m looking forward to another two days of great sessions.

Report on the Third Workshop on Search and Social Media (SSM 2010)

Note: this post is cross-posted at BLOG@CACM.

It is my pleasure to report on the 3rd Annual Workshop on Search in Social Media (SSM 2010), a gathering of information retrieval and social media researchers and practitioners in an area that has captured the interest of computer scientists, social scientists, and even the broader public. The one-day workshop took place at the Polytechnic Institute of NYU in Brooklyn, NY, co-located with the ACM Conference on Web Search and Data Mining (WSDM 2010). The quality of the presenters, the overbooked registration, and the hundreds of live tweets with the #ssm2010 hashtag all attest to the success of this event.

The workshop opened with a warm welcome from Ian Soboroff (NIST), immediately followed by a keynote from Jan Pedersen, Chief Scientist of Bing Search. Jan established a clear business case for search in social media: the opportunity to deliver content that is fresh, local, and under-served by general web search. He drilled into particular types of content where social media search is most useful: expert opinions, breaking news, and tail content. The benefits of social media search include trust and personal interaction (as compared to web content that is often soulless and of uncertain provenance), low latency (though perhaps at the cost of accuracy), and access to niche or ephemeral information that web search rarely surfaces. But delivering social media results to searchers creates its own variety of challenges, such as weighing freshness against accuracy and relevance, coping with loss of social content’s conversational context, managing low update latency when search engines have not been optimized for it, and fighting new kinds of spam. Despite these challenges, it is clear that the major web search engines have embraced the brave new world of real-time social content.

Eugene Agichtein (Emory University) then moderated a panel representing the world’s leading search engines: Jeremy Hylton (Google), Matthew Hurst (Microsoft), Sihem Amer-Yahia (Yahoo!), and William Chang (Baidu). Jeremy justified the universal interface approach, pointing out that users don’t want to have to figure out what kind of search site to use for their queries, and that they expect a familiar interface. He also noted that Google has made great strides on update latency: it can index the Twitter firehose in the same amount of time as serving a query. Matthew offered various analyses of the social search problem, based on whether the information signal resides in content (e.g., web) or attention (e.g., Twitter), or whether the information need is expressed in an explicit search query or inferred from the user’s context. Sihem offered a counter-point to Jeremy, arguing that social media search queries often represent broad or vague information needs, and thus call for a more browsing-oriented interface than web search, which is optimized for highly specific needs. William noted that the biggest competitive threat he sees for web search engines comes from social media players–and he credits much of Baidu’s success to its surfacing of social media content.

Then came a flurry of questions, perhaps the most interesting of which was how to address identity management. William argued that people prefer interacting with real-named (or pseudonymous) people to whom they are directly connected. Sihem offered the counter-example of obtaining recommendations through community aggregation. Matthew noted the incongruity of there being no economic relationship between the social network companies that maintain proprietary social graphs and the people whose identities and relationships those graphs represent. Jeremy pointed out that users benefit if the data is as open as possible.

Given the almost even split between academic and industry participation in the workshop, the panelists were also asked to present research challenges to academia. Jeremy posed the problem of determining when social media results are actually true. Matthew wants to see more interdisciplinary work between computer scientists and social scientists. Sihem offered two challenge problems: scalable community discovery and evaluation of collaborative recommendation systems. William wants to see a rigorous axiomatization of social media search behavior.

After lunch, Jeremy Pickens (FXPAL) moderated a panel representing social media / networking companies: Hilary Mason (bit.ly), Igor Perisic (LinkedIn), and David Hendi (MySpace). Hilary noted that, while bit.ly does not have access to an explicit social graph, it captures implicit connections from user behavior that may not be represented in the graph. Jeremy asked the panelists how much a person’s extended network matters; David and Igor pointed out research indicating correlations of mood and even medical conditions between people and their third-degree connections. Again, the audience was full of questions, especially for Igor. As a fan of faceted search, I was glad to see him touting LinkedIn’s success in making faceted search the primary means of performing people search on the site. For an in-depth view, I recommend “LinkedIn Search: A Look Beneath the Hood“.

The afternoon continued with a poster / demo session emphasizing work in progress: tools, interfaces, research studies, and position papers. I particularly enjoyed listening to the stream of interaction between academic researchers and industry practitioners.

The final panel session assembled academic researchers to discuss their views of the challenges in social media. Gene Golovchinsky (FXPAL) moderated a panel comprised of Meena Nagarajan (Wright State University), Liangjie Hong (Lehigh University), Richard McCreadie (University of Glasgow), Jonathan Elsas (CMU), and Mor Naaman (Rutgers University). Meena highlighted the need to build up meta-data to describe the context around social utterances. Liangjie took a position similar to William Chang’s, calling for a framework to model the tasks and behavior of users who interact with social media. Richard focused on the intersection of social media and news search, and noted that some of the most useful information is private and proprietary (e.g., search and chat logs). Jonathan offered a variety of challenges: determining the right retrieval granularity, managing multiple axes of organization, aggregating author behavior, and multidimensional indexing of social media content. Finally, Mor noted that we’re moving from a world of email to a “social awareness stream”, in which we direct content at a group and have lower expectations of readership than with email. As with all of the panels, there were countless questions from the moderator and audience, particularly about determining the truthfulness of social media content and delivering social content in an effective user interface.

The final session was a full-group discussion that dived into the various topics addressed throughout the day. But Gene Golovchinsky provided the “one more thing” at the end, showing us a glimpse of a faceted search interface for exploring a Twitter stream. It was an elegant finish to a day filled with informative and engaging discussion, and I look forward to seeing many of the participants at the WSDM conference over the next few days.


Blogging SSM 2010 and WSDM 2010

I’m delighted to report that I’ll be blogging about the Search and Social Media Workshop (SSM 2010) and the Web Search and Data Mining Conference (WSDM 2010) for Communications of the ACM.

Of course, I’ll cross-post here. I also encourage folks to follow the live tweet streams at #ssm2010 and #wsdm2010, as well as Gene and Jeremy’s posts at the FXPAL blog.

To those attending: see you all tomorrow through Saturday! To everyone else: I will try my best to communicate the substance and spirit of the conference.