Guided Exploration = Faceted Search, Backwards

Information Scent

In the early 1990s, PARC researchers Peter Pirolli and Stuart Card developed the theory of information scent (more generally, information foraging) to evaluate user interfaces in terms of how well users can predict which paths will lead them to useful information. Like many HCIR researchers and practitioners, I’ve found this model to be a useful way to think about interactive information seeking systems.

Specifically, faceted search is an exemplary application of the theory of information scent. Faceted search allows users to express an information need as a keyword search, providing them with a series of opportunities to improve the precision of the initial result set by restricting it to results associated with particular facet values.

For example, if I’m looking for folks to hire for my team, I can start my search on LinkedIn with the keywords [information retrieval], restrict my results to Location: San Francisco Bay Area, and then further restrict to School: CMU.
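
To make that flow concrete, here is a minimal sketch in Python. It is not LinkedIn’s API; the records, field names, and values are made up for illustration. Each facet selection simply filters the current result set, trading recall for precision.

```python
# A toy corpus of people with two facet fields (hypothetical data).
results = [
    {"name": "Alice", "location": "San Francisco Bay Area", "school": "CMU"},
    {"name": "Bob",   "location": "New York City Area",     "school": "MIT"},
    {"name": "Carol", "location": "San Francisco Bay Area", "school": "CMU"},
]

def narrow(results, facet, value):
    """Restrict the result set to results whose facet has the given value."""
    return [r for r in results if r.get(facet) == value]

hits = narrow(results, "location", "San Francisco Bay Area")
hits = narrow(hits, "school", "CMU")
print([r["name"] for r in hits])  # ['Alice', 'Carol']
```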

Precision / Recall Asymmetry

Faceted search is a great tool for information seeking systems. But it offers a flow that is asymmetric with respect to precision and recall.

Let’s invert the flow of faceted search. Rather than starting from a large, imprecise result set and progressively narrowing it, let’s start from a small, precise result set and progressively expand it. Since faceted search is often called “guided navigation” (a term Fritz Knabe and I coined at Endeca), let’s call this approach “guided exploration” (which has a nicer ring than “guided expansion”).

Guided exploration exchanges the roles of precision and recall. Faceted search starts with high recall and helps users increase precision while preserving as much recall as possible. In contrast, guided exploration starts with high precision and helps users increase recall while preserving as much precision as possible.

That sounds great in theory, but how can we implement guided exploration in practice?

Let’s remind ourselves why faceted search works so well. Faceted search offers the user information scent: the facet values help the user identify regions of higher precision relative to his or her information need. By selecting a sequence of facet values, the user arrives at a non-empty set that consists entirely or mostly of relevant results.

How to Expand a Result Set

How do we invert this flow? Just as enlarging an image is more complicated than reducing one, increasing recall is more complicated than increasing precision.

If our initial set is the result of selecting multiple facet values, then we may be able to increase recall by de-selecting facet values (e.g., de-selecting San Francisco Bay Area and CMU in my previous example). If we are using hierarchical facets, then rather than de-selecting a facet value, we may be able to replace it with a parent value (e.g., replacing San Francisco Bay Area with California). We can also remove one or more search keywords to broaden the results (e.g., dropping information or retrieval).

Those are straightforward query relaxations. But there are more interesting ways to expand our results (several of these operations are sketched in code after this list):

  • We can replace a facet value with the union of that value and similar values (e.g., replacing CMU with CMU OR MIT).
  • We can replace the entire query (or any subquery) with the union of that query and the results of selecting a single facet value (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR Company: Google).
  • We can replace the entire query (or any subquery) with the union of that query and the results of a keyword search (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR [faceted search]).
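
Here is a minimal sketch of candidate generation for the simpler operations: de-selecting a facet value, replacing it with a parent, removing a keyword, and unioning a value with similar values from the same facet. The query representation, PARENT hierarchy, and SIMILAR lists are all hypothetical, and whole-query unions would need a richer query representation than this sketch uses.

```python
from copy import deepcopy

PARENT = {"San Francisco Bay Area": "California"}   # hypothetical facet hierarchy
SIMILAR = {"CMU": ["MIT"]}                           # hypothetical similarity lists

def expansion_candidates(query):
    """Yield broadened variants of query = {'keywords': set, 'facets': {facet: set of values}}."""
    for facet, values in query["facets"].items():
        # De-select the facet entirely.
        q = deepcopy(query)
        del q["facets"][facet]
        yield q
        for value in values:
            # Replace a value with its parent in the facet hierarchy.
            if value in PARENT:
                q = deepcopy(query)
                q["facets"][facet] = (values - {value}) | {PARENT[value]}
                yield q
            # Union the value with similar values from the same facet.
            for other in SIMILAR.get(value, []):
                q = deepcopy(query)
                q["facets"][facet] = values | {other}
                yield q
    for keyword in query["keywords"]:
        # Remove a keyword.
        q = deepcopy(query)
        q["keywords"] = query["keywords"] - {keyword}
        yield q

query = {"keywords": {"information", "retrieval"},
         "facets": {"Location": {"San Francisco Bay Area"}, "School": {"CMU"}}}
for candidate in expansion_candidates(query):
    print(candidate)
```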

As we can see, there are many ways to progressively refine a query in a way that expands the result set. The question is how we provide users with options that increase recall while preserving as much precision as possible.

Frequency : Recall :: Similarity : Precision

Developers of faceted search systems don’t necessarily invest much thought into deciding which faceted refinement options to present to users. Some systems simply avoid dead ends, offering users all refinement options that lead to a non-empty result set. This approach breaks down when there are too many options, in which case most systems offer users the most frequent facet values. A chapter in my faceted search book discusses some other options.
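
As a rough illustration, here is a minimal sketch of that frequency heuristic; the field access is hypothetical and this is not any particular engine’s implementation.

```python
from collections import Counter

def top_refinements(results, facet, k=5):
    """Offer the k most frequent values of a facet in the current result set.
    Frequent values tend to preserve recall, and non-zero counts avoid dead ends."""
    counts = Counter(r[facet] for r in results if facet in r)
    return counts.most_common(k)
```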

Unfortunately, the number of options for guided exploration — at least if we go beyond the very limited basic options — is too vast to apply such a naive approach. Unions never lead to dead ends, and we don’t have a simple measure like frequency to rank our options.

Or perhaps we do. A good reason to favor frequent values as faceted refinement options is that they tend to preserve recall. What we need is a measure that tends to preserve precision when we expand a result set.

That measure is set similarity. More specifically, it is the asymmetric similarity between a set and a superset containing it, which we can think of as how well the smaller set represents the larger one. If we are working with facets, we can measure this similarity in terms of differences between distributions of the facet values. If the current set has high precision, we should favor supersets that are similar to it in order to preserve precision.

I’ll spare readers the math, but I encourage you to read about Kullback-Leibler divergence and Jensen-Shannon divergence if you are not familiar with measures of similarity between probability distributions. I’m also glossing over key implementation details — such as how to model distributions of facet values as probability distributions, and how to handle smoothing and normalization for set size. I’ll try to cover these in future posts. But for now, let’s assume that we can measure the similarity between a set and a superset.
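
For readers who want something concrete, here is a minimal sketch that treats a facet’s values over a result set as a smoothed probability distribution and compares two sets with Jensen-Shannon divergence. The additive smoothing is a simplifying assumption; it glosses over the modeling details mentioned above.

```python
import math
from collections import Counter

def facet_distribution(results, facet, alpha=1.0, vocab=None):
    """Smoothed probability distribution of a facet's values over a result set.
    alpha is an additive (Laplace) smoothing constant; vocab fixes the support."""
    counts = Counter(r[facet] for r in results if facet in r)
    vocab = vocab or set(counts)
    total = sum(counts.values()) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), summed over the support of p."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)

def js(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded measure of drift."""
    m = {v: 0.5 * (p.get(v, 0) + q.get(v, 0)) for v in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```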

Guided Exploration: A General Framework

We now have the elements to put together a general framework for guided exploration (a code sketch follows the list):

  • Generate a set of candidate expansion options from the current search query using operations such as the following:
    • De-select a facet value.
    • Replace a facet value with its parent.
    • Replace a facet value with the union of it and other values from that facet.
    • Remove a search keyword.
    • Replace a search keyword with the union of it and related keywords.
    • Replace the entire query with the union of it and a related facet value selection.
    • Replace the entire query with the union of it and a related keyword search.
  • Evaluate each expansion option based on the similarity of the resulting set to the current one.
  • Present the most similar sets to the user as expansion options.
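
Here is a minimal sketch of that loop. It assumes the helpers sketched earlier in this post (expansion_candidates, facet_distribution, and js) plus a search(query) function standing in for whatever retrieval backend you have; it is an illustration of the framework, not a production implementation.

```python
def guided_exploration(query, search, facet="company", top_k=3):
    """Rank candidate expansions of a query by how little they drift from the
    facet-value distribution of the current results (a proxy for precision)."""
    current = search(query)
    baseline = facet_distribution(current, facet)
    scored = []
    for candidate in expansion_candidates(query):
        expanded = search(candidate)
        if len(expanded) <= len(current):      # keep only genuine expansions
            continue
        drift = js(baseline, facet_distribution(expanded, facet))
        scored.append((drift, len(expanded), candidate))
    scored.sort(key=lambda item: item[0])      # least drift first
    return scored[:top_k]
```

The least-drift candidates are the ones to present as expansion options, ideally together with an explanation of how they differ from the current set, which brings us to visualization.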

Visualizing Drift

It’s one thing to tell a user that two sets are distributionally similar based on an information-theoretic measure, and another to communicate that similarity in a language the user can understand. Here is an example of visualizing the similarity between [information retrieval] AND School: CMU and [information retrieval] AND School: (CMU OR MIT):

As we can see from even this basic visualization, replacing CMU with (CMU OR MIT) increases the number of results by 70% while keeping a similar distribution of current companies — the notable exception being people who work for their almae matres.
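
The visualization itself is not reproduced here, but as a minimal sketch, one could render the same information as a side-by-side comparison of facet-value shares, reusing facet_distribution from the earlier sketch (the facet name is hypothetical):

```python
def print_drift(current, expanded, facet="company", top=5):
    """Print how the facet-value shares shift when the result set is expanded."""
    p = facet_distribution(current, facet)
    q = facet_distribution(expanded, facet)
    growth = (len(expanded) - len(current)) / len(current)
    print(f"result set grows by {growth:.0%}")
    for value in sorted(q, key=q.get, reverse=True)[:top]:
        print(f"{value:25s} {p.get(value, 0):6.1%} -> {q[value]:6.1%}")
```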

Conclusion

Faceted search offers some of the most convincing evidence in favor of Gary Marchionini’s advocacy that we “empower people to explore large-scale information bases but demand that people also take responsibility for this control”. Guided exploration aims to generalize the value proposition of faceted search by inverting the roles of precision and recall. Given the importance of recall, I hope to see progress in this direction. If this is a topic that interests you, give me a shout, especially if you’re a student looking for an internship this summer!

Next Play!

Every year brings its own adventures, but for me 2011 will be a tough act to follow.

A year ago, I’d just started working at LinkedIn, and my biggest concern was selling our apartment in Brooklyn so that my family could join me in California.

Little did I imagine that my new manager, who had just recruited me from Google to LinkedIn (and persuaded my family to change coasts!), would leave three months later for a startup. Welcome to Silicon Valley! At the time, I felt unready for the abrupt transition into the product executive team. In retrospect, I’m thankful for the kick in the pants that helped me transform my role and brought the best out of a great team.

Summer brought the excitement of LinkedIn’s IPO. The process was exhilarating, especially to someone who had worked for over a decade at a pre-IPO company.

Nonetheless, we didn’t let the IPO distract us from our mission. In March, we celebrated our 100 millionth member; by November, we passed 135 million. And lots more. We released new data products like Skills and Alumni. We won the OSCON Data Innovation Award for contributions to open source software for big data. We also acquired a few companies, including search engine startup IndexTank. In short, we heeded the two short words on the back of our commemorative IPO t-shirts: “next play”.

Fall was an intense season of conferences. Between CIKM, HCIR, RecSys, Strata, and a talk at CMU, it was a great opportunity to connect and reconnect with researchers and practitioners around the world. I am particularly proud of the success of this year’s HCIR workshop, which showed how much the workshop (now to become a 2-day symposium!) has grown up in five years.

But what capped my year off was seeing Endeca, the company I helped start in 1999, become one of Oracle’s largest acquisitions. Even though it’s been two years since I left, Endeca will always be a core facet of my professional identity. I look forward to great things from all the folks I worked with.

That brings us to 2012, ready to start a new year of adventures. Tough or not, our job is to make every new year more amazing than the previous ones. I’m ready for the challenge, and I hope you are too.

Here’s a teaser of what I have planned:

  • My team at LinkedIn is launching into 2012 with a strong focus on derived data quality and relevance. As regular readers know, I see data quality and richer interfaces for information seeking as inseparable concerns. And, speaking of quality, we’re hiring!
  • I’ll be speaking at Strata in a couple of months with Claire Hunsaker of Samasource about “Humans, Machines, and the Dimensions of Microwork”. I’m very excited to talk about the intersection of crowdsourcing and data science. And I’ll be joined by three of LinkedIn’s top data scientists: Monica Rogati, Sam Shah, and Pete Skomoroch.
  • I’ll be co-chairing the RecSys Industry Track this fall with Yehuda Koren. I’m honored to have the opportunity to work with Yehuda, who was part of the Netflix Grand Prize team and won best paper at RecSys 2011. We’re still putting together the program, but you can look at last year’s program to get an idea of what’s in store.
  • I’ll be at the CIKM Industry Event, this time as an invited speaker. CIKM will take place in Maui this fall, and I’m excited about the program that Evgeniy Gabrilovich is putting together for the Industry Event. It will be an all-invited program, just like last year.

I hope you’re also starting 2012 with a fresh sense of purpose. Let’s take a last moment to reflect on a great 2011, and then…NEXT PLAY!

HCIR 2011: Now on YouTube!

The Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011), held on October 20th at Google’s main campus in Mountain View, California, was a resounding success. We had almost a hundred people presenting a wide array of papers, posters, and challenge entries. You can read my summary of the event in an earlier blog post: “HCIR 2011: We Have Arrived!”.

Better yet, you can now, for the first time in the workshop’s history, watch videos of the presentations. Embedded below are videos of Gary Marchionini’s keynote address and of the two paper presentation sessions. Thanks again to Google for being such a gracious host — now online as well as offline!

Keynote

Morning Presentations

Afternoon Presentations

Jim Adler: The Accidental Chief Privacy Officer

Privacy is the third rail of the cloud. On one hand, the ease of sharing information and the power of analytics have produced extraordinary value for consumers, as well as great business models for companies that serve those consumers. On the other hand, people have good reason to worry about the unintended consequences of over-sharing.

When I attended the O’Reilly Strata New York Conference in September, I had the pleasure of meeting Intelius’s Jim Adler and hearing him talk about being his company’s “accidental chief privacy officer”. Intelius’s main product is people search — an area that naturally brings up privacy concerns, especially since Intelius aggregates and publishes information about people from databases of public records, eroding a history of “privacy through difficulty”. Impressed with Jim’s talk at Strata, I persuaded him to deliver a similar talk at LinkedIn, the video of which you can find above. You can also find his slides on SlideShare.

Jim brings nuance to the discussion of privacy — nuance that discussions of online privacy often lack. For example, he responded to the recent controversy about social networks’ “real names” policy with a measured post entitled “Nyms, Pseudonyms, or Anonyms? All of the Above“.

Jim appropriately opened his talk by disclosing a personal example. He shares his name with a more prominent personal injury lawyer who dominates search results for that name, raising the potential of taint by association. Avoiding such conflations is Intelius’s core technical problem: clustering the inputs from the sources it aggregates so that each person maps to exactly one record in its database.

Jim went on to note that we are at a stage in the privacy debate where we are likely to see more regulation. He makes a few key observations:

  • Social norms, which form the basis of our laws and regulations (the notion of a “reasonable expectation of privacy”), have changed suddenly, leading to a “privacy vertigo” in which the whole world now feels like a small town.
  • Sharing is a gateway from private to public, which often leads to violation of expectations. This problem is not new, but the efficiency of online sharing dramatically amplifies the unintended consequences of sharing. It is crucial that the parties involved in sharing data also have shared expectations around how that data will be used or disclosed.
  • We need to distinguish between data use and data access, and not try to regulate data use with data access regulations. He cites the Fair Credit Reporting Act, which regulates data use, as one of the most inspired laws of the last 40 years. If you don’t have time to listen to the whole talk, I recommend you jump to 25:12, where he discusses this law in detail.

There’s a lot more in the talk, so I’m not going to try to summarize it all here. I strongly encourage you to check out the video (which includes lengthy Q&A) and the slides. Better yet, let’s use the comments to discuss!

CIKM 2011 Industry Event: Slides and Summaries

I’ve posted slides and summaries for all ten CIKM 2011 Industry Event presentations:

CIKM 2011 Industry Event: Ilya Segalovich on Improving Search Quality at Yandex

This post is the last in a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The final talk of the CIKM 2011 Industry Event came from Yandex co-founder and CTO Ilya Segalovich, on “Improving Search Quality at Yandex: Current Challenges and Solutions”.

Yandex is the world’s #5 search engine. It dominates the Russian search market, where it has over 64% market share. Ilya focused on three challenges facing Yandex: result diversification, recency-specific ranking, and cross-lingual search.

For result diversification, Ilya focused on queries containing entities without any additional indicators of intent. He asserted that entities offer a strong but incomplete signal of query intent, and in particular that such queries often call for suggested reformulations. The first step in processing such a query is entity categorization. Ilya said that Yandex achieved almost 90% precision using machine learning, and over 95% precision by incorporating manually tuned heuristics. The second step is enumerating possible search intents for the identified category in order to optimize for intent-aware expected reciprocal rank. By diversifying entity queries, Yandex reduced abandonment on popular queries, increased click-through rates, and was able to highlight possible intents in result snippets.
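
For readers unfamiliar with the metric, here is a minimal sketch of intent-aware expected reciprocal rank using the standard formulation (Chapelle et al.). The intents, probabilities, and relevance grades below are made up; this is not Yandex’s implementation.

```python
def err(grades, g_max=4):
    """Expected Reciprocal Rank for one intent (Chapelle et al., 2009)."""
    score, p_not_stopped = 0.0, 1.0
    for rank, grade in enumerate(grades, start=1):
        r = (2 ** grade - 1) / (2 ** g_max)   # probability this result satisfies the user
        score += p_not_stopped * r / rank
        p_not_stopped *= (1 - r)
    return score

def err_ia(intent_probs, grades_by_intent, g_max=4):
    """Intent-aware ERR: per-intent ERR weighted by the intent distribution."""
    return sum(p * err(grades_by_intent[intent], g_max)
               for intent, p in intent_probs.items())

# Made-up example: an ambiguous entity query with two plausible intents.
intent_probs = {"official_site": 0.6, "news": 0.4}
grades_by_intent = {
    "official_site": [4, 0, 1, 0, 0],  # relevance grades of the top results per intent
    "news":          [0, 3, 0, 2, 0],
}
print(round(err_ia(intent_probs, grades_by_intent), 3))
```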

Ilya then talked about the problem of balancing recency and relevance in handling queries about current events. He sees recency ranking as a diversification problem, since a desire for recent content is a kind of query intent. A challenge in recency-specific ranking is predicting the recency sensitivity of the user for a given query. Yandex considers factors such as the fraction of results found that are at most 3 days old, the number of news results, spikes in the query stream, lexical cues (e.g., searches for “explosion” or “fire”), and Twitter trending topics. He also referred to a WWW 2006 paper he co-authored on extracting news-related queries from web query logs. These efforts led to measurable improvements in click-based metrics of user happiness.
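
As a purely illustrative sketch (not Yandex’s model), one could combine signals like these into a recency-sensitivity score; the feature names, weights, and bias below are hypothetical.

```python
import math

# Hypothetical weights over the kinds of signals described above.
RECENCY_WEIGHTS = {
    "fraction_results_under_3_days": 2.0,
    "news_result_count":             0.3,
    "query_volume_spike":            1.5,
    "has_lexical_cue":               1.0,   # e.g., "explosion", "fire"
    "is_twitter_trending":           1.2,
}

def recency_sensitivity(features, weights=RECENCY_WEIGHTS, bias=-2.0):
    """Logistic score in (0, 1): higher means boost recent results more."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

print(round(recency_sensitivity({
    "fraction_results_under_3_days": 0.6,
    "news_result_count": 5,
    "query_volume_spike": 1.0,
    "has_lexical_cue": 1.0,
    "is_twitter_trending": 0.0,
}), 2))
```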

Ilya talked about a variety of efforts to support cross-lingual search. Russian users enter a significant fraction (about 15%) of non-Russian queries, but many still prefer Russian-language results. For example, a search for a company name should return that company’s Russian-language home page if one is available. Yandex implements language personalization by learning a user’s language knowledge and using it as a factor in relevance computation. Yandex also uses machine translation to serve results for Russian-language queries when there are no relevant Russian-language results.

Ilya concluded by pitching the efforts that Yandex is making to participate in and support the broader information retrieval community, including running (and releasing data for) a relevance prediction challenge. It’s great to see a reminder that there is more to web search than Google vs. Bing, and refreshing to see how much Yandex shares its methodology and results with the IR community.

CIKM 2011 Industry Event: Vanja Josifovski on Toward Deep Understanding of User Behavior on the Web

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

Those of you who attended the SIGIR 2009 Industry Track had the opportunity to hear Yahoo researcher Vanja Josifovski make an eloquent case for ad retrieval as a new frontier of information retrieval. At the CIKM 2011 Industry Event, Vanja delivered an equally compelling presentation entitled “Toward Deep Understanding of User Behavior: A Biased View of a Practitioner”.

Vanja first offered a vision in which the web of the future will be your life partner, delivering a life-long, pervasive, personalized experience. Everything will be personalized, and the experience will pervade your entire online life — from your laptop to your web-enabled toaster.

He then brought us back to the state of personalization today. For search personalization, the low entropy of query intent makes it difficult — or too risky — to significantly outperform the baseline of non-personalized search. In his view, the action today is in content recommendation and ad targeting, where there is high entropy of intent and lots of room for improvement over today’s crude techniques.

How do we achieve these improvements? We need more data, larger scale, and better methods for reasoning about data. In particular, Vanja noted that the data we have today — searches, page views, connections, messages, purchases — represents only the user’s thin observable state. In contrast, we lack data about the user’s internal state, e.g., whether the user is jet-lagged or worried about government debt. Vanja said that the only way to get more data is to motivate users by creating value for them with it — i.e., social is give to get.

Of course, we can’t talk about users’ hidden data without thinking about privacy. Vanja asserts that privacy is not dead, but that it’s in hibernation. So far, he argued, we’ve managed with a model of industry self-governance with relatively minor impact from data leaks — especially as compared to the offline world. But he is apprehensive at the prospect of a major privacy breach inducing legislation that sets back personalization efforts for decades.

Vanja then talked about current personalization methods, including learning relationships among features, dimensionality reduction, and smoothing using external data. He argues that many of the models are mathematically very similar to one another, and that it is difficult to separate the relative merits of the models from other implementation details of the systems that use them.

Finally, Vanja touched on scale issues. He noted that the MapReduce framework imposes significant restrictions on algorithms used for personalization, and that we need the right abstractions for modeling in parallel environments.

Vanja concluded his talk by citing the role of CIKM as a conference in bringing together the communities that research deep user understanding, information retrieval, and databases. Given the exciting venue for next year’s conference, I’m sure we’ll continue to see CIKM play this role!

ps. My thanks to Jeff Dalton for live-blogging his notes.

CIKM 2011 Industry Event: Ed Chi on Model-Driven Research in Social Computing

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

Given the extraordinary ascent of all things social in today’s online world, we could hardly neglect this theme at the CIKM 2011 Industry Event. We were lucky to have Ed Chi, who recently left the PARC Augmented Social Cognition Group to work on Google+, present “Model-Driven Research in Social Computing”.

Ed warned us at the beginning of the talk that his focus would be on work he’d done prior to joining Google. Nonetheless, he offered an interesting collection of public statistics about social activity associated with Google properties: 360M words per day being published on Blogger, 150 years of YouTube video being watched every day on Facebook, and 40M+ people using Google+. Regardless of how Google has fared in the competition for social networking mindshare, Google is clearly no stranger to online social behavior.

Ed then dove into recent research that he and colleagues have done on Twitter activity. Since all of the papers he discussed are available online, I will only touch on highlights. I encourage you to read the full papers:

Ed talked at some length about language-dependent behavior on Twitter. For example, tweets in French are more likely to contain URLs than those in English, while tweets in Japanese are less likely (perhaps because the language is more compact relative to Twitter’s 140-character limit?). Tweets in Korean are far more likely to be conversational (i.e., explicitly mentioning or replying to other users) than those in English. These differences remind us to be cautious in generalizing our understanding of online social behavior from the behavior of English-speaking users. Ed also talked about cross-language “brokers” who tweet in multiple languages: he sees these as indicating connection strength between languages, as well as giving us insight into improving cross-language communication.

Ed then talked about ways to reduce information overload in social streams. These included Eddi, a tool for summarizing social streams, and zerozero88, a closed experiment to produce a personal newspaper from a tweet stream. In analyzing the results of the zerozero88 experiment, Ed and his colleagues found that the most successful recommendation strategy combined users’ self-voting with social voting by their friends of friends. They also found that users wanted both relevance and serendipity — a challenge since the two criteria often compete with one another.

Ed concluded by offering the following design rule: since interaction costs determine the number of people who participate in social activity, get more people into the system by reducing the interaction cost. He asserted that this is a key design principle for Google+.

My skepticism about Google’s social efforts is a matter of public record (cf. Social Utility, +/- 25% and Google±?). But hiring Ed Chi was a real coup for Google, and I’m optimistic about what he’ll bring to the Google+ effort.

ps. My thanks to Jeff Dalton for live-blogging his notes.

CIKM 2011 Industry Event: David Hawking on Search Problems and Solutions in Higher Education

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

One of the recurring themes at the CIKM 2011 Industry Event was that not all search is web search. Stephen Robertson, in advocating why recall matters, noted that web search was exceptional rather than typical as an information retrieval domain. Khalid Al-Kofahi spoke about the challenges of legal search. Focusing on a different vertical, Funnelback Chief Scientist David Hawking spoke about “Search Problems and Solutions in Higher Education”.

David spent most of the presentation focusing on work that Funnelback did for the Australian National University. Funnelback was originally developed by CSIRO and the ANU under the name Panoptic.

The ANU has a substantial web presence, comprising hundreds of sites and over a million pages. Like many large sites, it suffers from propagation delay: the most important pages are fresh, but material on the outposts can be stale. Moreover, there is broad diversity of authorship.

The university also has a strong editorial stance for ranking search results: the search engine needs to identify and favor official content. Given the proliferation of unofficial content, it can be a challenge to identify official sites based on signals like incoming link count, click counts, and the use of official style templates.

David described a particular application that Funnelback developed for ANU: a university course finder. The problem is similar to that of ecommerce search and calls for similar solutions, e.g., faceted search, auto-complete, and suggestions of related queries. And, just as in ecommerce, we can evaluate performance in terms of conversion rate.

David ended his talk by touching on expertise finding (a problem I think about a lot as a LinkedIn data scientist!) and showing demos. And, while I no longer work in enterprise search myself, I still appreciate its unique challenges. I’m glad that David and his colleagues are working to overcome those challenges, especially in a domain as important as education.

CIKM 2011 Industry Event: Ben Greene on Large Memory Computers for In-Memory Enterprise Applications

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

Large-scale computation was, not surprisingly, a major theme at the CIKM 2011 Industry Event. Ben Greene, Director of SAP Research Belfast, delivered a presentation on “Large Memory Computers for In-Memory Enterprise Applications”.

Ben started by defining in-memory computing as “technology that allows the processing of massive quantities of real time data in the main memory of the server to provide immediate results from analyses and transactions”. He then asked whether the cloud enables real-time computing, since there is a clear market hunger for cloud computing to solve the problems of our current enterprise systems.

Not surprisingly, he advocated in-memory computing as the solution for those problems. Like John Ousterhout and the RAMCloud team, he sees the need to scale DRAM independently of physical boxes. He proposed a model of coherent shared memory, using high-speed, low-latency networks and moving the data transport and cache layers into a separate tier below the operating system. The goal: no server-side application caches, DRAM-like latency for physically distributed databases, and in fact no separation between the application server and the database server.

Ben argued that coherent shared memory can dramatically lower the cost of in-memory computing while minimizing the pain for application developers. He also offered some benchmarks for SAP’s BigIron system to demonstrate the performance improvements.

In short, Ben offered a vision of in-memory computing as a reincarnation of the mainframe. It was an interesting and provocative presentation, and my only regret is that we couldn’t stage a debate between him and Jeff Hammerbacher over the future of large-scale enterprise computing.