
HCIR 2011: Now on YouTube!

The Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011), held on October 20th at Google’s main campus in Mountain View, California, was a resounding success. We had almost a hundred attendees, presenting a wide array of papers, posters, and challenge entries. You can read my summary of the event in an earlier blog post: “HCIR 2011: We Have Arrived!”.

Better yet, you can now, for the first time in the workshop’s history, watch videos of the presentations. Embedded below are videos of Gary Marchionini’s keynote address and of the two paper presentation sessions. Thanks again to Google for being such a gracious host — now online as well as offline!

Keynote

Morning Presentations

Afternoon Presentations


Jim Adler: The Accidental Chief Privacy Officer

Privacy is the third rail of the cloud. On one hand, the ease of sharing information and the power of analytics have produced extraordinary value for consumers, as well as great business models for companies that serve those consumers. On the other hand, people have good reason to worry about the unintended consequences of over-sharing.

When I attended the O’Reilly Strata New York Conference in September, I had the pleasure of hearing Intelius’s Jim Adler talk about being his company’s “accidental chief privacy officer”. Intelius’s main product is people search — an area that naturally raises privacy concerns, especially since Intelius aggregates and publishes information about people from databases of public records, eroding a history of “privacy through difficulty”. Impressed with Jim’s talk at Strata, I persuaded him to deliver a similar talk at LinkedIn, the video of which you can find above. You can also find his slides on SlideShare.

Jim brings nuance to the discussion of privacy — nuance that discussions of online privacy often lack. For example, he responded to the recent controversy about social networks’ “real names” policy with a measured post entitled “Nyms, Pseudonyms, or Anonyms? All of the Above”.

Jim appropriately opened his talk by disclosing a personal example: he shares his name with a more prominent personal injury lawyer who dominates search results for that name, raising the potential of taint by association. That confusion illustrates Intelius’s core technical problem: clustering the inputs from the sources it aggregates so that each person maps to exactly one record in its database.
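
To make that concrete, here is a toy sketch of the record-linkage problem, assuming invented record fields and a deliberately naive matching rule; production systems learn matchers over many more signals:

```python
# Cluster records so each person maps to exactly one profile. The rule
# below (same name plus a shared address) is a toy for illustration.
from itertools import combinations

def same_person(a, b):
    return a["name"] == b["name"] and bool(set(a["addresses"]) & set(b["addresses"]))

def cluster_records(records):
    """Group records by the transitive closure of pairwise matches (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_person(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())

records = [
    {"name": "Jim Adler", "addresses": ["123 Main St"]},
    {"name": "Jim Adler", "addresses": ["456 Oak Ave"]},  # a different Jim Adler
    {"name": "Jim Adler", "addresses": ["123 Main St", "789 Pine Rd"]},
]
print(cluster_records(records))  # two clusters: records 0 and 2 merge; record 1 stands alone
```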

Jim went on to note that we are at a stage in the privacy debate where we are likely to see more regulation. He made a few key observations:

  • Social norms, which form the basis of our laws and regulations (the notion of a “reasonable expectation of privacy”), have changed rapidly, leading to a “privacy vertigo” in which the whole world now feels like a small town.
  • Sharing is a gateway from private to public, which often leads to violation of expectations. This problem is not new, but the efficiency of online sharing dramatically amplifies the unintended consequences of sharing. It is crucial that the parties involved in sharing data also have shared expectations around how that data will be used or disclosed.
  • We need to distinguish between data use and data access, and not try to regulate data use with data-access regulations. He cites the Fair Credit Reporting Act, which regulates data use, as one of the most inspired laws of the last 40 years. If you don’t have time to listen to the whole talk, I recommend you jump to 25:12, where he discusses this law in detail.

There’s a lot more in the talk, so I’m not going to try to summarize it all here. I strongly encourage you to check out the video (which includes lengthy Q&A) and the slides. Better yet, let’s use the comments to discuss!


CIKM 2011 Industry Event: Slides and Summaries

I’ve posted slides and summaries for all ten CIKM 2011 Industry Event presentations.


CIKM 2011 Industry Event: Ilya Segalovich on Improving Search Quality at Yandex

This post is the last in a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The final presentation of the CIKM 2011 Industry Event came from Yandex co-founder and CTO Ilya Segalovich: “Improving Search Quality at Yandex: Current Challenges and Solutions”.

Yandex is the world’s #5 search engine. It dominates the Russian search market, where it has over 64% market share. Ilya focused on three challenges facing Yandex: result diversification, recency-specific ranking, and cross-lingual search.

For result diversification, Ilya focused on queries containing entities without any additional indicators of intent. He asserted that entities offer a strong but incomplete signal of query intent, and in particular that entity queries often call for suggested reformulations. The first step in processing such a query is entity categorization. Ilya said that Yandex achieved almost 90% precision using machine learning, and over 95% precision by incorporating manually tuned heuristics. The second step is enumerating possible search intents for the identified category in order to optimize for intent-aware expected reciprocal rank. By diversifying entity queries, Yandex reduced abandonment on popular queries, increased click-through rates, and was able to highlight possible intents in result snippets.
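
For readers unfamiliar with the metric, here is a minimal sketch of intent-aware expected reciprocal rank (ERR-IA, following Chapelle et al., 2009), which averages per-intent ERR weighted by the intent distribution; the intents and relevance grades below are invented for illustration, not Yandex’s:

```python
def err(grades, max_grade=4):
    """Expected reciprocal rank for a single intent."""
    score, p_continue = 0.0, 1.0
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / (2 ** max_grade)  # satisfaction probability
        score += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return score

def err_ia(intent_probs, grades_by_intent):
    """Average per-intent ERR, weighted by the intent distribution."""
    return sum(p * err(grades_by_intent[intent]) for intent, p in intent_probs.items())

# An ambiguous entity query with two plausible intents:
intents = {"official_site": 0.6, "news": 0.4}
grades = {"official_site": [4, 0, 1], "news": [0, 3, 2]}  # grades per result position
print(err_ia(intents, grades))
```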

Ilya then talked about the problem of balancing recency and relevance in handling queries about current events. He sees recency ranking as a diversification problem, since a desire for recent content is a kind of query intent. A challenge in managing recency-specific ranking is predicting the recency sensitivity of the user for a given query. Yandex considers factors such as the fraction of results that are at most 3 days old, the number of news results, spikes in the query stream, lexical cues (e.g., searches for “explosion” or “fire”), and Twitter trending topics. He also referred to a WWW 2006 paper he co-authored on extracting news-related queries from web query logs. The results of these efforts led to measurable improvements in click-based metrics of user happiness.
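
Here is a hedged sketch of how such a recency-sensitivity predictor might combine the signals Ilya listed; the features, weights, and cue list are all invented, since Yandex’s actual model is not public:

```python
RECENCY_CUES = {"explosion", "fire", "earthquake", "crash"}

def recency_sensitivity(query, results, has_query_spike, is_trending_on_twitter):
    """Score in [0, 1]; higher means the query likely wants recent results."""
    n = max(len(results), 1)
    frac_fresh = sum(r["age_days"] <= 3 for r in results) / n  # results at most 3 days old
    frac_news = sum(r["is_news"] for r in results) / n
    has_lexical_cue = any(tok in RECENCY_CUES for tok in query.lower().split())
    # Hand-set weights for illustration; a production system would learn them.
    return (0.35 * frac_fresh + 0.25 * frac_news + 0.2 * has_lexical_cue
            + 0.1 * has_query_spike + 0.1 * is_trending_on_twitter)

results = [{"age_days": 1, "is_news": True}, {"age_days": 400, "is_news": False}]
print(recency_sensitivity("factory fire downtown", results, True, False))  # 0.6
```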

Ilya also talked about a variety of efforts to support cross-lingual search. Non-Russian queries make up a significant fraction (about 15%) of what Russian users enter, but many of those users still prefer Russian-language results. For example, a search for a company name should return that company’s Russian-language home page if one is available. Yandex implements language personalization by learning a user’s language knowledge and using it as a factor in relevance computation. Yandex also uses machine translation to serve results for Russian-language queries when there are no relevant Russian-language results.
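
As a toy illustration of language personalization as a ranking factor, one could blend base relevance with the user’s inferred proficiency in the result’s language; the blend and weights below are invented, since the talk did not specify Yandex’s formula:

```python
def personalized_score(base_relevance, result_lang, user_langs, alpha=0.2):
    proficiency = user_langs.get(result_lang, 0.0)  # inferred, in [0, 1]
    return (1 - alpha) * base_relevance + alpha * proficiency

user_langs = {"ru": 1.0, "en": 0.4}  # learned from the user's query and click history
print(personalized_score(0.8, "en", user_langs))  # 0.72
print(personalized_score(0.8, "ru", user_langs))  # 0.84
```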

Ilya concluded by pitching the efforts that Yandex is making to participate in and support the broader information retrieval community, including running (and releasing data for) a relevance prediction challenge. It’s great to see a reminder that there is more to web search than Google vs. Bing, and refreshing to see how much Yandex shares its methodology and results with the IR community.


CIKM 2011 Industry Event: Vanja Josifovski on Toward Deep Understanding of User Behavior on the Web

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

Those of you who attended the SIGIR 2009 Industry Track had the opportunity to hear Yahoo researcher Vanja Josifovski make an eloquent case for ad retrieval as a new frontier of information retrieval. At the CIKM 2011 Industry Event, Vanja delivered an equally compelling presentation entitled “Toward Deep Understanding of User Behavior: A Biased View of a Practitioner”.

Vanja first offered a vision in which the web of the future will be your life partner, delivering a life-long, pervasive, personalized experience. Everything will be personalized, and that personalization will pervade your entire online life — from your laptop to your web-enabled toaster.

He then brought us back to the state of personalization today. For search personalization, the low entropy of query intent makes it difficult — or too risky — to significantly outperform the baseline of non-personalized search. In his view, the action today is in content recommendation and ad targeting, where there is high entropy of intent and lots of room for improvement over today’s crude techniques.

How do we achieve these improvements? We need more data, larger scale, and better methods for reasoning about data. In particular, Vanja noted that the data we have today — searches, page views, connections, messages, purchases — represent only the user’s thin observable state. In contrast, we lack data about the user’s internal state (e.g., is the user jet-lagged, or worried about government debt?). Vanja said that the only way to get more data is to motivate users by creating value for them with it — i.e., social is give to get.

Of course, we can’t talk about users’ hidden data without thinking about privacy. Vanja asserts that privacy is not dead, but that it’s in hibernation. So far, he argued, we’ve managed with a model of industry self-governance and relatively minor impact from data leaks — especially as compared to the offline world. But he is apprehensive at the prospect of a major privacy breach inducing legislation that sets back personalization efforts for decades.

Vanja then talked about current personalization methods, including learning relationships among features, dimensionality reduction, and smoothing using external data. He argues that many of the models are mathematically very similar to one another, and that it is difficult to isolate the relative merits of the models from the other implementation details of the systems that use them.

Finally, Vanja touched on scale issues. He noted that the MapReduce framework imposes significant restrictions on algorithms used for personalization, and that we need the right abstractions for modeling in parallel environments.

Vanja concluded his talk by citing the role of CIKM as a conference in bringing together the communities that research deep user understanding, information retrieval, and databases. Given the exciting venue for next year’s conference, I’m sure we’ll continue to see CIKM play this role!

P.S. My thanks to Jeff Dalton for live-blogging his notes.


CIKM 2011 Industry Event: Ed Chi on Model-Driven Research in Social Computing

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

Given the extraordinary ascent of all things social in today’s online world, we could hardly neglect this theme at the CIKM 2011 Industry Event. We were lucky to have Ed Chi, who recently left the PARC Augmented Social Cognition Group to work on Google+, presenting “Model-Driven Research in Social Computing”.

Ed warned us at the beginning of the talk that his focus would be on work he’d done prior to joining Google. Nonetheless, he offered an interesting collection of public statistics about social activity associated with Google properties: 360M words per day being published on Blogger, 150 years of YouTube video being watched every day on Facebook, and 40M+ people using Google+. Regardless of how Google has fared in the competition for social networking mindshare, Google is clearly no stranger to online social behavior.

Ed then dove into recent research that he and colleagues have done on Twitter activity. Since all of the papers he discussed are available online, I will only touch on highlights and encourage you to read the full papers.

Ed talked at some length about language-dependent behavior on Twitter. For example, tweets in French are more likely to contain URLs than those in English, while tweets in Japanese are less likely (perhaps because the language is more compact relative to Twitter’s 140-character limit?). Tweets in Korean are far more likely to be conversational (i.e., explicitly mentioning or replying to other users) than those in English. These differences remind us to be cautious in generalizing our understanding of online social behavior from the behavior of English-speaking users. Ed also talked about cross-language “brokers” who tweet in multiple languages: he sees these as indicators of connection strength between languages, as well as a source of insight for improving cross-language communication.
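
As a flavor of this kind of analysis, here is a quick sketch computing the fraction of tweets containing URLs per language; the tweet schema ("lang" and "text" fields) is a hypothetical stand-in for whatever the actual dataset provided:

```python
from collections import defaultdict

def url_rate_by_language(tweets):
    counts, with_url = defaultdict(int), defaultdict(int)
    for t in tweets:
        counts[t["lang"]] += 1
        if "http" in t["text"]:
            with_url[t["lang"]] += 1
    return {lang: with_url[lang] / counts[lang] for lang in counts}

tweets = [
    {"lang": "fr", "text": "à lire: http://example.com"},
    {"lang": "en", "text": "good morning"},
    {"lang": "en", "text": "reading http://example.com"},
]
print(url_rate_by_language(tweets))  # {'fr': 1.0, 'en': 0.5}
```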

Ed then talked about ways to reduce information overload in social streams. These included Eddi, a tool for summarizing social streams, and zerozero88, a closed experiment to produce a personal newspaper from a tweet stream. In analyzing the results of the zerozero88 experiment, Ed and his colleagues found that the most successful recommendation strategy combined users’ self-voting with social voting by their friends of friends. They also found that users wanted both relevance and serendipity — a challenge since the two criteria often compete with one another.
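
Here is a minimal sketch of that winning strategy as I understand it: rank candidate URLs by blending the user’s own votes with votes from friends of friends. The function names, weights, and vote representation are all invented for illustration:

```python
def recommend(candidates, self_votes, fof_votes, w_self=0.5, w_social=0.5, k=10):
    """Rank candidate URLs by a weighted blend of the two vote sources."""
    def score(url):
        return w_self * self_votes.get(url, 0.0) + w_social * fof_votes.get(url, 0.0)
    return sorted(candidates, key=score, reverse=True)[:k]

candidates = ["a.com/post", "b.com/story", "c.com/article"]
print(recommend(candidates,
                self_votes={"a.com/post": 1.0},
                fof_votes={"b.com/story": 0.6, "a.com/post": 0.2}))
```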

Ed concluded by offering the following design rule: since interaction costs determine the number of people who participate in social activity, get more people into the system by reducing interaction cost. He asserted that this is a key design principle for Google+.

My skepticism about Google’s social efforts is a matter of public record (cf. Social Utility, +/- 25%, Google±?). But hiring Ed Chi was a real coup for Google, and I’m optimistic about what he’ll bring to the Google+ effort.

P.S. My thanks to Jeff Dalton for live-blogging his notes.


CIKM 2011 Industry Event: David Hawking on Search Problems and Solutions in Higher Education

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

One of the recurring themes at the CIKM 2011 Industry Event was that not all search is web search. Stephen Robertson, in making the case for why recall matters, noted that web search is exceptional rather than typical as an information retrieval domain. Khalid Al-Kofahi spoke about the challenges of legal search. Focusing on a different vertical, Funnelback Chief Scientist David Hawking spoke about “Search Problems and Solutions in Higher Education”.

David spent most of the presentation focusing on work that Funnelback did for the Australian National University. Funnelback was originally developed by CSIRO and the ANU under the name Panoptic.

The ANU has a substantial web presence, comprising hundreds of sites and over a million pages. Like many large sites, it suffers from propagation delay: the most important pages are fresh, but material on the outposts can be stale. Moreover, there is broad diversity of authorship.

The university also has a strong editorial stance for ranking search results: the search engine needs to identify and favor official content. Given the proliferation of unofficial content, it can be a challenge to identify official sites based on signals like incoming link count, click counts, and the use of official style templates.

David described a particular application that Funnelback developed for ANU: a university course finder. The problem is similar to that of ecommerce search and calls for similar solutions, e.g., faceted search, auto-complete, and suggestions of related queries. And, just as in ecommerce, we can evaluate performance in terms of conversion rate.
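
To illustrate the parallel, here is a minimal sketch of ecommerce-style faceted search applied to a course catalog; the facet names and course fields are invented for illustration:

```python
from collections import Counter

COURSES = [
    {"level": "undergrad", "department": "CS",   "semester": "S1"},
    {"level": "graduate",  "department": "CS",   "semester": "S2"},
    {"level": "undergrad", "department": "Math", "semester": "S1"},
]

def facet_counts(courses, facets=("level", "department", "semester")):
    """Per-facet value counts over the current result set, shown as refinements."""
    return {f: Counter(c[f] for c in courses) for f in facets}

def refine(courses, **selections):
    """Apply the user's facet selections as conjunctive filters."""
    return [c for c in courses if all(c[f] == v for f, v in selections.items())]

print(facet_counts(COURSES)["department"])               # Counter({'CS': 2, 'Math': 1})
print(refine(COURSES, level="undergrad", semester="S1"))  # the two matching courses
```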

David ended his talk by touching on expertise finding (a problem I think about a lot as a LinkedIn data scientist!) and showing demos. And, while I no longer work in enterprise search myself, I still appreciate its unique challenges. I’m glad that David and his colleagues are working to overcome those challenges, especially in a domain as important as education.

CIKM 2011 Industry Event: Ben Greene on Large Memory Computers for In-Memory Enterprise Applications

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

Large-scale computation was, not surprisingly, a major theme at the CIKM 2011 Industry Event. Ben Greene, Director of SAP Research Belfast, delivered a presentation on “Large Memory Computers for In-Memory Enterprise Applications”.

Ben started by defining in-memory computing as “technology that allows the processing of massive quantities of real time data in the main memory of the server to provide immediate results from analyses and transactions”. He then asked whether the cloud enables real-time computing, since there is a clear market hunger for cloud computing to solve the problems of our current enterprise systems.

Not surprisingly, he advocated in-memory computing as the solution to those problems. Like John Ousterhout and the RAMCloud team, he sees the need to scale DRAM independently from physical boxes. He proposed a model of coherent shared memory, using high-speed, low-latency networks and moving the data transport and cache layers into a separate tier below the operating system. The goal: no server-side application caches, DRAM-like latency for physically distributed databases, and in fact no separation between the application server and the database server.

Ben argued that coherent shared memory can dramatically lower the cost of in-memory computing while minimizing the pain for application developers. He also offered some benchmarks for SAP’s BigIron system to demonstrate the performance improvements.

In short, Ben offered a vision of in-memory computing as a reincarnation of the mainframe. It was an interesting and provocative presentation, and my only regret is that we couldn’t stage a debate between him and Jeff Hammerbacher over the future of large-scale enterprise computing.


CIKM 2011 Industry Event: Chavdar Botev on Databus: A System for Timeline-Consistent Low-Latency Change Capture

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

I’m of course delighted that one of my colleagues at LinkedIn was able to participate in the CIKM 2011 Industry Event. Principal software engineer Chavdar Botev delivered a presentation on “Databus: A System for Timeline-Consistent Low-Latency Change Capture”.

LinkedIn processes a massive amount of member data and activity. It has over 135M members and is growing by more than two new members per second. Based on recent measurements, those members are on track to perform more than four billion searches on the LinkedIn platform in 2011. All of this activity requires a data change capture mechanism that allows external systems, such as its graph index and its real-time full-text search index Zoie, to act as subscribers in user space and stay up to date with constantly changing data in the primary stores.

LinkedIn has built the Databus system to meet these needs. Databus meets four key requirements: timeline consistency, guaranteed delivery, low latency, and user-space visibility. For example, edits to member profile fields, such as companies and job titles, need to be standardized. Also, in order to let recruiters act quickly on feedback to their job postings, we need to be able to propagate changes to job descriptions in near real time.

Databus propagates data changes throughout LinkedIn’s architecture. When there is a change in a primary store (e.g., member profiles or connections), the changes are buffered in the Databus Relay through a push or pull interface. The relay can also capture the transactional semantics of updates. Clients poll for changes in the relay. If a client falls behind the stream of change events in the relay, it is redirected to a Bootstrap database that delivers a compressed delta of the changes since the last event seen by the client.
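
Here is a schematic sketch of that consumption pattern; the Relay class and function signatures below are hypothetical stand-ins, not LinkedIn’s actual API, and exist only to illustrate the relay/bootstrap catch-up protocol:

```python
class Relay:
    """Buffers recent change events, each tagged with an increasing sequence number (scn)."""
    def __init__(self, events, min_buffered_scn):
        self.events = events
        self.min_buffered_scn = min_buffered_scn

    def poll(self, since):
        if since < self.min_buffered_scn:
            return None  # client has fallen behind the relay's in-memory buffer
        return [e for e in self.events if e["scn"] > since]

def consume_once(relay, fetch_bootstrap_delta, apply_event, last_seen_scn):
    """One poll cycle: read new events from the relay, or catch up via a
    compressed delta from the Bootstrap service if we have fallen behind."""
    events = relay.poll(since=last_seen_scn)
    if events is None:
        events = fetch_bootstrap_delta(since=last_seen_scn)
    for event in events:  # events arrive in commit (timeline) order
        apply_event(event)
        last_seen_scn = event["scn"]
    return last_seen_scn

relay = Relay(events=[{"scn": 5, "change": "profile-edit"}], min_buffered_scn=3)
print(consume_once(relay, lambda since: [], print, last_seen_scn=4))  # applies event, returns 5
```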

In contrast to generic messaging systems (including the Kafka system that LinkedIn has open-sourced through Apache), Databus has more insight into the structure of the messages, and can thus do better than just guaranteeing message-level integrity and transactional semantics for communication sessions.

I tend to live a few levels above core infrastructure, but I’m grateful that Chavdar and his colleagues build the core platform that makes all of our large-scale data collection possible. After all, without data we have no data science.


CIKM 2011 Industry Event: Khalid Al-Kofahi on Combining Advanced Search Technology and Human Expertise in Legal Research

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The original program for the CIKM 2011 Industry Event featured Peter Jackson, who was chief scientist at Thomson Reuters and the author of numerous books and papers on natural language processing. Sadly, Peter died on August 3, 2011. Khalid Al-Kofahi, VP of Research at Thomson Reuters R&D, graciously agreed to speak in his place, delivering a presentation on “Combining Advanced Search Technology and Human Expertise in Legal Research”.

Khalid began by giving an “83-second” overview of the US legal system, laying out the roles of the law, the courts, and the legislature. He did so to provide context for the domain that Thomson Reuters serves — namely, legal information. Legal information providers curate legal information, enhance it editorially and algorithmically, and work to make legal information findable and explainable in particular task contexts. He then worked through an example of how a case law document (specifically, Burger King v. Rudzewicz) appears in WestlawNext, with annotations that include headnotes, topic codes, citation data, and historical context.

Channelling William Goffman, Khalid asserted that a document’s content (words, phrases, metadata) is not sufficient to determine its aboutness and importance. Rather, we also have to consider what other people say about the document and how they interact with it. This is especially true in the legal domain because of the precedential nature of law. He then framed legal search in terms of information retrieval metrics, stating the requirements as completeness (recall), accuracy (precision), and authority. Not surprisingly, Khalid agreed with Stephen Robertson’s emphasis on the importance of recall.
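
A toy illustration of that framing: a document’s overall score blends what it says, what others say about it, and how users interact with it. The component scores and weights below are invented, not Thomson Reuters’ model:

```python
def aboutness_score(content_score, citation_score, usage_score, w=(0.5, 0.3, 0.2)):
    """Each component score is assumed to be normalized to [0, 1]."""
    return w[0] * content_score + w[1] * citation_score + w[2] * usage_score

print(aboutness_score(content_score=0.7, citation_score=0.9, usage_score=0.4))  # 0.7
```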

Speaking more generally, Khalid noted that vertical search is not just about search. Rather, it’s about findability, which includes navigation, recommendations, clustering, faceted classification, collaboration, etc. Most importantly, it’s about satisfying a set of well-understood tasks. And, particularly in the legal domain, customers demand explainable models. Beyond meeting this demand, explainability serves an additional purpose: it enables the human searcher to add value to the process (cf. human-computer information retrieval).

It is sad to lose a great researcher like Peter Jackson from our ranks, but I am grateful that Khalid was able to honor his memory by presenting their joint work at CIKM. If you’d like to learn more, I encourage you to read the publications on the Thomson Reuters Labs page.