The Noisy Channel


Enjoying Seattle’s Best: UW, WSDM, and SSS

February 12th, 2012 by Daniel Tunkelang

My excursion to Seattle was delightful, and I thought I’d share some details with readers.

I spent most of Friday at the University of Washington, meeting with graduating PhD students.  I’ve always known that UW is a top school, but I was particularly impressed with this batch. I was pleasantly surprised to see folks like Nodira Khoussainova and Kayur Patel working to bring together the often disparate worlds of databases, machine learning, and HCI in order to make people more effective at solving “big data” problems. I realize that I’m aiding and abetting other employers with whom I compete for top talent, but it would be wrong not to encourage everyone to find worthy challenges for these budding scientists.

I then went to the Space Needle to meet up with the WSDM 2012 crowd. Jaime Teevan and Eytan Adar outdid themselves, providing a great setting for folks to mingle, imbibe, and enjoy a spectacular view of Seattle.

Saturday I attended the “social” day of the WSDM conference.

Andrew Tomkins chaired the first morning session, which included Hila Becker’s latest work on identifying event content in social media and Georgios Zervas presenting the work on analyzing the reputational effects of Groupon that triggered quite a controversy last September. After the break came the spotlight section: a great sequence of 5-minute presentations in which researchers both summarized their contributions and lured attendees to visit their posters. I hope that more conferences adopt this format, which optimizes for communicating ideas and discourages long-winded expositions.

I then had the pleasure of having lunch with Jan Pedersen and friends at Blueacre Seafood, with great food and even better conversation. We both noted the irony that, even though we are practically neighbors, we only seem to meet up at events like these.

I made it back to the conference in time to hear the two best-paper awardees: Adam Sadilek on “Finding Your Friends and Following Them to Where You Are” and Yaron Singer on “How to Win Friends and Influence People, Truthfully: Influence Maximization Mechanisms for Social Networks”. I highly recommend both papers, especially if you are interested in either social network prediction or the underlying economics of influence.

Another coffee break, and then the keynote: Hilary Mason on “The Secret Life of Social Links”. Hilary is a great speaker — I first met her when I invited her to the Workshop on Search and Social Media (SSM 2010) at WSDM 2010. She didn’t disappoint, and it’s great to see practitioners like her crossing the aisle to engage the academic community. Not to mention infusing their slides with lolcats.

The conference wrapped up at 5pm, but then we bussed over to Microsoft Research for the Social Search Social. That was a fun event designed to cross-pollinate the WSDM and CSCW communities. Meredith Ringel Morris, Gene Golovchinsky, Jeremy Pickens, Madhu Reddy, Chirag Shah, and Michael Twidale put together a great program of 45-second madness presentations and “speed-dating” to pair up WSDM and CSCW attendees. It was far too short, but a lot of fun. And some of us kept up the social spirit by grabbing dinner afterward at Blue C Sushi.

To everyone I met in the last couple of days: thanks for the great company and conversation! Keep sharing ideas and making data and science social.


Social Wisdom in Seattle

February 4th, 2012 by Daniel Tunkelang


First, I wanted to give readers a heads up that I’ll be in Seattle this Friday and Saturday. I’ll spend Friday afternoon at the University of Washington, meeting with some of their outstanding computer science doctoral students. My schedule filled up with unexpected haste! But if you’re on campus and urgently want to meet, let me know and I’ll see what I can do.

Saturday I’ll be attending the social track of WSDM 2012, the premier international ACM conference covering research in the areas of search and data mining on the Web. I’m excited about the program, as well as the opportunity to catch up with friends and make new ones. Back in 2010, I had the pleasure of co-organizing the Workshop on Search and Social Media (SSM 2010) and being the official ACM blogger for WSDM 2010. You can read my posts here.

Then, on Saturday evening, I’ll be heading to Microsoft Research to attend the Social Search Social (SSS 2012). Hats off to organizers Meredith Ringel Morris, Gene Golovchinsky, Jeremy Pickens, Madhu Reddy, Chirag Shah, and Michael Twidale for creating what looks to be a fun (and very social!) event. I’m especially looking forward to the 45-second “madness” presentations (in which I’m participating) and the “speed dating” to help cross-pollinate the WSDM and CSCW communities.

Hope to see some of you there, and of course will share what I learn here at The Noisy Channel. I also encourage you to follow the tweet streams for #wsdm2012 and #sss2012.


LinkedIn @ CMU

January 26th, 2012 by Daniel Tunkelang

As regular readers know, I have a deep affection for Carnegie Mellon University, where I did my graduate work. I’m happy to announce that two of my colleagues (both fellow CMU PhDs) will be giving talks at CMU in a couple of weeks, and I hope that some of you will have the opportunity to attend.

On Tuesday, February 7th, Abhimanyu Lad will be hosting an information session at 6pm in Scaife Hall, Room 214. Abhi is a rock star on our data science team, and he’s been working on the next generation of LinkedIn search. You can get a taste of his work from his recent HCIR 2011 presentation, “Is it Time to Abandon Abandonment?”. Abhi will talk about a variety of technical challenges that data scientists and engineers are working on at LinkedIn.

On Thursday, February 9th, Paul Ogilvie will talk about “Where Big Data Meets Real-Time: Efficiently Indexing and Ranking News using Activity” at 3:30pm in GHC 6115. Paul is responsible for article relevance infrastructure and algorithms on LinkedIn Today, a great example of social navigation — not to mention a great success for users. Paul will talk about the technical details that make LinkedIn Today possible, including a novel use of inverted lists to efficiently index and support real-time updates to document representations.

And, even if you can’t make it to the talks, I encourage you to visit the LinkedIn booth at the EOC fair on Wednesday, February 8th. We’re looking for great software engineers and data scientists, and we’re especially interested in interns.

I hope that CMU students and faculty will take the time to meet Abhi, Paul, and their colleagues when they visit in a couple of weeks.


Thoughts about Job Performance

January 22nd, 2012 by Daniel Tunkelang

This is the season of annual reviews, at least at LinkedIn. Performance reviews can be daunting for both employees and managers — at least everywhere that I’ve worked. Not only are we as human beings terrible at delivering feedback, but we also receive bad advice as managers.

For example, many of us have learned the “feedback sandwich” method, a technique that doesn’t hold up to scientific validation. Watch the video below to see what Stanford professor Clifford Nass has learned from his experiments (see my review of his book here).

Here is what I suggest as a format for performance feedback, whether for writing your own self-assessment or delivering feedback to reports or peers on their performance:

1) What is your day job?

Everyone needs a day job — a mission with a crisp set of responsibilities and deliverables. If you don’t know what you’re responsible for delivering, you can’t assess how well you are delivering it. You should know and articulate your top priorities — at most three, with a clear #1. For further reading, I suggest the Quora discussion on OKRs (Objectives and Key Results), an idea pioneered by Intel and now used at top technology companies (including LinkedIn and Google).

2) How are you performing in your day job?

Hopefully you make more contributions than you can count. But make sure that your day job comes first. If you find that a disproportionate fraction of your contribution is outside your day job, then consider changing your day job. Your top priority is to meet (hopefully exceed!) the expectations for your day job — expectations you should set early and revisit regularly. Performance reviews are a great opportunity to brag.

3) What do you do beyond your day job?

Your day job should be strongly aligned with your team and company’s top priorities. But great employees contribute beyond their day job towards other team and company priorities. For example, talent is our top priority at LinkedIn, so we particularly value contributions to hiring and growing our talent. And, at least in every environment I’ve experienced, the best employees are those who help make others successful.

4) How do you want to grow?

This is really a two-part question. First, what do you want to do next? That could mean getting better at your day job, evolving your current responsibilities, or taking on a different role. Second, what are you doing to get there? You are ultimately responsible for your own professional development. But one of your manager’s top responsibilities is to help you identify and advance along the path that is best for you. And performance reviews are a great opportunity to make you think about the future.

Regardless of how your company manages performance, these are the key questions you should think about. Performance feedback is a great opportunity to focus on professional development, both your own and that of the people you work with every day. Make the most of it!


Are You Hitched?

January 20th, 2012 by Daniel Tunkelang

Let me preface this post by saying that this is my personal blog, and that my opinions here are not necessarily those of my employer.

With that out of the way, I love the premise of a dating site for professionals based on LinkedIn. I won’t confirm or deny the number of my colleagues who have thought about building a dating site based on our data, but it’s great to see someone using our APIs to do so. And the marketing video, while not exactly politically correct, is brilliant.

Yet another reason to work as a data scientist at LinkedIn!


Guided Exploration = Faceted Search, Backwards

January 17th, 2012 by Daniel Tunkelang

Information Scent

In the early 1990s, PARC researchers Peter Pirolli and Stuart Card developed the theory of information scent (more generally, information foraging) to evaluate user interfaces in terms of how well users can predict which paths will lead them to useful information. Like many HCIR researchers and practitioners, I’ve found this model to be a useful way to think about interactive information seeking systems.

Specifically, faceted search is an exemplary application of the theory of information scent. Faceted search allows users to express an information need as a keyword search, providing them with a series of opportunities to improve the precision of the initial result set by restricting it to results associated with particular facet values.

For example, if I’m looking for folks to hire for my team, I can start my search on LinkedIn with the keywords [information retrieval], restrict my results to Location: San Francisco Bay Area, and then further restrict to School: CMU.
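To make the narrowing flow concrete, here is a minimal Python sketch of that search. Everything in it is an assumption made for illustration: the toy profiles, the field names, and the helpers `keyword_search`, `refine`, and `facet_counts` are hypothetical and say nothing about how LinkedIn search is actually built.

```python
from collections import Counter

# Hypothetical toy profiles; names and fields are made up for illustration.
PROFILES = [
    {"name": "A", "text": "information retrieval and ranking",
     "location": "San Francisco Bay Area", "school": "CMU"},
    {"name": "B", "text": "information retrieval research",
     "location": "San Francisco Bay Area", "school": "MIT"},
    {"name": "C", "text": "information retrieval systems",
     "location": "Greater Seattle Area", "school": "CMU"},
    {"name": "D", "text": "distributed databases",
     "location": "San Francisco Bay Area", "school": "CMU"},
]


def keyword_search(profiles, keywords):
    """Keep profiles whose text contains every keyword (case-insensitive)."""
    terms = [k.lower() for k in keywords]
    return [p for p in profiles if all(t in p["text"].lower() for t in terms)]


def refine(results, facet, value):
    """Restrict a result set to the results carrying one facet value."""
    return [p for p in results if p[facet] == value]


def facet_counts(results, facet):
    """Count facet values in the current results (the refinement options)."""
    return Counter(p[facet] for p in results)


# Start broad with keywords, then narrow one facet value at a time.
results = keyword_search(PROFILES, ["information", "retrieval"])
bay_area = refine(results, "location", "San Francisco Bay Area")
bay_area_cmu = refine(bay_area, "school", "CMU")
```

Each call to `refine` can only shrink the result set, and `facet_counts` is what gives the user a preview of where each refinement leads.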

Precision / Recall Asymmetry

Faceted search is a great tool for information seeking systems. But it offers a flow that is asymmetric with respect to precision and recall.

Let’s invert the flow of faceted search. Rather than starting from a large, imprecise result set and progressively narrowing it, let’s start from a small, precise result set and progressively expand it. Since faceted search is often called “guided navigation” (a term Fritz Knabe and I coined at Endeca), let’s call this approach “guided exploration” (which has a nicer ring than “guided expansion”).

Guided exploration exchanges the roles of precision and recall. Faceted search starts with high recall and helps users increase precision while preserving as much recall as possible. In contrast, guided exploration starts with high precision and helps users increase recall while preserving as much precision as possible.
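For readers who like to see the two measures side by side, here is a small sketch using the standard definitions of precision and recall over sets. The document ids and the set of relevant results are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: results 1 through 4 are the relevant ones.
relevant = {1, 2, 3, 4}

# Faceted search starts broad: everything relevant is there, plus noise.
broad = set(range(1, 11))

# Guided exploration starts narrow: nothing but relevant results, though few.
narrow = {1, 2}

# Both flows are trying to reach something like this middle ground.
middle = {1, 2, 3, 5}

broad_pr = precision_recall(broad, relevant)
narrow_pr = precision_recall(narrow, relevant)
middle_pr = precision_recall(middle, relevant)
```

Narrowing moves from `broad` toward `middle` by trading recall for precision; expanding moves from `narrow` toward `middle` by trading precision for recall.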

That sounds great in theory, but how can we implement guided exploration in practice?

Let’s remind ourselves why faceted search works so well. Faceted search offers the user information scent: the facet values help the user identify regions of higher precision relative to his or her information need. By selecting a sequence of facet values, the user arrives at a non-empty set that consists entirely or mostly of relevant results.

How to Expand a Result Set

How do we invert this flow? Just as enlarging an image is more complicated than reducing one, increasing recall is more complicated than increasing precision.

If our initial set is the result of selecting multiple facet values, then we may be able to increase recall by de-selecting facet values (e.g., de-selecting San Francisco Bay Area and CMU in my previous example). If we are using hierarchical facets, then rather than de-selecting a facet value, we may be able to replace it with a parent value (e.g., replacing San Francisco Bay Area with California). We can also remove one or more search keywords to broaden the results (e.g., information or retrieval).
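These relaxations are mechanical enough to enumerate. Below is a minimal sketch, assuming a query is just a tuple of keywords plus a tuple of facet selections; the `Query` class, the `PARENT` hierarchy, and the `relaxations` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical facet hierarchy: child value -> parent value.
PARENT = {"San Francisco Bay Area": "California"}


@dataclass(frozen=True)
class Query:
    keywords: tuple  # e.g. ("information", "retrieval")
    facets: tuple    # e.g. (("Location", "San Francisco Bay Area"),)


def relaxations(query):
    """Return (description, relaxed query) pairs, each broader than `query`."""
    options = []
    for i, (facet, value) in enumerate(query.facets):
        before, after = query.facets[:i], query.facets[i + 1:]
        # De-select the facet value entirely.
        options.append((f"de-select {facet}: {value}",
                        Query(query.keywords, before + after)))
        # Replace the value with its parent, when the hierarchy has one.
        if value in PARENT:
            widened = before + ((facet, PARENT[value]),) + after
            options.append((f"replace {value} with {PARENT[value]}",
                            Query(query.keywords, widened)))
    for i, keyword in enumerate(query.keywords):
        # Drop a single search keyword.
        remaining = query.keywords[:i] + query.keywords[i + 1:]
        options.append((f"remove keyword {keyword}",
                        Query(remaining, query.facets)))
    return options


q = Query(("information", "retrieval"),
          (("Location", "San Francisco Bay Area"), ("School", "CMU")))
options = relaxations(q)
```

For the example query this yields five options: two for Location (de-select, or widen to California), one for School, and one per keyword.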

Those are straightforward query relaxations. But there are more interesting ways to expand our results:

  • We can replace a facet value with the union of that value and similar values (e.g., replacing CMU with CMU OR MIT).
  • We can replace the entire query (or any subquery) with a union of that query and the results for selecting a single facet value (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR Company: Google).
  • We can replace the entire query (or any subquery) with a union of that query and the results for a keyword search (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR [faceted search]).
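The three union-based expansions above can be sketched with plain set operations. The tiny index below is entirely made up (the ids and the condition labels are assumptions), and Python sets stand in for whatever posting-list machinery a real engine would use.

```python
# Hypothetical index: condition -> ids of matching profiles.
INDEX = {
    "[information retrieval]": {1, 2, 3, 4},
    "[faceted search]": {3, 7},
    "Location: San Francisco Bay Area": {1, 2, 5},
    "School: CMU": {1, 4, 5},
    "School: MIT": {2, 6},
    "Company: Google": {1, 6, 8},
}


def all_of(*conditions):
    """AND: ids matching every condition."""
    return set.intersection(*(INDEX[c] for c in conditions))


def any_of(*result_sets):
    """OR: ids present in at least one result set."""
    return set().union(*result_sets)


base = all_of("[information retrieval]",
              "Location: San Francisco Bay Area",
              "School: CMU")

# 1. Replace School: CMU with School: (CMU OR MIT).
schools = any_of(INDEX["School: CMU"], INDEX["School: MIT"])
with_similar_school = all_of("[information retrieval]",
                             "Location: San Francisco Bay Area") & schools

# 2. Union the whole query with a single facet value selection.
with_company = any_of(base, INDEX["Company: Google"])

# 3. Union the whole query with another keyword search.
with_keywords = any_of(base, INDEX["[faceted search]"])
```

Every expansion is a superset of `base`, which is why unions never lead to dead ends; what the sketch does not answer is which superset keeps the most precision.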

As we can see, there are many ways to progressively refine a query in a way that expands the result set. The question is how we provide users with options that increase recall while preserving as much precision as possible.

Frequency : Recall :: Similarity : Precision

Developers of faceted search systems don’t necessarily invest much thought into deciding which faceted refinement options to present to users. Some systems simply avoid dead ends, offering users all refinement options that lead to a non-empty result set. This approach breaks down when there are too many options, in which case most systems offer users the most frequent facet values. A chapter in my faceted search book discusses some other options.

Unfortunately, the number of options for guided exploration — at least if we go beyond the very limited basic options — is too vast to apply such a naive approach. Unions never lead to dead ends, and we don’t have a simple measure like frequency to rank our options.

Or perhaps we do. A good reason to favor frequent values as faceted refinement options is that they tend to preserve recall. What we need is a measure that tends to preserve precision when we expand a result set.

That measure is set similarity. More specifically, it is the asymmetric similarity between a set and a superset containing it, which we can think of as the former’s representativeness of the latter. If we are working with facets, we can measure this similarity in terms of differences between distributions of the facet values. If the current set has high precision, we should favor supersets that are similar to it in order to preserve precision.

I’ll spare readers the math, but I encourage you to read about Kullback-Leibler divergence and Jensen-Shannon divergence if you are not familiar with measures of similarity between probability distributions. I’m also glossing over key implementation details, such as how to model distributions of facet values as probability distributions, and how to handle smoothing and normalization for set size. I’ll try to cover these in future posts. But for now, let’s assume that we can measure the similarity between a set and a superset.
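For the impatient, here is one possible way to compute such a measure. It is a sketch under stated assumptions, not a settled design: it uses additive smoothing with a parameter `alpha` that I picked arbitrarily, it compares distributions over a single facet, and it ignores normalization for set size. The company names are invented.

```python
import math
from collections import Counter


def distribution(values, vocabulary, alpha=1.0):
    """Smoothed probability of each facet value (additive smoothing, assumed)."""
    counts = Counter(values)
    total = len(values) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; asymmetric in p and q."""
    return sum(p[v] * math.log2(p[v] / q[v]) for v in p if p[v] > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded by 1."""
    m = {v: 0.5 * (p[v] + q[v]) for v in p}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical "current company" values for a set and two candidate supersets.
current = ["Google", "Google", "LinkedIn"]
similar = current + ["Google", "LinkedIn"]
drifted = current + ["Acme", "Acme", "Acme", "Acme"]

vocabulary = sorted(set(current + similar + drifted))
p_current = distribution(current, vocabulary)
d_similar = js_divergence(p_current, distribution(similar, vocabulary))
d_drifted = js_divergence(p_current, distribution(drifted, vocabulary))
```

A lower divergence means the superset looks more like the set we started from, so `similar` would be preferred over `drifted` here. If you want the asymmetric flavor described above, `kl_divergence` of the current set against the superset is one candidate.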

Guided Exploration: A General Framework

We now have the elements to put together a general framework for guided exploration:

  • Generate a set of candidate expansion options from the current search query using operations such as the following:
    • De-select a facet value.
    • Replace a facet value with its parent.
    • Replace a facet value with the union of it and other values from that facet.
    • Remove a search keyword.
    • Replace a search keyword with the union of it and related keywords.
    • Replace the entire query with the union of it and a related facet value selection.
    • Replace the entire query with the union of it and a related keyword search.
  • Evaluate each expansion option based on the similarity of the resulting set to the current one.
  • Present the most similar sets to the user as expansion options.
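The framework above can be sketched end to end on toy data. To keep it short, the sketch below implements only one candidate operation (replacing a facet value with the union of it and one other value from the same facet) and scores candidates with Jensen-Shannon divergence over a second facet. The records, the smoothing, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy records: (school, current company).
RECORDS = [
    ("CMU", "Google"), ("CMU", "Google"), ("CMU", "LinkedIn"),
    ("MIT", "Google"), ("MIT", "LinkedIn"),
    ("UW", "Acme"), ("UW", "Acme"), ("UW", "Acme"),
]


def companies_for(schools):
    """Company values of all records whose school is in `schools`."""
    return [company for school, company in RECORDS if school in schools]


def smoothed(values, vocabulary, alpha=1.0):
    """Additively smoothed distribution over `vocabulary` (alpha is assumed)."""
    counts = Counter(values)
    total = len(values) + alpha * len(vocabulary)
    return [(counts[v] + alpha) / total for v in vocabulary]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two aligned distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def expansion_options(current_schools, top_k=2):
    """Rank school unions by how little they shift the company distribution."""
    vocabulary = sorted({company for _, company in RECORDS})
    current = smoothed(companies_for(current_schools), vocabulary)
    scored = []
    # Step 1: generate candidates by adding one more school to the selection.
    for school in sorted({s for s, _ in RECORDS} - set(current_schools)):
        expanded = set(current_schools) | {school}
        candidate = smoothed(companies_for(expanded), vocabulary)
        # Step 2: score each candidate by divergence from the current set.
        scored.append((js_divergence(current, candidate), sorted(expanded)))
    # Step 3: present the most similar (lowest divergence) candidates first.
    scored.sort()
    return [schools for _, schools in scored[:top_k]]


options = expansion_options({"CMU"})
```

On this toy data, adding MIT ranks ahead of adding UW because the MIT records work at the same companies as the CMU ones. A fuller version would generate candidates from all seven operations and would also need to weigh how much each candidate actually grows the result set.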

Visualizing Drift

It’s one thing to tell a user that two sets are distributionally similar based on an information-theoretic measure, and another to communicate that similarity in a language the user can understand. Here is an example of visualizing the similarity between [information retrieval] AND School: CMU and [information retrieval] AND School: (CMU OR MIT):

As we can see from even this basic visualization, replacing CMU with (CMU OR MIT) increases the number of results by 70% while keeping a similar distribution of current companies — the notable exception being people who work for their almae matres.


Faceted search offers some of the most convincing evidence in favor of Gary Marchionini’s advocacy that we “empower people to explore large-scale information bases but demand that people also take responsibility for this control”. Guided exploration aims to generalize the value proposition of faceted search by inverting the roles of precision and recall. Given the importance of recall, I hope to see progress in this direction. If this is a topic that interests you, give me a shout. Especially if you’re a student looking for an internship this summer!


Next Play!

January 1st, 2012 by Daniel Tunkelang

Every year brings its own adventures, but for me 2011 will be a tough act to follow.

A year ago, I’d just started working at LinkedIn, and my biggest concern was selling our apartment in Brooklyn so that my family could join me in California.

Little did I imagine that my new manager, who had just recruited me from Google to LinkedIn (and persuaded my family to change coasts!), would leave three months later for a startup. Welcome to Silicon Valley! At the time, I felt unready for the abrupt transition into the product executive team. In retrospect, I’m thankful for the kick in the pants that helped me transform my role and brought the best out of a great team.

Summer brought the excitement of LinkedIn’s IPO. The process was exhilarating, especially to someone who had worked for over a decade at a pre-IPO company.

Nonetheless, we didn’t let the IPO distract us from our mission. In March, we celebrated our 100 millionth member; by November, we passed 135 million. And lots more. We released new data products like Skills and Alumni. We won the OSCON Data Innovation Award for contributions to the open source software for big data. We also acquired a few companies, including search engine startup IndexTank. In short, we heeded the two short words on the back of our commemorative IPO t-shirts: “next play”.

Fall was an intense season of conferences. Between CIKM, HCIR, RecSys, Strata, and a talk at CMU, it was a great opportunity to connect and reconnect with researchers and practitioners around the world. I am particularly proud of the success of this year’s HCIR workshop, which showed how much the workshop (now to become a 2-day symposium!) has grown up in five years.

But what capped my year off was seeing Endeca, the company I helped start in 1999, become one of Oracle’s largest acquisitions. Even though it’s been two years since I left, Endeca will always be a core facet of my professional identity. I look forward to great things from all the folks I worked with.

That brings us to 2012, ready to start a new year of adventures. Tough or not, our job is to make every new year more amazing than the previous ones. I’m ready for the challenge, and I hope you are too.

Here’s a teaser of what I have planned:

  • My team at LinkedIn is launching into 2012 with a strong focus on derived data quality and relevance. As regular readers know, I see data quality and richer interfaces for information seeking as inseparable concerns. And, speaking of quality, we’re hiring!
  • I’ll be speaking at Strata in a couple of months with Claire Hunsaker of Samasource about “Humans, Machines, and the Dimensions of Microwork”. I’m very excited to talk about the intersection of crowdsourcing and data science. And I’ll be joined by three of LinkedIn’s top data scientists: Monica Rogati, Sam Shah, and Pete Skomoroch.
  • I’ll be co-chairing the RecSys Industry Track this fall with Yehuda Koren. I’m honored to have the opportunity to work with Yehuda, who was part of the Netflix Grand Prize team and won best paper at RecSys 2011. We’re still putting together the program, but you can look at last year’s program to get an idea of what’s in store.
  • I’ll be at the CIKM Industry Event, this time as an invited speaker. CIKM will take place in Maui this fall, and I’m excited about the program that Evgeniy Gabrilovich is putting together for the Industry Event. It will be an all-invited program, just like last year.

I hope you’re also starting 2012 with a fresh sense of purpose. Let’s take a last moment to reflect on a great 2011, and then…NEXT PLAY!


HCIR 2011: Now on YouTube!

December 17th, 2011 by Daniel Tunkelang

The Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011), held on October 20th at Google’s main campus in Mountain View, California, was a resounding success. We had almost a hundred people, presenting a wide array of papers, posters, and challenge entries. You can read my summary of the event in an earlier blog post: “HCIR 2011: We Have Arrived!”.

Better yet, you can now, for the first time in the workshop’s history, watch videos of the presentations. Embedded below are videos of Gary Marchionini’s keynote address and of the two paper presentation sessions. Thanks again to Google for being such a gracious host — now online as well as offline!


Morning Presentations

Afternoon Presentations


Jim Adler: The Accidental Chief Privacy Officer

December 4th, 2011 by Daniel Tunkelang

Privacy is the third rail of the cloud. On one hand, the ease of sharing information and the power of analytics have produced extraordinary value for consumers, as well as great business models for companies that serve those consumers. On the other hand, people have good reason to worry about the unintended consequences of over-sharing.

When I attended the O’Reilly Strata New York Conference in September, I had the pleasure of meeting Intelius’s Jim Adler and hearing him talk about being his company’s “accidental chief privacy officer”. Intelius’s main product is people search, an area that naturally brings up privacy concerns, especially since Intelius aggregates and publishes information about people from databases of public records, eroding a history of “privacy through difficulty”. Impressed with Jim’s talk at Strata, I persuaded him to deliver a similar talk at LinkedIn, the video of which you can find above. You can also find his slides on SlideShare.

Jim brings to the discussion of privacy a nuance that discussions of online privacy often lack. For example, he responded to the recent controversy about social networks’ “real names” policy with a measured post entitled “Nyms, Pseudonyms, or Anonyms? All of the Above”.

Jim appropriately opened his talk by disclosing a personal example. He shares his name with a more prominent personal injury lawyer who dominates search results for that name, raising the potential of taint by association. Intelius’s core technical problem is to cluster inputs from the sources it aggregates, thus mapping each person to exactly one record in its database.

Jim went on to note that we are at a stage in the privacy debate where we are likely to see more regulation. He makes a few key observations:

  • Social norms, which form the basis of our laws and regulations (the notion of a “reasonable expectation of privacy”), have changed suddenly, leading to a “privacy vertigo” where the whole world now feels like a small town.
  • Sharing is a gateway from private to public, which often leads to violation of expectations. This problem is not new, but the efficiency of online sharing dramatically amplifies the unintended consequences of sharing. It is crucial that the parties involved in sharing data also have shared expectations around how that data will be used or disclosed.
  • We need to distinguish between data use and data access, and not try to regulate data use with data access regulations. He cites the Fair Credit Reporting Act as one of the most inspired laws of the last 40 years to regulate data use. If you don’t have time to listen to the whole talk, I recommend you jump to 25:12, where he discusses this law in detail.

There’s a lot more in the talk, so I’m not going to try to summarize it all here. I strongly encourage you to check out the video (which includes lengthy Q&A) and the slides. Better yet, let’s use the comments to discuss!


CIKM 2011 Industry Event: Slides and Summaries

November 27th, 2011 by Daniel Tunkelang

I’ve posted slides and summaries for all ten CIKM 2011 Industry Event presentations:

