Categories
General

LinkedIn @ CMU

As regular readers know, I have a deep affection for Carnegie Mellon University, where I did my graduate work. I’m happy to announce that two of my colleagues (both fellow CMU PhDs) will be giving talks at CMU in a couple of weeks, and I hope that some of you will have the opportunity to attend.

On Tuesday, February 7th, Abhimanyu Lad will be hosting an information session at 6pm in Scaife Hall, Room 214. Abhi is a rock star on our data science team, and he’s been working on the next generation of LinkedIn search. You can get a taste of his work from his recent HCIR 2011 presentation, “Is it Time to Abandon Abandonment?”. Abhi will talk about a variety of technical challenges that data scientists and engineers are working on at LinkedIn.

On Thursday, February 9th, Paul Ogilvie will talk about “Where Big Data Meets Real-Time: Efficiently Indexing and Ranking News using Activity” at 3:30pm in GHC 6115. Paul is responsible for article relevance infrastructure and algorithms on LinkedIn Today, a great example of social navigation — not to mention a great success for users. Paul will talk about the technical details that make LinkedIn Today possible, including a novel use of inverted lists to efficiently index and support real-time updates to document representations.

And, even if you can’t make it to the talks, I encourage you to visit the LinkedIn booth at the EOC fair on Wednesday, February 8th. We’re looking for great software engineers and data scientists, and we’re especially interested in interns.

I hope that CMU students and faculty will take the time to meet Abhi, Paul, and their colleagues when they visit in a couple of weeks.

Thoughts about Job Performance

This is the season of annual reviews, at least at LinkedIn. Performance reviews can be daunting for both employees and managers — at least everywhere that I’ve worked. Not only are we as human beings terrible at delivering feedback, but we also receive bad advice as managers.

For example, many of us have learned the “feedback sandwich” method, a technique that doesn’t hold up to scientific validation. Watch the video below to see what Stanford professor Clifford Nass has learned from his experiments (see my review of his book here).

Here is what I suggest as a format for performance feedback, whether for writing your own self-assessment or delivering feedback to reports or peers on their performance:

1) What is your day job?

Everyone needs a day job — a mission with a crisp set of responsibilities and deliverables. If you don’t know what you’re responsible for delivering, you can’t assess how well you are delivering it. You should know and articulate your top priorities — at most three, with a clear #1. For further reading, I suggest the Quora discussion on OKRs (Objectives and Key Results), an idea pioneered by Intel and now used at top technology companies (including LinkedIn and Google).

2) How are you performing in your day job?

Hopefully you make more contributions than you can count. But make sure that your day job comes first. If you find that a disproportionate fraction of your contribution is outside your day job, then consider changing your day job. Your top priority is to meet (hopefully exceed!) the expectations for your day job — expectations you should set early and revisit regularly. Performance reviews are a great opportunity to brag.

3) What do you do beyond your day job?

Your day job should be strongly aligned with your team and company’s top priorities. But great employees contribute beyond their day job towards other team and company priorities. For example, talent is our top priority at LinkedIn, so we particularly value contributions to hiring and growing our talent. And, at least in every environment I’ve experienced, the best employees are those who help make others successful.

4) How do you want to grow?

This is really a two-part question. First, what do you want to do next? That could mean getting better at your day job, evolving your current responsibilities, or taking on a different role. Second, what are you doing to get there? You are ultimately responsible for your own professional development. But one of your manager’s top responsibilities is to help you identify and advance along the path that is best for you. And performance reviews are a great opportunity to make you think about the future.

Regardless of how your company manages performance, these are the key questions you should think about. Performance feedback is a great opportunity to focus on professional development — your own and that of the people you work with every day. Make the most of it!


Are You Hitched?

Let me preface this post by saying that this is my personal blog, and that my opinions here are not necessarily those of my employer.

With that out of the way, I love the premise of Hitch.me: a dating site for professionals based on LinkedIn. I won’t confirm or deny the number of my colleagues who have thought about building a dating site based on our data, but it’s great to see someone using our APIs to do so. And the marketing video, while not exactly politically correct, is brilliant.

Yet another reason to work as a data scientist at LinkedIn!


Guided Exploration = Faceted Search, Backwards

Information Scent

In the early 1990s, PARC researchers Peter Pirolli and Stuart Card developed the theory of information scent (more generally, information foraging) to evaluate user interfaces in terms of how well users can predict which paths will lead them to useful information. Like many HCIR researchers and practitioners, I’ve found this model to be a useful way to think about interactive information seeking systems.

Specifically, faceted search is an exemplary application of the theory of information scent. Faceted search allows users to express an information need as a keyword search, providing them with a series of opportunities to improve the precision of the initial result set by restricting it to results associated with particular facet values.

For example, if I’m looking for folks to hire for my team, I can start my search on LinkedIn with the keywords [information retrieval], restrict my results to Location: San Francisco Bay Area, and then further restrict to School: CMU.
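The narrowing flow above can be sketched in a few lines of code. This is a minimal illustration over a toy in-memory dataset with made-up field names — not LinkedIn's actual search API:

```python
# A toy dataset and a restrict() helper illustrating faceted refinement.
# Field names and records are illustrative assumptions.
profiles = [
    {"summary": "information retrieval", "location": "San Francisco Bay Area", "school": "CMU"},
    {"summary": "information retrieval", "location": "San Francisco Bay Area", "school": "MIT"},
    {"summary": "information retrieval", "location": "New York City", "school": "CMU"},
]

def restrict(results, facet, value):
    """Narrow a result set to entries matching a facet value."""
    return [r for r in results if r[facet] == value]

# Start broad with a keyword match, then restrict facet by facet.
hits = [p for p in profiles if "information retrieval" in p["summary"]]
hits = restrict(hits, "location", "San Francisco Bay Area")
hits = restrict(hits, "school", "CMU")
# Each restriction can only shrink the set: precision goes up, recall goes down.
```

Each selection monotonically narrows the result set, which is exactly the precision-increasing flow faceted search is built around.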

Precision / Recall Asymmetry

Faceted search is a great tool for information seeking systems. But it offers a flow that is asymmetric with respect to precision and recall.

Let’s invert the flow of faceted search. Rather than starting from a large, imprecise result set and progressively narrowing it, let’s start from a small, precise result set and progressively expand it. Since faceted search is often called “guided navigation” (a term Fritz Knabe and I coined at Endeca), let’s call this approach “guided exploration” (which has a nicer ring than “guided expansion”).

Guided exploration exchanges the roles of precision and recall. Faceted search starts with high recall and helps users increase precision while preserving as much recall as possible. In contrast, guided exploration starts with high precision and helps users increase recall while preserving as much precision as possible.

That sounds great in theory, but how can we implement guided exploration in practice?

Let’s remind ourselves why faceted search works so well. Faceted search offers the user information scent: the facet values help the user identify regions of higher precision relative to his or her information need. By selecting a sequence of facet values, the user arrives at a non-empty set that consists entirely or mostly of relevant results.

How to Expand a Result Set

How do we invert this flow? Just as enlarging an image is more complicated than reducing one, increasing recall is more complicated than increasing precision.

If our initial set is the result of selecting multiple facet values, then we may be able to increase recall by de-selecting facet values (e.g., de-selecting San Francisco Bay Area and CMU in my previous example). If we are using hierarchical facets, then rather than de-selecting a facet value, we may be able to replace it with a parent value (e.g., replacing San Francisco Bay Area with California). We can also remove one or more search keywords to broaden the results (e.g., information or retrieval).
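These relaxations are simple enough to enumerate mechanically. Here is a rough sketch, assuming an illustrative query representation (a keyword list plus a facet-selection dict) and a hypothetical one-entry facet hierarchy:

```python
# Sketch: enumerate straightforward relaxations of a faceted query.
# The query representation and FACET_PARENT hierarchy are illustrative assumptions.
FACET_PARENT = {"San Francisco Bay Area": "California"}  # hierarchical facets

def relaxations(keywords, selections):
    """Yield (keywords, selections) pairs, each one step broader than the input."""
    for facet, value in selections.items():
        rest = {f: v for f, v in selections.items() if f != facet}
        yield keywords, rest                       # de-select the facet value
        if value in FACET_PARENT:                  # or replace it with its parent
            yield keywords, {**rest, facet: FACET_PARENT[value]}
    for kw in keywords:                            # remove a search keyword
        yield [k for k in keywords if k != kw], dict(selections)

query = (["information", "retrieval"],
         {"location": "San Francisco Bay Area", "school": "CMU"})
candidates = list(relaxations(*query))
```

For the example query this yields five candidates: two facet de-selections, one parent replacement, and two keyword removals.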

Those are straightforward query relaxations. But there are more interesting ways to expand our results:

  • We can replace a facet value with the union of that value and similar values (e.g., replacing CMU with CMU OR MIT).
  • We can replace the entire query (or any subquery) with a union of that query and the results for a single facet value selection (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR Company: Google).
  • We can replace the entire query (or any subquery) with a union of that query and the results for a keyword search (e.g., ([information retrieval] AND Location: San Francisco Bay Area AND School: CMU) OR [faceted search]).
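To make the union-based expansions concrete, here is a sketch that rewrites a boolean query tree, replacing a facet-value selection with a union of that value and similar values. The tuple-based query representation is an illustrative assumption, not any real search engine's format:

```python
# Sketch: rewrite a boolean query tree to union a facet value with similar values.
def union_facet_values(query, facet, value, similar_values):
    """Replace `facet: value` with `facet: (value OR similar...)` in the tree."""
    def rewrite(node):
        if node == ("facet", facet, value):
            return ("or", [("facet", facet, v) for v in [value] + similar_values])
        if isinstance(node, tuple) and node[0] == "and":
            return ("and", [rewrite(child) for child in node[1]])
        return node
    return rewrite(query)

query = ("and", [("keywords", "information retrieval"),
                 ("facet", "location", "San Francisco Bay Area"),
                 ("facet", "school", "CMU")])
expanded = union_facet_values(query, "school", "CMU", ["MIT"])
# expanded now reads ... AND School: (CMU OR MIT), leaving the rest untouched.
```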

As we can see, there are many ways to progressively refine a query in a way that expands the result set. The question is how we provide users with options that increase recall while preserving as much precision as possible.

Frequency : Recall :: Similarity : Precision

Developers of faceted search systems don’t necessarily invest much thought into deciding which faceted refinement options to present to users. Some systems simply avoid dead ends, offering users all refinement options that lead to a non-empty result set. This approach breaks down when there are too many options, in which case most systems offer users the most frequent facet values. A chapter in my faceted search book discusses some other options.

Unfortunately, the number of options for guided exploration — at least if we go beyond the very limited basic options — is too vast to apply such a naive approach. Unions never lead to dead ends, and we don’t have a simple measure like frequency to rank our options.

Or perhaps we do. A good reason to favor frequent values as faceted refinement options is that they tend to preserve recall. What we need is a measure that tends to preserve precision when we expand a result set.

That measure is set similarity. More specifically, it is the asymmetric similarity between a set and a superset containing it, which we can think of as the former’s representativeness of the latter. If we are working with facets, we can measure this similarity in terms of differences between distributions of the facet values. If the current set has high precision, we should favor supersets that are similar to it in order to preserve precision.

I’ll spare readers the math, but I encourage you to read about Kullback-Leibler divergence and Jensen-Shannon divergence if you are not familiar with measures of similarity between probability distributions. I’m also glossing over key implementation details — such as how to model distributions of facet values as probability distributions, and how to handle smoothing and normalization for set size. I’ll try to cover these in future posts. But for now, let’s assume that we can measure the similarity between a set and a superset.
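For readers who do want a taste of the math, here is a minimal sketch of Jensen-Shannon divergence over facet-value distributions. It omits the smoothing and set-size normalization glossed over above, and the example values are made up:

```python
import math
from collections import Counter

# Sketch: Jensen-Shannon divergence between the facet-value distributions of a
# set and a superset, as a proxy for how well the superset preserves precision.

def distribution(values):
    """Turn a list of facet values into a probability distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def kl(p, q):
    """KL divergence D(p || q); assumes q is positive on p's support."""
    return sum(pv * math.log2(pv / q[v]) for v, pv in p.items() if pv > 0)

def js(p, q):
    """Symmetric and, with log base 2, bounded in [0, 1]."""
    support = set(p) | set(q)
    m = {v: 0.5 * (p.get(v, 0) + q.get(v, 0)) for v in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical "current company" values for a set and a candidate superset.
current = distribution(["Google", "Google", "LinkedIn", "CMU"])
superset = distribution(["Google", "Google", "Google", "LinkedIn", "CMU", "MIT"])
drift = js(current, superset)  # lower drift -> better precision preservation
```

Ranking candidate supersets by ascending divergence then favors expansions that stay representative of the current set.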

Guided Exploration: A General Framework

We now have the elements to put together a general framework for guided exploration:

  • Generate a set of candidate expansion options from the current search query using operations such as the following:
    • De-select a facet value.
    • Replace a facet value with its parent.
    • Replace a facet value with the union of it and other values from that facet.
    • Remove a search keyword.
    • Replace a search keyword with the union of it and related keywords.
    • Replace the entire query with the union of it and a related facet value selection.
    • Replace the entire query with the union of it and a related keyword search.
  • Evaluate each expansion option based on the similarity of the resulting set to the current one.
  • Present the most similar sets to the user as expansion options.
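Putting the pieces together, the framework above can be sketched as follows. The helper functions passed in (`run_query`, `expansions`, `facet_distribution`, `divergence`) are assumed placeholders corresponding to the steps above, not an existing API:

```python
# Sketch of the guided exploration framework: generate candidate expansions,
# score each by the distributional drift of its result set from the current
# one, and return the least-drifting (most representative) candidates.
def guided_exploration(query, run_query, expansions, facet_distribution,
                       divergence, k=5):
    current = run_query(query)
    base = facet_distribution(current)
    scored = []
    for candidate in expansions(query):
        results = run_query(candidate)
        if len(results) <= len(current):   # keep only true expansions
            continue
        drift = divergence(base, facet_distribution(results))
        scored.append((drift, candidate))
    scored.sort(key=lambda pair: pair[0])  # least drift first
    return [candidate for _, candidate in scored[:k]]
```

In a real system each step hides substantial work — candidate generation, efficient result counting, and distribution estimation — but the control flow is this simple.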

Visualizing Drift

It’s one thing to tell a user that two sets are distributionally similar based on an information-theoretic measure, and another to communicate that similarity in a language the user can understand. Here is an example of visualizing the similarity between [information retrieval] AND School: CMU and [information retrieval] AND School: (CMU or MIT):

As we can see from even this basic visualization, replacing CMU with (CMU OR MIT) increases the number of results by 70% while keeping a similar distribution of current companies — the notable exception being people who work for their almae matres.

Conclusion

Faceted search offers some of the most convincing evidence in favor of Gary Marchionini’s advocacy that we “empower people to explore large-scale information bases but demand that people also take responsibility for this control”. Guided exploration aims to generalize the value proposition of faceted search by inverting the roles of precision and recall. Given the importance of recall, I hope to see progress in this direction. If this is a topic that interests you, give me a shout. Especially if you’re a student looking for an internship this summer!


Next Play!

Every year brings its own adventures, but for me 2011 will be a tough act to follow.

A year ago, I’d just started working at LinkedIn, and my biggest concern was selling our apartment in Brooklyn so that my family could join me in California.

Little did I imagine that my new manager, who had just recruited me from Google to LinkedIn (and persuaded my family to change coasts!), would leave three months later for a startup. Welcome to Silicon Valley! At the time, I felt unready for the abrupt transition into the product executive team. In retrospect, I’m thankful for the kick in the pants that helped me transform my role and brought the best out of a great team.

Summer brought the excitement of LinkedIn’s IPO. The process was exhilarating, especially to someone who had worked for over a decade at a pre-IPO company.

Nonetheless, we didn’t let the IPO distract us from our mission. In March, we celebrated our 100 millionth member; by November, we passed 135 million. And lots more. We released new data products like Skills and Alumni. We won the OSCON Data Innovation Award for contributions to open source software for big data. We also acquired a few companies, including search engine startup IndexTank. In short, we heeded the two short words on the back of our commemorative IPO t-shirts: “next play”.

Fall was an intense season of conferences. Between CIKM, HCIR, RecSys, Strata, and a talk at CMU, it was a great opportunity to connect and reconnect with researchers and practitioners around the world. I am particularly proud of the success of this year’s HCIR workshop, which showed how much the workshop (now to become a 2-day symposium!) has grown up in five years.

But what capped my year off was seeing Endeca, the company I helped start in 1999, become one of Oracle’s largest acquisitions. Even though it’s been two years since I left, Endeca will always be a core facet of my professional identity. I look forward to great things from all the folks I worked with.

That brings us to 2012, ready to start a new year of adventures. Tough or not, our job is to make every new year more amazing than the previous ones. I’m ready for the challenge, and I hope you are too.

Here’s a teaser of what I have planned:

  • My team at LinkedIn is launching into 2012 with a strong focus on derived data quality and relevance. As regular readers know, I see data quality and richer interfaces for information seeking as inseparable concerns. And, speaking of quality, we’re hiring!
  • I’ll be speaking at Strata in a couple of months with Claire Hunsaker of Samasource about “Humans, Machines, and the Dimensions of Microwork“. I’m very excited to talk about the intersection of crowdsourcing and data science. And I’ll be joined by three of LinkedIn’s top data scientists: Monica Rogati, Sam Shah, and Pete Skomoroch.
  • I’ll be co-chairing the RecSys Industry Track this fall with Yehuda Koren. I’m honored to have the opportunity to work with Yehuda, who was part of the Netflix Grand Prize team and won best paper at RecSys 2011. We’re still putting together the program, but you can look at last year’s program to get an idea of what’s in store.
  • I’ll be at the CIKM Industry Event, this time as an invited speaker. CIKM will take place in Maui this fall, and I’m excited about the program that Evgeniy Gabrilovich is putting together for the Industry Event. It will be an all-invited program, just like last year.

I hope you’re also starting 2012 with a fresh sense of purpose. Let’s take a last moment to reflect on a great 2011, and then…NEXT PLAY!