Scale, Structure, and Semantics

This morning I had the pleasure of presenting a keynote address at the Semantic Technology & Business Conference (SemTechBiz). I’ve had a long and warm relationship with the semantic technology community, especially with Marco Neumann and the New York Semantic Web Meetup.

But I’m not exactly a fanboy of the semantic web, and I wasn’t sure how the audience would respond to some of my more provocative assertions. Fortunately the reception was very positive. Several people approached me afterwards to thank me for presenting a balanced argument for combining big data with structured representations and for raising HCIR issues.

A couple of people felt that faceted search was old news. I’m delighted that faceted search is becoming increasingly common, but there is still a lot of opportunity to use it more often and more effectively. And I was pleasantly surprised at the interest in discussing extensions of faceted search to address relationships between entities, as well as other nuances. I’ll have to dive into those in future posts.

For now, I hope you enjoy the slides, and I encourage you to ask questions in the comments.


Visual Search Startup Modista Is Back!

Long-time readers know that I’m a great fan of visual search startup Modista, which was a victim of software patent abuse. To my delight, Modista is back from the dead. Check it out!

Also see my previous coverage of Modista.


HCIR 2012: Call for Participation

Human-Computer Information Retrieval (HCIR) combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities.

The HCIR Symposium (formerly known as the HCIR Workshop) has run annually since 2007. The event unites academic researchers and industrial practitioners working at the intersection of HCI and IR to develop more sophisticated models, tools, and evaluation metrics to support activities such as interactive information retrieval and exploratory search. It provides an opportunity for attendees to informally share ideas via posters, small group discussions and selected short talks.

The Sixth Symposium on Human-Computer Interaction and Information Retrieval will be held as a two-day event on October 4 and 5, 2012 at IBM Research in Cambridge, Massachusetts. We are delighted to bring the event back to its birthplace (HCIR 2007 took place at MIT), and even more pleased to announce that our keynote speaker for the symposium this year will be UC Berkeley professor and search user interfaces pioneer Marti Hearst.

Topics for discussion and presentation at the symposium include, but are not limited to:

  • Novel interaction techniques for information retrieval.
  • Modeling and evaluation of interactive information retrieval.
  • Exploratory search and information discovery.
  • Information visualization and visual analytics.
  • Applications of HCI techniques to information retrieval needs in specific domains.
  • Ethnography and user studies relevant to information retrieval and access.
  • Scale and efficiency considerations for interactive information retrieval systems.
  • Relevance feedback and active learning approaches for information retrieval.

Demonstrations of systems and prototypes are particularly welcome.

We are also excited to continue the HCIR Challenge, this year focusing on the problem of people and expertise finding. We are grateful to Mendeley for providing this year’s corpus: a database based on Mendeley’s network of 1.6M+ researchers and 180M+ academic documents. Participants will build systems to enable efficient discovery of experts or expertise for applications such as collaborative research, team building, and competitive analysis.

In addition to the Challenge and a small number of research presentations, we will leave plenty of time for what participants have consistently told us they find extremely valuable: informal discussions, posters, and directed group discussions. Finally, we are extending our previous format to include a few full-length, fully refereed, archival-quality papers that will be indexed in the ACM Digital Library.

We have extended the event to a second day to accommodate more presentations (including the full papers), and to leave plenty of time for discussion and for interaction around the poster session. There will be a reception on Thursday evening.

Please consult the symposium web site for full details. But here are some important dates to keep in mind:

  • Deadline to request access to HCIR Challenge corpus: Friday, June 15
  • Submission deadline for position and research papers: Sunday, July 29

This event would not be possible without generous support from industry and academia. This year’s supporters are IBM Research, LinkedIn, Microsoft Research, MIT CSAIL, and Oracle. Microsoft Research is also providing funds for a limited number of student travel awards. Information about these awards is available on the symposium web site.

Looking forward to seeing you in October!


Data Science at LinkedIn: My Team

Lots of people ask me what it’s like to be a data scientist at LinkedIn. The short answer: it’s awesome. Folks like Pete Skomoroch and team are building data products related to identity and reputation, such as Skills and InMaps. Yael Garten is leading the effort to understand and increase mobile engagement. And other folks work on everything from open-source infrastructure to fraud detection. Amazing people helping our 160M+ members by deriving valuable insights from big data.

I wanted to take a moment to showcase my own team. As a team, we straddle the boundary between science and engineering. We work closely with several engineering teams to deliver products that our members use every day.

Joseph Adler is a name you might recognize from your bookshelf: he wrote Baseball Hacks and R in a Nutshell, both published by O’Reilly. At LinkedIn, he is a data hacker extraordinaire, currently focused on improving the network update stream.

Ahmet Bugdayci just joined LinkedIn this year, and he’s already on a tear. He’s working on a better approach to representing job titles, one of the most fundamental facets of our members’ professional identity. And he’s a polyglot.

Heyning Cheng is our innovator in chief. He envisions data products and does whatever it takes to hack them together. Our recruiters are especially happy to be his beta testers, and we’re working to turn those prototypes into shipped product.

Abhimanyu Lad is working on the next generation of LinkedIn search. He’s already improved spelling correction and group search, as well as building better ways to measure search effectiveness. But stay tuned — the best is yet to come!

Gloria Lau leads all things data for the student initiative. Check out LinkedIn Alumni to see what she’s been up to. Students are the future, and we’re excited to be making LinkedIn a great tool for students, alumni, and universities.

Monica Rogati spearheaded many of LinkedIn’s key products: the Talent Match system that matches jobs to candidates; the first machine learning model for People You May Know; and the first version of Groups You May Like. When she’s not working on our products, she gives awesome presentations.

Daria Sorokina recently joined us and is working on search quality. She’s a hard-core machine learning researcher and developer: check out her open-source code for additive groves.

Ramesh Subramonian has been focused on data efforts for our international expansion. Over 60% of our members live outside the United States, and his efforts ensure that LinkedIn’s value proposition is a global one.

Joyce Wang is a data science generalist. She is part of the search team, but she’s built great tools for log analysis and human evaluation that are finding great use across the company.

I hope that gives you a flavor of what it’s like to be a data scientist at LinkedIn — and on my team in particular.

Do you possess that rare combination of computer science background, technical skill, creative problem-solving ability, and product sense? If so, then I’d love to talk with you about opportunities to work on challenging problems with amazing people!


Science as a Strategy

Last night, I had the pleasure of delivering the keynote address at the CIO Summit US. It was an honor to address an assembly of CIOs, CTOs, and technology executives from the nation’s top organizations. My theme was “Science as a Strategy”.

To set the stage, I told the story of TunkRank: how, back in 2009, I proposed a Twitter influence measure based on an explicit model of attention scarcity which proved better than the intuitive but flawed approach of counting followers. The point of the story was not self-promotion, but rather to introduce my core message:

Science is the difference between instinct and strategy.

Given the audience, I didn’t expect this message to be particularly controversial. But we all know that belief is not the same as action, and science is not always popular in the C-Suite. Thus, I offered three suggestions to overcome the HIPPO (Highest Paid Person’s Opinion):

  • Ask the right questions.
  • Practice good data hygiene.
  • Don’t argue when you can experiment!

Asking the Right Questions

Asking the right questions seems obvious — after all, our answers can only be as good as the questions we ask. But science is littered with examples of people asking the wrong questions — from 19th-century phrenologists measuring the sizes of people’s skulls to evaluate intelligence to IT executives measuring lines of code to evaluate programmer productivity. It’s easy for us (today) to recognize these approaches as pseudoscience, but we have to make sure we ask the right questions in our own organizations.

As an example, I turned to the challenge of improving the hiring process. One approach I’ve seen tried at both Google and LinkedIn is to measure the accuracy of interviewers — that is, to see how well the hire / no-hire recommendations of individual interviewers predict the final decisions. But this turns out to be the wrong question — in large part because negative recommendations (especially early ones) weigh much more heavily in the decision than positive ones.

What we found instead was that we should focus on efficiency as an optimization problem. More specifically, there is a trade-off: short-circuiting the process as early as possible (e.g., after the candidate performs poorly on the first phone screen) reduces the average time per candidate, but it also reduces the number of good candidates who make it through the process. To optimize overall throughput (while keeping our high bar), we’ve had to calibrate the upstream filters. How to optimize that upstream filter turns out to be the right question to ask — and one we still continue to iterate on.
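The trade-off described above can be made concrete with a toy model. This is only a sketch under stated assumptions: the candidate scores, the `is_good_hire` labels, and the hour costs are all hypothetical, and the function name `funnel_stats` is invented for illustration. It does not describe the actual process at any company.

```python
def funnel_stats(candidates, threshold, screen_hours=1.0, onsite_hours=5.0):
    """Summarize a two-stage hiring funnel for a given phone-screen threshold.

    candidates: list of (phone_screen_score, is_good_hire) pairs (hypothetical data).
    Everyone gets a phone screen; only those scoring >= threshold go onsite.
    Assumes, for simplicity, that the onsite stage identifies good hires perfectly.
    """
    passed = [c for c in candidates if c[0] >= threshold]
    hires = sum(1 for _, good in passed if good)
    total_hours = screen_hours * len(candidates) + onsite_hours * len(passed)
    return {
        "hours_per_candidate": total_hours / len(candidates),
        "hires": hires,
        "hours_per_hire": total_hours / hires if hires else float("inf"),
    }


# Made-up candidates: a noisy phone-screen score and whether they would be a good hire.
candidates = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
              (0.5, False), (0.4, False), (0.3, True), (0.2, False)]

for t in (0.0, 0.5, 0.75):
    print(t, funnel_stats(candidates, t))
```

In this toy data, raising the threshold lowers the hours spent per candidate but also lowers the number of good candidates who get through, which is the tension that calibrating the upstream filter has to resolve.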

More generally, I talked about how, when we hire data scientists at LinkedIn, we look for not only strong analytical skills but also the product and business sense to pick the right questions to ask – questions whose answers create value for users and drive key business decisions. Asking the right questions is the foundation of good science.

Practicing Good Data Hygiene

Data mining is amazing, but we have to watch out for its pejorative meaning of discovering spurious patterns. I used the Super Bowl Indicator as an example of data mining gone wrong: with 80% accuracy, the conference (AFC vs. NFC) of the Super Bowl champion predicts the coming year’s stock market performance. Indeed, the NFC won this year (go Giants!) and subsequent market gains have been consistent with this indicator (so far).

We can all laugh at these misguided investors, but we make these mistakes all the time. Despite what researchers have called the “unreasonable effectiveness of data”, we still need the scientific method of first hypothesizing and then experimenting in order to obtain valid and useful conclusions. Without data hygiene, our desires, preconceptions, and other human frailties infect our rational analysis.

A very different example is using click-through data to measure the effectiveness of relevance ranking. This approach isn’t completely wrong, but it suffers from several flaws. And the fundamental flaw relates to data hygiene: how we present information to users infects their perception of relevance. Users assume that top-ranked results are more relevant than lower-ranked results. Also, they can only click on the results presented to them. To paraphrase Donald Rumsfeld: they don’t know what they don’t know. If we aren’t careful, a click-based evaluation of relevance creates positive feedback and only reinforces our initial assumptions – which certainly isn’t the point of evaluation!

Fortunately, there are ways to avoid these biases. We can pay people to rate results presented to them in random order. We can use the explore / exploit technique to hedge against the ranking algorithm’s preconceived bias. And so on.
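One simple way to read the explore / exploit idea is an epsilon-greedy policy: most of the time show the ranker's order, and occasionally show a random order so that some clicks are collected without the ranker's own bias. The sketch below assumes that interpretation; the function name `rank_results` and the sample scores are hypothetical, and real systems may use more refined schemes.

```python
import random


def rank_results(scores, epsilon, rng):
    """Return (ordered document ids, mode) for one search impression.

    scores: dict mapping document id -> relevance score from the ranker.
    epsilon: probability of showing a random order instead of the ranked one.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if rng.random() < epsilon:
        # Explore: a random order breaks the link between position and score,
        # so clicks on this impression are not shaped by the ranker's belief.
        shuffled = ranked[:]
        rng.shuffle(shuffled)
        return shuffled, "explore"
    # Exploit: show results in the order the ranker currently believes is best.
    return ranked, "exploit"


rng = random.Random(42)
scores = {"doc_a": 0.9, "doc_b": 0.5, "doc_c": 0.1}
for _ in range(5):
    print(rank_results(scores, epsilon=0.2, rng=rng))
```

Logging the mode alongside each impression lets clicks from the explore traffic be analyzed separately as a less biased signal.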

But the key take-away is that we have to practice good data hygiene, splitting our projects into the two distinct activities of hypothesis generation (i.e., exploratory analysis) and hypothesis testing using withheld data.
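The split between hypothesis generation and hypothesis testing can be illustrated with a small simulation. It is a sketch on purely synthetic data: both the "market" and the candidate "indicators" are coin flips, so no indicator has any real predictive power. The names `accuracy` and `mine_and_validate` are made up for this example.

```python
import random


def accuracy(indicator, outcomes):
    """Fraction of years in which the indicator matches the outcome."""
    hits = sum(1 for i, o in zip(indicator, outcomes) if i == o)
    return hits / len(outcomes)


def mine_and_validate(n_indicators=1000, n_explore=20, n_holdout=20, seed=0):
    """Mine random indicators on exploration years, then test on withheld years."""
    rng = random.Random(seed)
    n_years = n_explore + n_holdout
    # Market outcome per year: True = up, False = down (pure coin flips).
    market = [rng.random() < 0.5 for _ in range(n_years)]
    # Candidate indicators are also coin flips, so none is genuinely predictive.
    indicators = [[rng.random() < 0.5 for _ in range(n_years)]
                  for _ in range(n_indicators)]

    # Hypothesis generation: pick the indicator that looks best on exploration years.
    best = max(indicators,
               key=lambda ind: accuracy(ind[:n_explore], market[:n_explore]))
    explore_acc = accuracy(best[:n_explore], market[:n_explore])
    # Hypothesis testing: score that same indicator on the withheld years only.
    holdout_acc = accuracy(best[n_explore:], market[n_explore:])
    return explore_acc, holdout_acc


explore_acc, holdout_acc = mine_and_validate()
print(f"best indicator on exploration years: {explore_acc:.0%}")
print(f"same indicator on withheld years:    {holdout_acc:.0%}")
```

With a thousand random indicators and only twenty exploration years, the best one will look impressive by chance alone. Its score on the withheld years is the honest estimate, and for a coin-flip indicator that estimate should hover around 50%.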

Don’t Argue when you can Experiment

I couldn’t resist the opportunity to cite Nobel laureate Daniel Kahneman’s seminal work on understanding human irrationality. I also threw in Mercier and Sperber’s recent work on reasoning as argumentative. The summary: don’t trust anyone’s theories, not even mine!

Then what can you trust? The results of a well-run experiment. Rather than debating data-free assertions, subject your hypotheses to the ultimate test: controlled experiments. Not every hypothesis can be tested using a controlled experiment, but most can be.

I recounted the story of how Greg Linden persuaded his colleagues at Amazon to implement shopping-cart recommendations through A/B testing, despite objections from a marketing SVP. Indeed, his work — and Amazon’s generally — has strongly advanced the practice of A/B testing in online settings.

Of course, A/B testing is fundamental to all of our work at LinkedIn. Every feature we release, whether it’s the new People You May Know interface or improvements to Group Search relevance, starts with an A/B test. And sometimes A/B testing causes us to not launch — we listen to the data.

Don’t argue when you can experiment. Decisions about how to improve products and processes should not be settled by an Oxford-style debate. Rather, those decisions should be informed by data.
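As a rough illustration of how an A/B test turns an argument into a number, here is a minimal two-proportion z-test using only the standard library. The conversion counts are invented, and the choice of this particular test is an assumption made for the example, not a description of how any company analyzes its experiments.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a, conv_b: number of conversions in control (A) and treatment (B).
    n_a, n_b: number of users exposed to each variant.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 10,000 users per arm.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the observed difference is unlikely to be noise. A large one is a reason not to launch on the strength of the metric alone, however persuasive the argument for the feature sounded in the meeting.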

Conclusion: Even Steve Jobs Made Mistakes

Some of you may think that this is all good advice, but that science is no match for an inspired leader. Indeed, some pundits have seen Apple’s success relative to Google as an indictment of data-driven decision making in favor of an approach that follows a leader’s gut instinct. Are they right? Should we throw out all of our data and follow our CEOs’ instincts?

Let’s go back a decade. In 2002, Apple faced a pivotal decision – perhaps the most important decision in its history. The iPod was clearly a breakthrough product, but it was only compatible with the Mac. Remember that, back in 2002, Apple had only a 3.5% market share in the PC business. Apple’s top executives did their analysis and predicted that they could drive the massive success of the iPod by making it compatible with Windows, the dominant operating system with over 95% market share.

Steve Jobs resisted. At one point he said that Windows users would get to use the iPod “over [his] dead body”. After continued convincing, Jobs gave up. According to authorized biographer Walter Isaacson, Steve’s exact words were: “Screw it. I’m sick of listening to you assholes. Go do whatever the hell you want.” Luckily for Steve, Apple, and the consumer public, they did, and the rest is history.

It isn’t easy being one of those assholes. But that’s our job, much as it was theirs. It’s up to us to turn data into gold, to apply science and technology to create value for our organizations. Because without data, we are gambling on our leaders’ gut feelings. And our leaders, however inspired, have fallible instincts.

Science is the difference between instinct and strategy.


Semantic Link and Internet Evolution


Recently I had a couple of great opportunities to share my thoughts publicly, and I wanted to make sure readers here were aware of them.

The first was a special guest appearance on The Semantic Link, a program hosted by Paul Miller with regular panelists Peter Brown, Christine Connors, Eric Franzon, Eric Hoffer, Bernadette Hyland, and Andraz Tori. It was a lot of fun, and a great warm-up for the keynote I’ll be delivering on “Scale, Structure, and Semantics” at the upcoming Semantic Tech & Business Conference (SemTechBiz), which will take place in San Francisco in June.

The second was a live interview on Internet Evolution, hosted by Mary Jander and Nicole Ferraro. They clearly did their homework, scouring my blog posts and web commentary for everything controversial I’d ever said — and then some! If that’s enough to pique your interest, then I encourage you to listen to the recorded interview and read the chat transcript at Internet Evolution.

Happy to answer questions based on either of these sessions — comment away!


Noah Iliinsky: Tech Talk on Designing Data Visualizations

Note: This post was written by Yael Garten, a Senior Data Scientist at LinkedIn. Yael joined LinkedIn in 2011, where she leads our mobile analytics team. She previously worked at Stanford on text mining, personalized medicine, and biomedical informatics.

We live in an era of Big Data. But how do we use all of that data to answer questions and communicate those answers effectively?

My colleagues and I at LinkedIn were fortunate enough to hear answers from Noah Iliinsky, who literally wrote the book on designing data visualization.

Earlier this month, we hosted Noah at LinkedIn to give a tech talk on “Designing Effective Data Visualizations”. We are proud to make these tech talks open to the public, and enjoyed a great mix of attendees from local companies and universities. If you couldn’t attend the talk in person or remotely, I encourage you to watch the recording, embedded above.

Why do we visualize data? As Noah tells us, visualization makes data accessible. It gives us faster access to actionable insights and allows access to huge amounts of data. Visualization enables both data exploration (when you are still trying to discover the story) and data explanation (when you have a story to tell). Noah reviewed some great examples (watch the talk!), with an emphasis on the dos and don’ts of data visualization.

In particular, he provided a step-by-step framework for traversing the path from question to answer:

Phase 1: Decide what to visualize.

  • Understand the question your audience wants to answer.
  • Understand the actions they are hoping the answer will drive.
  • Consider who is consuming this data — their needs, biases, etc.
  • Decide what data to use — and what data not to use — and what relationships you are interested in.
  • Explore the data and construct a storyline.

Phase 2: Decide how to visualize it.

  • Use appropriate visual encodings for data and relationships.
  • Limit the data you include.
  • Use position for your most important relationship.
  • Try different axes.
  • Show your visualization to different people, without explanations. Show an expert, show a layman.
  • Iterate, iterate, iterate!

Noah also shared his thoughts on how to visualize social networks. He recommended useful tools for data visualization, including Tableau, Spotfire, D3, Processing, ggplot2, OmniGraffle, and OmniGraphSketcher.

Finally, he left us with key lessons to take home:
  • You are not your audience. This is a huge lesson that all of us must internalize to be great at what we do. Consider what you need to communicate to marketers, investors, members of the general public, etc.
  • Do user research! Understand your users’ hopes, dreams, and favorite flavors! Understand their identity, their jargon, culture, etc.
  • Remember that your success is defined by your customer’s success. If you can’t satisfy your customer’s needs, you have failed — no matter how insightful your analysis.

You can enjoy the talk by watching the embedded video above. And you can find more LinkedIn tech talks on our YouTube channel.


Data, Algorithms, and People

One of the highlights of the recent Data 2.0 Summit was a panel featuring:

The panel was supposed to focus on “Data Science and Predicting the Future”, but the most contentious topic was whether data, algorithms, or people (that is, the data scientists themselves) were the most important factor in the practice and success of data science.

Yes, we one-upped the debate that my colleague Monica Rogati instigated at this year’s Strata conference. In fact, Josh cited the “better data beats more data beats clever algorithms” argument that Monica made in her own Strata presentation. And, just like at Strata, there was a healthy dose of audience participation.

Of course, I came down on the side of data — which I believe won the debate hands down.

I’m a fan of clever algorithms, which Alexander had to defend given that Skytree’s core value proposition is better machine learning algorithms delivered at scale. But I’m with Peter Norvig et al. on the dominance of data over algorithms.

Favoring data over people was a harder choice. Anthony naturally made the case for people (Kaggle’s claim to fame is assembling many of the world’s best data scientists by organizing competitions). Hopefully my team won’t quit en masse when they read this blog post! But I think they’ll agree with me that, without the incredible data we work with at LinkedIn, they’d be unable to deliver the awesomeness that I’ve come to expect from them.

There’s a saying that we all cook from the same cookbooks, so that it’s the ingredients that make all the difference. To take the metaphor further, you can also try to poach your rival’s chefs. But data is the biggest entry barrier — and the most sustainable competitive advantage.

Of course, we should have the best people apply the best algorithms to work with the best data. But data comes first. The best meal starts with the best ingredients.


Video of Strata 2012 Talk on Humans, Machines, and the Dimensions of Microwork


The video of the presentation that Claire Hunsaker and I delivered on “Humans, Machines, and the Dimensions of Microwork” at Strata 2012 is now available as part of the complete video compilation. I’ve taken the liberty of uploading it to YouTube, so feel free to watch the embedded video above.


Data 2.0 Summit

I’ll be participating in the Data 2.0 Summit on Tuesday, April 3rd, and I hope to see some of you there. Last year, my colleague (and fellow LinkedIn data scientist) Scott Nicholson attended and wrote this guest post about it. This year, I’m not only attending but participating on a panel about social data, moderated by AllThingsD Senior Editor Liz Gannes.

There’s a great line-up of speakers for the day, including:

  • Bram Cohen, the founder, chief scientist, and inventor of BitTorrent, the leading peer-to-peer file sharing protocol for sharing large files on the Internet.
  • Michael Driscoll, CTO and co-founder of the Metamarkets Group. He moderated a fantastic debate at the recent Strata conference about the relative importance of domain expertise or machine learning for data scientists.
  • Gil Elbaz, the founder and CEO of Factual, an information marketplace. He is also the co-founder of Applied Semantics, which Google acquired in 2003 for $102M and turned into the foundation for AdSense (now a $10B business).
  • Anthony Goldbloom, co-founder and CEO of Kaggle, a platform for data science competitions that generated a lot of discussion at Strata.
  • Stefan Weitz, director of search at Bing. He’ll be on my panel. Also see the discussion I had with him in the comment thread for a post on “Why Are People So Clueless About Search?”.

And lots more, but you get the idea. I’m thrilled to be part of such a talent-heavy program and looking forward to insightful discussions with fellow panelists and attendees. Also a great excuse to spend a day in the city (note for my former townspeople: that’s what they call San Francisco around here).