
Science as a Strategy

Last night, I had the pleasure of delivering the keynote address at the CIO Summit US. It was an honor to address an assembly of CIOs, CTOs, and technology executives from the nation’s top organizations. My theme was “Science as a Strategy”.

To set the stage, I told the story of TunkRank: how, back in 2009, I proposed a Twitter influence measure based on an explicit model of attention scarcity which proved better than the intuitive but flawed approach of counting followers. The point of the story was not self-promotion, but rather to introduce my core message:

Science is the difference between instinct and strategy.
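
As an aside for the curious, the TunkRank idea fits in a few lines: each follower divides their attention among everyone they follow, and a user’s influence is the expected attention their tweets receive. Here is a minimal Python sketch of the fixed-point computation (the toy graph and the retweet probability p are placeholders):

    # Sketch of the TunkRank recurrence: influence(X) is the expected
    # attention X's tweets receive. Each follower F divides attention
    # among everyone they follow, and retweets with probability p.
    def tunkrank(followers, following_counts, p=0.05, iterations=50):
        """followers: user -> set of that user's followers.
        following_counts: user -> number of accounts they follow."""
        influence = {user: 0.0 for user in followers}
        for _ in range(iterations):  # iterate to a fixed point, a la PageRank
            influence = {
                user: sum((1.0 + p * influence[f]) / following_counts[f]
                          for f in followers[user])
                for user in followers
            }
        return influence

    # Toy example: bob and carol follow alice; alice follows no one.
    graph = {"alice": {"bob", "carol"}, "bob": set(), "carol": set()}
    counts = {"alice": 0, "bob": 1, "carol": 1}
    print(tunkrank(graph, counts))  # alice's influence: 2.0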

Given the audience, I didn’t expect this message to be particularly controversial. But we all know that belief is not the same as action, and science is not always popular in the C-Suite. Thus, I offered three suggestions to overcome the HIPPO (Highest Paid Person’s Opinion):

  • Ask the right questions.
  • Practice good data hygiene.
  • Don’t argue when you can experiment!

Asking the Right Questions

Asking the right questions seems obvious — after all, our answers can only be as good as the questions we ask. But science is littered with examples of people asking the wrong questions — from 19th-century phrenologists measuring the sizes of people’s skulls to evaluate intelligence to IT executives measuring lines of code to evaluate programmer productivity. It’s easy for us (today) to recognize these approaches as pseudoscience, but we have to make sure we ask the right questions in our own organizations.

As an example, I turned to the challenge of improving the hiring process. One approach I’ve seen tried at both Google and LinkedIn is to measure the accuracy of interviewers — that is, to see how well the hire / no-hire recommendations of individual interviewers predict the final decisions. But this turns out to be the wrong question — in large part because negative recommendations (especially early ones) weigh much more heavily in the decision than positive ones.

What we found instead is that we should treat hiring efficiency as an optimization problem. More specifically, there is a trade-off: short-circuiting the process as early as possible (e.g., after the candidate performs poorly on the first phone screen) reduces the average time spent per candidate, but it also reduces the number of good candidates who make it through the process. To optimize overall throughput (while keeping our bar high), we’ve had to calibrate the upstream filters. How to tune those upstream filters turns out to be the right question to ask — and one we continue to iterate on.
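
To make the trade-off concrete, here is a toy simulation (all parameters invented for illustration) of how moving the phone-screen bar shifts the balance between interviewer time spent and good candidates hired:

    import random

    # Toy model of the hiring funnel: a stricter phone screen saves onsite
    # time but also turns away more genuinely strong candidates.
    def hires_per_unit_time(threshold, n=10000, screen_cost=1.0, onsite_cost=5.0):
        rng = random.Random(42)
        time_spent, good_hires = 0.0, 0
        for _ in range(n):
            quality = rng.random()                # true (hidden) quality
            screen = quality + rng.gauss(0, 0.2)  # noisy phone-screen score
            time_spent += screen_cost
            if screen >= threshold:               # candidate advances onsite
                time_spent += onsite_cost
                if quality >= 0.8:                # onsite catches true quality
                    good_hires += 1
        return good_hires / time_spent

    # Sweep the upstream filter to find the throughput-maximizing bar.
    for t in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"threshold={t:.1f}  good hires per unit time={hires_per_unit_time(t):.4f}")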

More generally, I talked about how, when we hire data scientists at LinkedIn, we look for not only strong analytical skills but also the product and business sense to pick the right questions to ask – questions whose answers create value for users and drive key business decisions. Asking the right questions is the foundation of good science.

Practicing Good Data Hygiene

Data mining is amazing, but we have to watch out for its pejorative sense: discovering spurious patterns. I used the Super Bowl Indicator as an example of data mining gone wrong — with 80% historical accuracy, the conference (AFC vs. NFC) of the Super Bowl champion predicts the coming year’s stock market performance. Indeed, the NFC won this year (go Giants!), and subsequent market gains have been consistent with the indicator (so far).

We can all laugh at these misguided investors, but we make these mistakes all the time. Despite what researchers have called the “unreasonable effectiveness of data”, we still need the scientific method of first hypothesizing and then experimenting in order to obtain valid and useful conclusions. Without data hygiene, our desires, preconceptions, and other human frailties infect our rational analysis.
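
To see how easily spurious patterns arise, consider a toy simulation: test enough random “indicators” against random market outcomes, and some will look impressive by chance alone (all numbers invented):

    import random

    # Data mining gone wrong: test 1,000 coin-flip "indicators" against
    # 45 years of up/down market outcomes and report the best one found.
    random.seed(0)
    years = 45
    market_up = [random.random() < 0.5 for _ in range(years)]

    best_accuracy = 0.0
    for _ in range(1000):
        indicator = [random.random() < 0.5 for _ in range(years)]
        accuracy = sum(i == m for i, m in zip(indicator, market_up)) / years
        best_accuracy = max(best_accuracy, accuracy)

    # With enough candidate indicators, "predictive" patterns emerge from noise.
    print(f"best spurious accuracy: {best_accuracy:.0%}")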

A very different example is using click-through data to measure the effectiveness of relevance ranking. This approach isn’t completely wrong, but it suffers from several flaws. And the fundamental flaw relates to data hygiene: how we present information to users infects their perception of relevance. Users assume that top-ranked results are more relevant than lower-ranked results. Also, they can only click on the results presented to them. To paraphrase Donald Rumsfeld: they don’t know what they don’t know. If we aren’t careful, a click-based evaluation of relevance creates a positive feedback loop and only reinforces our initial assumptions – which certainly isn’t the point of evaluation!

Fortunately, there are ways to avoid these biases. We can pay people to rate results presented to them in random order. We can use the explore / exploit technique to hedge against the ranking algorithm’s preconceived bias. And so on.
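
For instance, a minimal epsilon-greedy sketch (one simple form of explore / exploit; the epsilon value is a placeholder) might trust the ranker most of the time and occasionally show a shuffled order so that lower-ranked results can collect unbiased clicks:

    import random

    # Epsilon-greedy presentation: usually exploit the ranker's order, but
    # occasionally explore a random order to gather unbiased click feedback.
    def present_results(ranked_results, epsilon=0.1):
        if random.random() < epsilon:           # explore: random order
            shuffled = list(ranked_results)
            random.shuffle(shuffled)
            return shuffled, "explore"
        return list(ranked_results), "exploit"  # exploit: ranker's order

    results, mode = present_results(["doc_a", "doc_b", "doc_c"])
    print(mode, results)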

But the key take-away is that we have to practice good data hygiene, splitting our projects into the two distinct activities of hypothesis generation (i.e., exploratory analysis) and hypothesis testing using withheld data.
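
In code, that discipline can be as simple as making the split once, up front, and touching the withheld data only to test the final hypothesis (a minimal sketch):

    import random

    # Good data hygiene: explore on one split, test hypotheses on the other.
    def split_holdout(records, test_fraction=0.2, seed=42):
        shuffled = list(records)
        random.Random(seed).shuffle(shuffled)  # deterministic shuffle
        cut = int(len(shuffled) * (1 - test_fraction))
        return shuffled[:cut], shuffled[cut:]  # (explore, holdout)

    explore_set, holdout_set = split_holdout(range(100))
    # Generate hypotheses on explore_set only; evaluate the final
    # hypothesis exactly once on holdout_set.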

Don’t Argue When You Can Experiment

I couldn’t resist the opportunity to cite Nobel laureate Daniel Kahneman’s seminal work on understanding human irrationality. I also threw in Mercier and Sperber’s recent work on reasoning as argumentation. The summary: don’t trust anyone’s theories, not even mine!

Then what can you trust? The results of a well-run experiment. Rather than debating data-free assertions, subject your hypotheses to the ultimate test: controlled experiments. Not every hypothesis can be tested using a controlled experiment, but most can be.

I recounted the story of how Greg Linden persuaded his colleagues at Amazon to implement shopping-cart recommendations through A/B testing, despite objections from a marketing SVP. Indeed, his work — and Amazon’s generally — has strongly advanced the practice of A/B testing in online settings.

Of course, A/B testing is fundamental to all of our work at LinkedIn. Every feature we release, whether it’s the new People You May Know interface or improvements to Group Search relevance, starts with an A/B test. And sometimes A/B testing causes us to not launch — we listen to the data.
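
For a flavor of the arithmetic behind “listening to the data”, here is a sketch using a standard two-proportion z-test (the traffic numbers are invented):

    from math import erf, sqrt

    # Did variant B's click-through rate beat variant A's by more than
    # chance would explain? (Two-sided two-proportion z-test.)
    def ab_test(clicks_a, views_a, clicks_b, views_b):
        p_a, p_b = clicks_a / views_a, clicks_b / views_b
        pooled = (clicks_a + clicks_b) / (views_a + views_b)
        se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    z, p = ab_test(clicks_a=500, views_a=10000, clicks_b=570, views_b=10000)
    print(f"z = {z:.2f}, p = {p:.3f}")  # launch only if the lift is real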

Don’t argue when you can experiment. Decisions about how to improve products and processes should not be settled by Oxford-style debate. Rather, those decisions should be informed by data.

Conclusion: Even Steve Jobs Made Mistakes

Some of you may think that this is all good advice, but that science is no match for an inspired leader. Indeed, some pundits have seen Apple’s success relative to Google as an indictment of data-driven decision making in favor of an approach that follows a leader’s gut instinct. Are they right? Should we throw out all of our data and follow our CEOs’ instincts?

Let’s go back a decade. In 2002, Apple faced a pivotal decision – perhaps the most important decision in its history. The iPod was clearly a breakthrough product, but it was only compatible with the Mac. Remember that, back in 2002, Apple had only a 3.5% market share in the PC business. Apple’s top executives did their analysis and predicted that they could drive the massive success of the iPod by making it compatible with Windows, the dominant operating system with over 95% market share.

Steve Jobs resisted. At one point he said that Windows users would get to use the iPod “over [his] dead body”. But his team kept pushing, and Jobs finally gave in. According to authorized biographer Walter Isaacson, his exact words were: “Screw it. I’m sick of listening to you assholes. Go do whatever the hell you want.” Luckily for Steve, Apple, and consumers, they did, and the rest is history.

It isn’t easy being one of those assholes. But that’s our job, much as it was theirs. It’s up to us to turn data into gold, to apply science and technology to create value for our organizations. Because without data, we are gambling on our leaders’ gut feelings. And our leaders, however inspired, have fallible instincts.

Science is the difference between instinct and strategy.


Semantic Link and Internet Evolution


Recently I had a couple of great opportunities to share my thoughts publicly, and I wanted to make sure readers here were aware of them.

The first was a special guest appearance on The Semantic Link, a program hosted by Paul Miller with regular panelists Peter Brown, Christine Connors, Eric Franzon, Eric Hoffer, Bernadette Hyland, and Andraz Tori. It was a lot of fun, and a great warm-up for the keynote I’ll be delivering on “Scale, Structure, and Semantics” at the upcoming Semantic Tech & Business Conference (SemTechBiz), which will take place in San Francisco in June.

The second was a live interview on Internet Evolution, hosted by Mary Jander and Nicole Ferraro. They clearly did their homework, scouring my blog posts and web commentary for everything controversial I’d ever said — and then some! If that’s enough to pique your interest, then I encourage you to listen to the recorded interview and read the chat transcript at Internet Evolution.

Happy to answer questions based on either of these sessions — comment away!


Noah Iliinsky: Tech Talk on Designing Data Visualizations

Note: This post was written by Yael Garten, a Senior Data Scientist at LinkedIn. Yael joined LinkedIn in 2011, where she leads our mobile analytics team. She previously worked at Stanford on text mining, personalized medicine, and biomedical informatics.

We live in an era of Big Data. But how do we use all of that data to answer questions and communicate those answers effectively?

My colleagues and I at LinkedIn were fortunate enough to hear answers from Noah Iliinsky, who literally wrote the book on designing data visualizations.

Earlier this month, we hosted Noah at LinkedIn to give a tech talk on “Designing Effective Data Visualizations”. We are proud to make these tech talks open to the public, and enjoyed a great mix of attendees from local companies and universities. If you couldn’t attend the talk in person or remotely, I encourage you to watch the recording, embedded above.

Why do we visualize data? As Noah tells us, visualization makes data accessible. It gives us faster access to actionable insights and allows access to huge amounts of data. Visualization enables both data exploration (when you are still trying to discover the story) and data explanation (when you have a story to tell). Noah reviewed some great examples (watch the talk!), with an emphasis on the dos and don’ts of data visualization.

In particular, he provided a step-by-step framework for traversing the path from question to answer:

Phase 1: Decide what to visualize.

  • Understand the question your audience wants to answer.
  • Understand the actions they are hoping the answer will drive.
  • Consider who is consuming this data — their needs, biases, etc.
  • Decide what data to use — and what data not to use — and what relationships you are interested in.
  • Explore the data and construct a storyline.

Phase 2: Decide how to visualize it.

  • Use appropriate visual encodings for data and relationships (cf. http://complexdiagrams.com/properties).
  • Limit the data you include.
  • Use position for your most important relationship (see the sketch after this list).
  • Try different axes.
  • Show your visualization to different people, without explanations. Show an expert, show a layman.
  • Iterate, iterate, iterate!
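
To make a couple of these rules concrete, here is a minimal Python sketch (with invented data) that puts the most important relationship on position and limits the data to one clear story:

    import matplotlib.pyplot as plt

    # Two of Noah's rules in practice: encode the key relationship with
    # position (x/y), and limit the data you include to one clear story.
    weeks = [1, 2, 3, 4, 5, 6]
    signups = [120, 150, 180, 260, 310, 400]  # invented numbers

    fig, ax = plt.subplots()
    ax.plot(weeks, signups, marker="o")       # position encodes the relation
    ax.set_xlabel("Week")
    ax.set_ylabel("Signups")
    ax.set_title("Weekly signups")            # one story, no extra series
    plt.show()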

Noah also shared his thoughts on how to visualize social networks. He recommended useful tools for data visualization, including Tableau, Spotfire, D3, Processing, ggplot2, OmniGraffle, and OmniGraphSketcher.

Finally, he left us with key lessons to take home:
  • You are not your audience. This is a huge lesson that all of us must internalize to be great at what we do. Consider what you need to communicate to marketers, investors, members of the general public, etc.
  • Do user research! Understand your users’ hopes, dreams, and favorite flavors! Understand their identity, their jargon, culture, etc.
  • Remember that your success is defined by your customer’s success. If you can’t satisfy your customer’s needs, you have failed — no matter how insightful your analysis.

You can enjoy the talk by watching the embedded video above. And you can find more LinkedIn tech talks on our YouTube channel.


Data, Algorithms, and People

One of the highlights of the recent Data 2.0 Summit was a panel featuring Alexander of Skytree, Anthony of Kaggle, Josh, and me.

The focus of the panel was supposed to be “Data Science and Predicting the Future”, but the most contentious topic was whether data, algorithms, or people (that is, the data scientists themselves) were the most important factor in the practice and success of data science.

Yes, we one-upped the debate that my colleague Monica Rogati instigated at this year’s Strata conference. In fact, Josh cited the “better data beats more data beats clever algorithms” argument that Monica made in her own Strata presentation. And, just like at Strata, there was a healthy dose of audience participation.

Of course, I came down on the side of data — which I believe won the debate hands down.

I’m a fan of clever algorithms, which Alexander had to defend given that Skytree’s core value proposition is better machine learning algorithms delivered at scale. But I’m with Peter Norvig et al. on the dominance of data over algorithms.

Favoring data over people was a harder choice. Anthony naturally made the case for people (Kaggle’s claim to fame is assembling many of the world’s best data scientists by organizing competitions). Hopefully my team won’t quit en masse when they read this blog post! But I think they’ll agree with me that, without the incredible data we work with at LinkedIn, they’d be unable to deliver the awesomeness that I’ve come to expect from them.

There’s a saying that we all cook from the same cookbooks, so it’s the ingredients that make all the difference. To take the metaphor further, you can also try to poach your rival’s chefs. But data is the biggest barrier to entry — and the most sustainable competitive advantage.

Of course, we should have the best people apply the best algorithms to work with the best data. But data comes first. The best meal starts with the best ingredients.