
Science as a Strategy

Last night, I had the pleasure of delivering the keynote address at the CIO Summit US. It was an honor to address an assembly of CIOs, CTOs, and technology executives from the nation’s top organizations. My theme was “Science as a Strategy”.

To set the stage, I told the story of TunkRank: how, back in 2009, I proposed a Twitter influence measure based on an explicit model of attention scarcity which proved better than the intuitive but flawed approach of counting followers. The point of the story was not self-promotion, but rather to introduce my core message:

Science is the difference between instinct and strategy.

Given the audience, I didn’t expect this message to be particularly controversial. But we all know that belief is not the same as action, and science is not always popular in the C-suite. Thus, I offered three suggestions for overcoming the HiPPO (Highest Paid Person’s Opinion):

  • Ask the right questions.
  • Practice good data hygiene.
  • Don’t argue when you can experiment!

Asking the Right Questions

Asking the right questions seems obvious — after all, our answers can only be as good as the questions we ask. But science is littered with examples of people asking the wrong questions — from 19th-century phrenologists measuring the sizes of people’s skulls to evaluate intelligence to IT executives measuring lines of code to evaluate programmer productivity. It’s easy for us (today) to recognize these approaches as pseudoscience, but we have to make sure we ask the right questions in our own organizations.

As an example, I turned to the challenge of improving the hiring process. One approach I’ve seen tried at both Google and LinkedIn is to measure the accuracy of interviewers — that is, to see how well the hire / no-hire recommendations of individual interviewers predict the final decisions. But this turns out to be the wrong question — in large part because negative recommendations (especially early ones) weigh much more heavily in the decision than positive ones.

What we found instead is that we should treat hiring efficiency as an optimization problem. More specifically, there is a trade-off: short-circuiting the process as early as possible (e.g., after the candidate performs poorly on the first phone screen) reduces the average time spent per candidate, but it also reduces the number of good candidates who make it through the process. To optimize overall throughput (while keeping our high bar), we’ve had to calibrate the upstream filters. How to calibrate those upstream filters turns out to be the right question to ask — and one we continue to iterate on.
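
To make the trade-off concrete, here is a minimal sketch of the funnel calculation, in Python. All pass rates and interview durations are hypothetical, chosen only to illustrate the tension between interviewer time and the number of good candidates who survive the funnel.

    # A toy model of the hiring funnel: each stage is a tuple of
    # (pass rate for good candidates, overall pass rate, interviewer hours).
    # All numbers below are made up for illustration.

    def funnel_cost(stages):
        """Return (fraction of good candidates who survive the funnel,
        expected interviewer-hours spent per applicant)."""
        reach = 1.0        # fraction of applicants who reach the current stage
        good_kept = 1.0    # fraction of good candidates still in the funnel
        hours = 0.0        # expected interviewer-hours per applicant
        for good_pass, overall_pass, stage_hours in stages:
            hours += reach * stage_hours
            reach *= overall_pass
            good_kept *= good_pass
        return good_kept, hours

    # A strict phone screen saves downstream interviewer time but drops more
    # good candidates; a lenient one keeps them at a higher cost per applicant.
    strict  = [(0.70, 0.20, 1.0), (0.90, 0.50, 1.0), (0.95, 0.60, 5.0)]
    lenient = [(0.90, 0.40, 1.0), (0.90, 0.50, 1.0), (0.95, 0.60, 5.0)]

    for name, stages in (("strict", strict), ("lenient", lenient)):
        good_kept, hours = funnel_cost(stages)
        print(f"{name:8s} screen: keeps {good_kept:.0%} of good candidates, "
              f"~{hours:.1f} interviewer-hours per applicant")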

More generally, I talked about how, when we hire data scientists at LinkedIn, we look for not only strong analytical skills but also the product and business sense to pick the right questions to ask – questions whose answers create value for users and drive key business decisions. Asking the right questions is the foundation of good science.

Practicing Good Data Hygiene

Data mining is amazing, but we have to watch out for the pejorative sense of the term: discovering spurious patterns. I used the Super Bowl Indicator as an example of data mining gone wrong — with roughly 80% historical accuracy, the conference (AFC vs. NFC) of the Super Bowl champion has predicted the coming year’s stock market performance. Indeed, the NFC won this year (go Giants!), and subsequent market gains have been consistent with this indicator (so far).
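
To see how easy it is to stumble onto a pattern like this by chance, here is a small simulation in Python with entirely made-up data: scan a pile of random binary “indicators” against a couple of decades of up/down market outcomes, and the best of them will look impressively accurate.

    # Toy simulation: none of this is real market data. With enough random
    # "indicators" to choose from, the best one looks impressive by chance.
    import random

    random.seed(2012)
    years = 20
    market_up = [random.random() < 0.7 for _ in range(years)]  # markets rise most years

    best = 0.0
    for _ in range(1000):  # 1,000 candidate indicators, all pure noise
        indicator = [random.random() < 0.5 for _ in range(years)]
        hits = sum(i == m for i, m in zip(indicator, market_up))
        best = max(best, max(hits, years - hits) / years)  # allow flipping the sign

    print(f"Best accuracy among 1,000 random indicators: {best:.0%}")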

We can all laugh at these misguided investors, but we make these mistakes all the time. Despite what researchers have called the “unreasonable effectiveness of data”, we still need the scientific method of first hypothesizing and then experimenting in order to obtain valid and useful conclusions. Without data hygiene, our desires, preconceptions, and other human frailties infect our rational analysis.

A very different example is using click-through data to measure the effectiveness of relevance ranking. This approach isn’t completely wrong, but it suffers from several flaws. And the fundamental flaw relates to data hygiene: how we present information to users infects their perception of relevance. Users assume that top-ranked results are more relevant than lower-ranked results. Also, they can only click on the results presented to them. To paraphrase Donald Rumsfeld: they don’t know what they don’t know. If we aren’t careful, a click-based evaluation of relevance creates positive feedback and only reinforces our initial assumptions – which certainly isn’t the point of evaluation!

Fortunately, there are ways to avoid these biases. We can pay people to rate results presented to them in random order. We can use the explore / exploit technique to hedge against the ranking algorithm’s preconceived bias. And so on.
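
As a concrete (and deliberately simplified) example of the explore / exploit idea, here is a sketch in Python. The names and the 5% exploration rate are illustrative, not a description of any production system.

    # Serve a small fraction of queries with a shuffled ranking, so that clicks
    # collected on that traffic do not simply echo the ranker's own ordering.
    import random

    EXPLORE_RATE = 0.05  # fraction of queries used for exploration (illustrative)

    def results_to_display(ranked_results):
        """Return (results to show, whether this request is exploration traffic)."""
        if random.random() < EXPLORE_RATE:
            shuffled = list(ranked_results)
            random.shuffle(shuffled)
            return shuffled, True
        return list(ranked_results), False

    results, is_exploration = results_to_display(["doc_a", "doc_b", "doc_c"])
    # Downstream, relevance is evaluated only on clicks from exploration traffic,
    # where display position no longer reflects the ranker's preconceptions.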

But the key take-away is that we have to practice good data hygiene, splitting our projects into the two distinct activities of hypothesis generation (i.e., exploratory analysis) and hypothesis testing using withheld data.
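
In code, the core discipline can be as simple as withholding a test set before any exploration begins. A sketch in Python, with placeholder data:

    # Set aside a holdout before any exploratory analysis, and touch it only to
    # test hypotheses that were stated in advance.
    import random

    def split_holdout(records, test_fraction=0.2, seed=0):
        """Shuffle once and withhold a test set up front."""
        rng = random.Random(seed)
        shuffled = list(records)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        return shuffled[cut:], shuffled[:cut]  # (exploration set, holdout set)

    explore_set, holdout_set = split_holdout(range(1000))  # placeholder data
    # Generate hypotheses freely on explore_set; evaluate each one once on holdout_set.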

Don’t Argue When You Can Experiment

I couldn’t resist the opportunity to cite Nobel laureate Daniel Kahneman’s seminal work on understanding human irrationality. I also threw in Mercier and Sperber’s recent work on reasoning as argumentative. The summary: don’t trust anyone’s theories, not even mine!

Then what can you trust? The results of a well-run experiment. Rather than debating data-free assertions, subject your hypotheses to the ultimate test: controlled experiments. Not every hypothesis can be tested using a controlled experiment, but most can be.
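
For completeness, here is what reading out a simple controlled experiment can look like, sketched in Python with a two-proportion z-test. The conversion counts are invented for illustration, and a real analysis would also worry about power, multiple testing, and novelty effects.

    # Compare conversion rates between control (A) and treatment (B).
    from math import erf, sqrt

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Return (lift of B over A, two-sided p-value) under a normal approximation."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
        return p_b - p_a, p_value

    # Made-up counts: 50,000 users per bucket.
    lift, p_value = two_proportion_z_test(conv_a=1200, n_a=50000, conv_b=1310, n_b=50000)
    print(f"lift: {lift:+.2%}, two-sided p-value: {p_value:.3f}")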

I recounted the story of how Greg Linden persuaded his colleagues at Amazon to implement shopping-cart recommendations through A/B testing, despite objections from a marketing SVP. Indeed, his work — and Amazon’s generally — has strongly advanced the practice of A/B testing in online settings.

Of course, A/B testing is fundamental to all of our work at LinkedIn. Every feature we release, whether it’s the new People You May Know interface or improvements to Group Search relevance, starts with an A/B test. And sometimes A/B testing causes us to not launch — we listen to the data.

Don’t argue when you can experiment. Decisions about how to improve products and processes should not be settled by an Oxford-style debate. Rather, those decisions should be informed by data.

Conclusion: Even Steve Jobs Made Mistakes

Some of you may think that this is all good advice, but that science is no match for an inspired leader. Indeed, some pundits have seen Apple’s success relative to Google as an indictment of data-driven decision making in favor of an approach that follows a leader’s gut instinct. Are they right? Should we throw out all of our data and follow our CEOs’ instincts?

Let’s go back a decade. In 2002, Apple faced a pivotal decision – perhaps the most important decision in its history. The iPod was clearly a breakthrough product, but it was only compatible with the Mac. Remember that, back in 2002, Apple had only a 3.5% market share in the PC business. Apple’s top executives did their analysis and predicted that they could drive the massive success of the iPod by making it compatible with Windows, the dominant operating system with over 95% market share.

Steve Jobs resisted. At one point he said that Windows users would get to use the iPod “over [his] dead body”. After continued convincing, Jobs gave up. According to authorized biographer Walter Isaacson, Steve’s exact words were: “Screw it. I’m sick of listening to you assholes. Go do whatever the hell you want.” Luckily for Steve, Apple, and the consumer public, they did, and the rest is history.

It isn’t easy being one of those assholes. But that’s our job, much as it was theirs. It’s up to us to turn data into gold, to apply science and technology to create value for our organizations. Because without data, we are gambling on our leaders’ gut feelings. And our leaders, however inspired, have fallible instincts.

Science is the difference between instinct and strategy.

By Daniel Tunkelang

High-Class Consultant.

17 replies on “Science as a Strategy”

“Then what can you trust? The results of a well-run experiment. Rather than debating data-free assertions, subject your hypotheses to the ultimate test: controlled experiments.”

In the most general sense, I don’t disagree with this post. Broadly, what you’re saying is sage and should be followed.

My main problem with the whole A/B testing approach, where I continue to doubt its efficacy, is when you’re trying to “leap” a system, rather than “evolve” it.

And by “leap”, I don’t just mean large changes to the UI, UX, or algorithm. I understand that A/B testing can handle all *sizes* of changes, both small and large.

What I am talking about is when the change is qualitatively different, not just quantitatively different. That kind of a leap. For these sorts of changes, I remain unconvinced that A/B testing is capable of meeting the challenge. And that is because in order to make some of these leaps, it is not enough just to change the UI, UX, or algorithm. No, what has to be changed is the OEC (or KPI or whatever you call it). What has to be changed is the metric itself.

So what you have is a condition A that is best measured by (let’s call it) OEC_a. And condition B that is best measured by OEC_b. And because the metrics themselves are different, because the system itself has taken a qualitative leap, A/B testing is scientifically impossible to perform. A/B testing, science itself, requires sameness of metric across all conditions.

In those cases, those leaps, it seems to me that you still need a strong “Jobsian” (or Lindenian? Tunkelangian?) voice, to say that the metric itself needs to change. That we’re not just using a new algorithm or new UI with an old metric. That we need to have a new metric, too.

So that’s my main problem with this whole thing. Like I said, 98.3% of the time it works just fine. But 98.3% of the time you’re doing “normal science”, are you not? (http://en.wikipedia.org/wiki/Normal_science) When you need to step outside of normal science, when you need a Kuhnian “paradigm shift”, then your A/B testing also falls apart. The creation of an OEC is essentially the same thing as the establishment of a paradigm. A/B testing does not allow you to do paradigm shifts (http://en.wikipedia.org/wiki/Paradigm_shift).

For a good example of this, I point you to a blog post by this nice fellow, on the difference between “set retrieval” and “ranked retrieval”:

https://thenoisychannel.com/2008/08/24/set-retrieval-vs-ranked-retrieval/

To me, in order to go from ranked retrieval to set retrieval, you can’t just A/B test your way into it. You need to change the OEC itself. Because the OEC for ranked retrieval is different than the one for set retrieval. A/B testing cannot, therefore, help you make that leap.


…not to mention the fact that if your OEC does need to change (or if you need to create an OEC in the first place), can you even use A/B testing to come up with that OEC? I don’t think so. Well, unless you have a meta-OEC that lets you choose between OECs. Then you’re into an infinitely recursive conundrum in which you need meta-meta-OECs and so on.

So don’t you still need a HiPPO, or at least one voice in the room that shouts louder than all the rest, to come to an OEC agreement?


Dave, thanks!

Jeremy, so good to have you back! And your point is well-taken — testing assumes an objective function, and I think your main point is that you can’t test whether you have chosen the right objective function.

The typical problem with objective functions is that they are models — attempts to achieve the right trade-off between realism and the simplicity that enables analysis. We can use data for validation, e.g., if we think that CTR or speed is a proxy for user happiness, then we can collect independent data on user happiness (e.g., through surveys) and compute the correlations. So we can evaluate objective functions to some extent.
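
A minimal sketch of that kind of validation, assuming we have paired per-user values of the proxy (say, CTR) and an independent survey score; the numbers here are invented:

    # Check how well a logged proxy metric tracks independently surveyed happiness.
    def pearson(xs, ys):
        """Pearson correlation between two equal-length sequences."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    ctr          = [0.02, 0.05, 0.03, 0.08, 0.04, 0.07]  # hypothetical per-user CTR
    survey_score = [3.1,  4.0,  3.4,  4.6,  3.3,  4.2]   # hypothetical survey ratings
    print(f"proxy vs. survey correlation: {pearson(ctr, survey_score):.2f}")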

But there are limits. Science won’t tell us which problems to solve. Indeed, one of my main points is that it is scientists, not science, who choose which problems to solve. That’s why we have scientists!

And, if we get paid enough, maybe we even get to be HiPPOs. 🙂


I think your main point is that you can’t test whether you have chosen the right objective function.

I was trying to make a slightly deeper point than that, which is to say that two different versions of one’s algorithm/interface/whatever might not be unify-able under any single OEC. It’s not just a matter of not being able to pick the right one. It’s that each of your two conditions A and B require a different “right one”.


Well, the point of an algorithm or interface change may well be a decision to optimize relative to a different objective function. But in that case you can compare the old and new approaches using the new objective function. Of course that should favor the new approach, but that’s fine if the approach is driven by the metric rather than vice versa.


Agreed that both methods should be evaluated using the new objective function (metric). But how did you come to the choice of the new objective function in the first place? You did say above that it’s difficult to test whether you’ve chosen the right objective function. But my concern with A/B testing is less about whether you’ve got the absolute best right one, and more about how the HiPPO still has ultimate control if she or he is allowed to pick the objective function.

This is similar to a rule we know about in politics, about how if you control the discourse (terminology, definitions) then you control the policy/outcome. Think about the abortion debates. Where you stand on those debates has everything to do with how you’ve chosen the objective function: life vs. choice. The left doesn’t define itself as “anti-life”, because life isn’t their objective function — choice is. And the right doesn’t define itself as “anti-choice”, because choice isn’t their objective function — life is. We could go off the deep end into what is or isn’t life… whose choice is at stake, blah blah blah. I don’t want to do that – that’s not my point. My point is to show how you can control the ultimate workings of a system, the ultimate policies (aka algorithms) that get implemented, by how you choose your objective function. It doesn’t matter how much you measure condition A against condition B, if the objective function has been specifically designed to favor condition B.

I think you see what I’m saying, when you write, “in that case you can compare the old and new approaches using the new objective function. Of course that should favor the new approach”. But I see it more strongly than that. I see it as the new objective function not just favoring the latter, but essentially being capable of pre-selecting the latter. As in a rigged game.

So if the HiPPO can still rig the game, by being the person who chooses the objective function, what does it even matter if the HiPPO can’t select algorithm A or algorithm B directly? The choice of objective function is essentially the same thing. Natch?


Exactly. I wish there were more discussion along those lines. I hear a lot about how it’s difficult to pick a good objective function, about how it’s difficult to test which objective function is better than another, etc. But I hear very little about how a savvy HiPPO, by controlling the objective function, still has ultimate control. And this isn’t just an academic exercise. I think web search is the way it is (mcdonalds-ified) because of objective-function controlling HiPPOs.


That was a great topic, a great talk, and a great time! Make sure TED knows about this.


So here is a case in point.

Yahoo! just released Axis: http://axis.yahoo.com/

Axis offers visual browsing of search results. Similar to SearchMe from a few years ago (http://en.wikipedia.org/wiki/SearchMe).

When SearchMe came out, I remember hearing from Google that they tried visual results in an A/B test, but that it failed. The big data science of A/B testing indicated that it was not a useful feature.

And yet Y! supposedly makes its decisions in a big data, science driven manner as well. Yahoo’s user base might be slightly smaller, but overall it covers all the same cities and states and nations as Google’s. In aggregation, one would expect the data-driven observations about the utility of visual search results to be very similar across user bases.

And yet they’re not.

Google’s science led them to not do visual results. Yahoo!’s science led them to do visual results.

What explains this? Lots of possibilities, sure. But the Occamy-est possibility seems to be the HiPPO, or the objective function set by the HiPPO. Both companies likely have very similar data. But how that data is “scienced” is a function of the HiPPO-chosen objective function.

So again, don’t get me wrong; I’m not anti-science 🙂 But I have to raise stronger objections to a strictly data-driven approach, when it really seems clear that the HiPPO is still more powerful in affecting the overall outcome (visual results vs. no visual results — complete opposite conclusion!)


Of course many organizations let HiPPOs make decisions – I said as much in my talk. But I’m not sure the examples you offer argue in favor of this status quo. 🙂

Regardless, we also agree that people decide on the choice of objective functions. That’s why we as people need to be good scientists — if we’re not part of the solution, then we are part of the problem.


But when you look at these two organizations, Y and G, they both have world-class scientists. They both ask what they believe to be the right questions, practice good data hygiene, and have a culture of experimentation over arguing, right? I don’t doubt either org’s adherence to these principles.

And yet the outcome of this science as a strategy led in two diametrically-opposed directions.

So isn’t that…? What does it…? How does one…?

I’m struggling to ask the right question here, but do you see what I’m saying? When two physicists on opposite sides of the world practice good scientific strategy, they much more often than not reach the same conclusions or outcome. Even climate scientists are like 98:2 in agreement about the direction that the big data is pointing.

And yet with two of the biggest data driven web companies out there, the outcome points in polar opposite directions.

So if I’m thinking about strategizing my decision-making processes scientifically, and I see something like this, it doesn’t seem all that different from a coin toss. Should I do a visual summary of web pages on the desktop? Coin flip heads or tails, I would be in agreement with the “science as a strategy” approach just as often no matter which way the coin landed.

Is this a valid concern, or am I just getting annoying? I do want us to be good scientists who are part of the solution and not part of the problem. But when I see outcomes like this example above, I’m not sure that I can be a good scientist who is part of the solution and not the problem. Not that I would be any less of a scientist, but that even if I am of the same caliber as the excellent people at both Y and G, I’d probably reach a coin-flip conclusion, too. Non?


Organizations with world-class scientists don’t necessarily listen to them. And even great scientists are capable of irrational decision making (cf. Linus Pauling’s work on Vitamin C). Science is what you do, not what you have.

I’m sure that executives in many organizations claim to have a culture of experimentation and hypothesis testing. But then they go ahead, make decisions without data, and treat the decisions they’ve made as facts on the ground rather than hypotheses waiting to be tested by reality.

Why organizations fail at decision making is a larger topic. What I’m describing is an ideal we should all strive for.


Hmm.. very fair points.

I guess my next question is: Are we, as outside observers, able to tell which, if any, entities have followed the strategy of science? If we see these two different companies, with two diametrically-opposed outcomes, is it possible for us to either observe or infer which one did it correctly, and which one did not? (By “correctly”, I mean, scientifically).

Because it’s not just a matter of marketplace success, I imagine. Following a scientific strategy does not guarantee positive market outcomes; it only guarantees scientific truth. And the two are not necessarily the same. So we can’t just look at the winner and say, “that one was more scientific”.

At the same time, it really would be nice to know who actually follows the strategy, and who only pays lip service (or, to what degree, since I’m sure it’s not black-and-white, but more nuanced). Because even though I’m not going to go out and build a competing search engine, I really do get academically curious about which of various approaches are better. In this example it was visual search results versus text-only, on the desktop. But the same question is true of most of the products that I see and/or use: How do I as a consumer know that a product has been created for me because of science, marketing, HiPPOing, or for whatever other reason? Is it possible to observe something from the outside, and infer how it was arrived at?


I wish I had a good answer. Usually we don’t get sufficient access to decision-making processes to evaluate them — certainly not for decisions that happen outside our own organizations. And, even if we do have that access, we may lack objectivity.

