
Science as a Strategy

Last night, I had the pleasure of delivering the keynote address at the CIO Summit US. It was an honor to address an assembly of CIOs, CTOs, and technology executives from the nation’s top organizations. My theme was “Science as a Strategy”.

To set the stage, I told the story of TunkRank: how, back in 2009, I proposed a Twitter influence measure based on an explicit model of attention scarcity which proved better than the intuitive but flawed approach of counting followers. The point of the story was not self-promotion, but rather to introduce my core message:

Science is the difference between instinct and strategy.

Given the audience, I didn’t expect this message to be particularly controversial. But we all know that belief is not the same as action, and science is not always popular in the C-suite. Thus, I offered three suggestions for overcoming the HiPPO (Highest Paid Person’s Opinion):

  • Ask the right questions.
  • Practice good data hygiene.
  • Don’t argue when you can experiment!

Asking the Right Questions

Asking the right questions seems obvious — after all, our answers can only be as good as the questions we ask. But science is littered with examples of people asking the wrong questions — from 19th-century phrenologists measuring the sizes of people’s skulls to evaluate intelligence to IT executives measuring lines of code to evaluate programmer productivity. It’s easy for us (today) to recognize these approaches as pseudoscience, but we have to make sure we ask the right questions in our own organizations.

As an example, I turned to the challenge of improving the hiring process. One approach I’ve seen tried at both Google and LinkedIn is to measure the accuracy of interviewers — that is, to see how well the hire / no-hire recommendations of individual interviewers predict the final decisions. But this turns out to be the wrong question — in large part because negative recommendations (especially early ones) weigh much more heavily in the decision than positive ones.

What we found instead is that we should treat hiring efficiency as an optimization problem. More specifically, there is a trade-off: short-circuiting the process as early as possible (e.g., after the candidate performs poorly on the first phone screen) reduces the average time spent per candidate, but it also reduces the number of good candidates who make it through the process. To optimize overall throughput (while keeping our high bar), we’ve had to calibrate the upstream filters. How to calibrate those upstream filters turns out to be the right question to ask — and one we continue to iterate on.
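
To make the trade-off concrete, here is a minimal sketch of the funnel calculation, in Python. All pass rates and interview durations are hypothetical, chosen only to illustrate the tension between interviewer time and the number of good candidates who survive the funnel.

    # A toy model of the hiring funnel: each stage is a tuple of
    # (pass rate for good candidates, overall pass rate, interviewer hours).
    # All numbers below are made up for illustration.

    def funnel_cost(stages):
        """Return (fraction of good candidates who survive the funnel,
        expected interviewer-hours spent per applicant)."""
        reach = 1.0        # fraction of applicants who reach the current stage
        good_kept = 1.0    # fraction of good candidates still in the funnel
        hours = 0.0        # expected interviewer-hours per applicant
        for good_pass, overall_pass, stage_hours in stages:
            hours += reach * stage_hours
            reach *= overall_pass
            good_kept *= good_pass
        return good_kept, hours

    # A strict phone screen saves downstream interviewer time but drops more
    # good candidates; a lenient one keeps them at a higher cost per applicant.
    strict  = [(0.70, 0.20, 1.0), (0.90, 0.50, 1.0), (0.95, 0.60, 5.0)]
    lenient = [(0.90, 0.40, 1.0), (0.90, 0.50, 1.0), (0.95, 0.60, 5.0)]

    for name, stages in (("strict", strict), ("lenient", lenient)):
        good_kept, hours = funnel_cost(stages)
        print(f"{name:8s} screen: keeps {good_kept:.0%} of good candidates, "
              f"~{hours:.1f} interviewer-hours per applicant")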

More generally, I talked about how, when we hire data scientists at LinkedIn, we look for not only strong analytical skills but also the product and business sense to pick the right questions to ask – questions whose answers create value for users and drive key business decisions. Asking the right questions is the foundation of good science.

Practicing Good Data Hygiene

Data mining is amazing, but we have to watch out for the pejorative sense of the term: discovering spurious patterns. I used the Super Bowl Indicator as an example of data mining gone wrong — with roughly 80% historical accuracy, the conference (AFC vs. NFC) of the Super Bowl champion has predicted the coming year’s stock market performance. Indeed, the NFC won this year (go Giants!), and subsequent market gains have been consistent with this indicator (so far).
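
To see how easy it is to stumble onto a pattern like this by chance, here is a small simulation in Python with entirely made-up data: scan a pile of random binary “indicators” against a couple of decades of up/down market outcomes, and the best of them will look impressively accurate.

    # Toy simulation: none of this is real market data. With enough random
    # "indicators" to choose from, the best one looks impressive by chance.
    import random

    random.seed(2012)
    years = 20
    market_up = [random.random() < 0.7 for _ in range(years)]  # markets rise most years

    best = 0.0
    for _ in range(1000):  # 1,000 candidate indicators, all pure noise
        indicator = [random.random() < 0.5 for _ in range(years)]
        hits = sum(i == m for i, m in zip(indicator, market_up))
        best = max(best, max(hits, years - hits) / years)  # allow flipping the sign

    print(f"Best accuracy among 1,000 random indicators: {best:.0%}")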

We can all laugh at these misguided investors, but we make these mistakes all the time. Despite what researchers have called the “unreasonable effectiveness of data”, we still need the scientific method of first hypothesizing and then experimenting in order to obtain valid and useful conclusions. Without data hygiene, our desires, preconceptions, and other human frailties infect our rational analysis.

A very different example is using click-through data to measure the effectiveness of relevance ranking. This approach isn’t completely wrong, but it suffers from several flaws. And the fundamental flaw relates to data hygiene: how we present information to users infects their perception of relevance. Users assume that top-ranked results are more relevant than lower-ranked results. Also, they can only click on the results presented to them. To paraphrase Donald Rumsfeld: they don’t know what they don’t know. If we aren’t careful, a click-based evaluation of relevance creates positive feedback and only reinforces our initial assumptions – which certainly isn’t the point of evaluation!

Fortunately, there are ways to avoid these biases. We can pay people to rate results presented to them in random order. We can use the explore / exploit technique to hedge against the ranking algorithm’s preconceived bias. And so on.
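
As a concrete (and deliberately simplified) example of the explore / exploit idea, here is a sketch in Python. The names and the 5% exploration rate are illustrative, not a description of any production system.

    # Serve a small fraction of queries with a shuffled ranking, so that clicks
    # collected on that traffic do not simply echo the ranker's own ordering.
    import random

    EXPLORE_RATE = 0.05  # fraction of queries used for exploration (illustrative)

    def results_to_display(ranked_results):
        """Return (results to show, whether this request is exploration traffic)."""
        if random.random() < EXPLORE_RATE:
            shuffled = list(ranked_results)
            random.shuffle(shuffled)
            return shuffled, True
        return list(ranked_results), False

    results, is_exploration = results_to_display(["doc_a", "doc_b", "doc_c"])
    # Downstream, relevance is evaluated only on clicks from exploration traffic,
    # where display position no longer reflects the ranker's preconceptions.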

But the key take-away is that we have to practice good data hygiene, splitting our projects into the two distinct activities of hypothesis generation (i.e., exploratory analysis) and hypothesis testing using withheld data.
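
In code, the core discipline can be as simple as withholding a test set before any exploration begins. A sketch in Python, with placeholder data:

    # Set aside a holdout before any exploratory analysis, and touch it only to
    # test hypotheses that were stated in advance.
    import random

    def split_holdout(records, test_fraction=0.2, seed=0):
        """Shuffle once and withhold a test set up front."""
        rng = random.Random(seed)
        shuffled = list(records)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        return shuffled[cut:], shuffled[:cut]  # (exploration set, holdout set)

    explore_set, holdout_set = split_holdout(range(1000))  # placeholder data
    # Generate hypotheses freely on explore_set; evaluate each one once on holdout_set.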

Don’t Argue When You Can Experiment

I couldn’t resist the opportunity to cite Nobel laureate Daniel Kahneman’s seminal work on understanding human irrationality. I also threw in Mercier and Sperber’s recent work on reasoning as argumentative. The summary: don’t trust anyone’s theories, not even mine!

Then what can you trust? The results of a well-run experiment. Rather than debating data-free assertions, subject your hypotheses to the ultimate test: controlled experiments. Not every hypothesis can be tested using a controlled experiment, but most can be.
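
For completeness, here is what reading out a simple controlled experiment can look like, sketched in Python with a two-proportion z-test. The conversion counts are invented for illustration, and a real analysis would also worry about power, multiple testing, and novelty effects.

    # Compare conversion rates between control (A) and treatment (B).
    from math import erf, sqrt

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Return (lift of B over A, two-sided p-value) under a normal approximation."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
        return p_b - p_a, p_value

    # Made-up counts: 50,000 users per bucket.
    lift, p_value = two_proportion_z_test(conv_a=1200, n_a=50000, conv_b=1310, n_b=50000)
    print(f"lift: {lift:+.2%}, two-sided p-value: {p_value:.3f}")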

I recounted the story of how Greg Linden persuaded his colleagues at Amazon to implement shopping-cart recommendations through A/B testing, despite objections from a marketing SVP. Indeed, his work — and Amazon’s generally — has strongly advanced the practice of A/B testing in online settings.

Of course, A/B testing is fundamental to all of our work at LinkedIn. Every feature we release, whether it’s the new People You May Know interface or improvements to Group Search relevance, starts with an A/B test. And sometimes A/B testing causes us to not launch — we listen to the data.

Don’t argue when you can experiment. Decisions about how to improve products and processes should not be settled by an Oxford-style debate. Rather, those decisions should be informed by data.

Conclusion: Even Steve Jobs Made Mistakes

Some of you may think that this is all good advice, but that science is no match for an inspired leader. Indeed, some pundits have seen Apple’s success relative to Google as an indictment of data-driven decision making in favor of an approach that follows a leader’s gut instinct. Are they right? Should we throw out all of our data and follow our CEOs’ instincts?

Let’s go back a decade. In 2002, Apple faced a pivotal decision – perhaps the most important decision in its history. The iPod was clearly a breakthrough product, but it was only compatible with the Mac. Remember that, back in 2002, Apple had only a 3.5% market share in the PC business. Apple’s top executives did their analysis and predicted that they could drive the massive success of the iPod by making it compatible with Windows, the dominant operating system with over 95% market share.

Steve Jobs resisted. At one point he said that Windows users would get to use the iPod “over [his] dead body”. After continued convincing, Jobs gave up. According to authorized biographer Walter Isaacson, Steve’s exact words were: “Screw it. I’m sick of listening to you assholes. Go do whatever the hell you want.” Luckily for Steve, Apple, and the consumer public, they did, and the rest is history.

It isn’t easy being one of those assholes. But that’s our job, much as it was theirs. It’s up to us to turn data into gold, to apply science and technology to create value for our organizations. Because without data, we are gambling on our leaders’ gut feelings. And our leaders, however inspired, have fallible instincts.

Science is the difference between instinct and strategy.

By Daniel Tunkelang

High-Class Consultant.

17 replies on “Science as a Strategy”

“Then what can you trust? The results of a well-run experiment. Rather than debating data-free assertions, subject your hypotheses to the ultimate test: controlled experiments.”

In the most general sense, I don’t disagree with this post. Broadly, what you’re saying is sage and should be followed.

My main problem with the whole A/B testing approach, where I continue to doubt its efficacy, is when you’re trying to “leap” a system, rather than “evolve” it.

And by “leap”, I don’t just mean large changes to the UI, UX, or algorithm. I understand that A/B testing can handle all *sizes* of changes, both small and large.

What I am talking about is when the change is qualitatively different, not just quantitatively different. That kind of a leap. For these sorts of changes, I remain unconvinced that A/B testing is capable of meeting the challenge. And that is because in order to make some of these leaps, it is not enough just to change the UI, UX, or algorithm. No, what has to be changed is the OEC (or KPI or whatever you call it). What has to be changed is the metric itself.

So what you have is a condition A that is best measured by (let’s call it) OEC_a. And condition B that is best measured by OEC_b. And because the metrics themselves are different, because the system itself has taken a qualitative leap, A/B testing is scientifically impossible to perform. A/B testing, science itself, requires sameness of metric across all conditions.

In those cases, those leaps, it seems to me that you still need a strong “Jobsian” (or Lindenian? Tunkelangian?) voice, to say that the metric itself needs to change. That we’re not just using a new algorithm or new UI with an old metric. That we need to have a new metric, too.

So that’s my main problem with this whole thing. Like I said, 98.3% of the time it works just fine. But 98.3% of the time you’re doing “normal science”, are you not? (http://en.wikipedia.org/wiki/Normal_science) When you need to step outside of normal science, when you need a Kuhnian “paradigm shift”, then your A/B testing also falls apart. The creation of an OEC is essentially the same thing as the establishment of a paradigm. A/B testing does not allow you to do paradigm shifts (http://en.wikipedia.org/wiki/Paradigm_shift).

For a good example of this, I point you to a blog post by this nice fellow, on the difference between “set retrieval” and “ranked retrieval”:

https://thenoisychannel.com/2008/08/24/set-retrieval-vs-ranked-retrieval/

To me, in order to go from ranked retrieval to set retrieval, you can’t just A/B test your way into it. You need to change the OEC itself. Because the OEC for ranked retrieval is different than the one for set retrieval. A/B testing cannot, therefore, help you make that leap.


…not to mention the fact that if your OEC does need to change (or if you need to create an OEC in the first place), can you even use A/B testing to come up with that OEC? I don’t think so. Well, unless you have a meta-OEC that lets you choose between OECs. Then you’re into an infinitely recursive conundrum in which you need meta-meta-OECs and so on.

So don’t you still need a HiPPO, or at least one voice in the room that shouts louder than all the rest, to come to an OEC agreement?


Dave, thanks!

Jeremy, so good to have you back! And your point is well-taken — testing assumes an objective function, and I think your main point is that you can’t test whether you have chosen the right objective function.

The typical problem with objective functions is that they are models — attempts to achieve the right trade-off between realism and the simplicity that enables analysis. We can use data for validation, e.g., if we think that CTR or speed is a proxy for user happiness, then we can collect independent data on user happiness (e.g., through surveys) and compute the correlations. So we can evaluate objective functions to some extent.
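
A minimal sketch of that kind of validation, assuming we have paired per-user values of the proxy (say, CTR) and an independent survey score; the numbers here are invented:

    # Check how well a logged proxy metric tracks independently surveyed happiness.
    def pearson(xs, ys):
        """Pearson correlation between two equal-length sequences."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    ctr          = [0.02, 0.05, 0.03, 0.08, 0.04, 0.07]  # hypothetical per-user CTR
    survey_score = [3.1,  4.0,  3.4,  4.6,  3.3,  4.2]   # hypothetical survey ratings
    print(f"proxy vs. survey correlation: {pearson(ctr, survey_score):.2f}")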

But there are limits. Science won’t tell us which problems to solve. Indeed, one of my main points is that it is scientists, not science, who choose which problems to solve. That’s why we have scientists!

And, if we get paid enough, maybe we even get to be HiPPOs. 🙂


I think your main point is that you can’t test whether you have chosen the right objective function.

I was trying to make a slightly deeper point than that, which is to say that two different versions of one’s algorithm/interface/whatever might not be unify-able under any single OEC. It’s not just a matter of not being able to pick the right one. It’s that each of your two conditions A and B require a different “right one”.


Well, the point of an algorithm or interface change may well be a decision to optimize relative to a different objective function. But in that case you can compare the old and new approaches using the new objective function. Of course that should favor the new approach, but that’s fine if the approach is driven by the metric rather than vice versa.


Agreed that both methods should be evaluated using the new objective function (metric). But how did you come to the choice of the new objective function in the first place? You did say above that it’s difficult to test whether you’ve chosen the right objective function. But my concern with A/B testing is less about whether you’ve got the absolute best right one, and more about how the HiPPO still has ultimate control if she or he is allowed to pick the objective function.

This is similar to a rule we know about in politics, about how if you control the discourse (terminology, definitions) then you control the policy/outcome. Think about the abortion debates. Where you stand on those debates has everything to do with how you’ve chosen the objective function: life vs. choice. The left doesn’t define itself as “anti-life”, because life isn’t their objective function — choice is. And the right doesn’t define itself as “anti-choice”, because choice isn’t their objective function — life is. We could go off the deep end into what is or isn’t life… whose choice is at stake, blah blah blah. I don’t want to do that – that’s not my point. My point is to show how you can control the ultimate workings of a system, the ultimate policies (aka algorithms) that get implemented, by how you choose your objective function. It doesn’t matter how much you measure condition A against condition B, if the objective function has been specifically designed to favor condition B.

I think you see what I’m saying, when you write, “in that case you can compare the old and new approaches using the new objective function. Of course that should favor the new approach”. But I see it more strongly than that. I see it as the new objective function not just favoring the latter, but essentially being capable of pre-selecting the latter. As in a rigged game.

So if the HiPPO can still rig the game, by being the person who chooses the objective function, what does it even matter if the HiPPO can’t select algorithm A or algorithm B directly? The choice of objective function is essentially the same thing. Natch?


Exactly. I wish there were more discussion along those lines. I hear a lot about how it’s difficult to pick a good objective function, about how it’s difficult to test which objective function is better than another, etc. But I hear very little about how a savvy HiPPO, by controlling the objective function, still has ultimate control. And this isn’t just an academic exercise. I think web search is the way it is (mcdonalds-ified) because of objective-function controlling HiPPOs.


That was a great topic, a great talk, and a great time! Make sure TED knows about this.


So here is a case in point.

Yahoo! just released Axis: http://axis.yahoo.com/

Axis offers visual browsing of search results. Similar to SearchMe from a few years ago (http://en.wikipedia.org/wiki/SearchMe).

When SearchMe came out, I remember hearing from Google that they tried visual results in an A/B test, but that it failed. The big data science of A/B testing indicated that it was not a useful feature.

And yet Y! supposedly makes its decisions in a big data, science driven manner as well. Yahoo’s user base might be slightly smaller, but overall it covers all the same cities and states and nations as Google’s. In aggregation, one would expect the data-driven observations about the utility of visual search results to be very similar across user bases.

And yet they’re not.

Google’s science led them to not do visual results. Yahoo!’s science led them to do visual results.

What explains this? Lots of possibilities, sure. But the Occamy-est possibility seems to be the HiPPO, or the objective function set by the HiPPO. Both companies likely have very similar data. But how that data is “scienced” is a function of the HiPPO-chosen objective function.

So again, don’t get me wrong; I’m not anti-science 🙂 But I have to raise stronger objections to a strictly data-driven approach, when it really seems clear that the HiPPO is still more powerful in affecting the overall outcome (visual results vs. no visual results — complete opposite conclusion!)


Of course many organizations let HiPPOs make decisions – I said as much in my talk. But I’m not sure the examples you offer argue in favor of this status quo. 🙂

Regardless, we also agree that people decide on the choice of objective functions. That’s why we as people need to be good scientists — if we’re not part of the solution, then we are part of the problem.


But when you look at these two organizations, Y and G, they both have world-class scientists. They both ask what they believe to be the right questions, practice good data hygiene, and have a culture of experimentation over arguing, right? I don’t doubt either org’s adherence to these principles.

And yet the outcome of this science as a strategy led in two diametrically-opposed directions.

So isn’t that…? What does it…? How does one…?

I’m struggling to ask the right question here, but do you see what I’m saying? When two physicists on opposite sides of the world practice good scientific strategy, they much more often than not reach the same conclusions or outcome. Even climate scientists are like 98:2 in agreement about the direction that the big data is pointing.

And yet with two of the biggest data driven web companies out there, the outcome points in polar opposite directions.

So if I’m thinking about strategizing my decision-making processes scientifically, and I see something like this, it doesn’t seem all that different from a coin toss. Should I do a visual summary of web pages on the desktop? Coin flip heads or tails, I would be in agreement with the “science as a strategy” approach just as often no matter which way the coin landed.

Is this a valid concern, or am I just getting annoying? I do want us to be good scientists who are part of the solution and not part of the problem. But when I see outcomes like this example above, I’m not sure that I can be a good scientist who is part of the solution and not the problem. Not that I would be any less of a scientist, but that even if I am of the same caliber as the excellent people at both Y and G, I’d probably reach a coin-flip conclusion, too. Non?


Organizations with world-class scientists don’t necessarily listen to them. And even great scientists are capable of irrational decision making (cf. Linus Pauling’s work on Vitamin C). Science is what you do, not what you have.

I’m sure that executives in many organizations claim to have a culture of experimentation and hypothesis testing. But then they go ahead, make decisions without data, and treat the decisions they’ve made as facts on the ground rather than hypotheses waiting to be tested by reality.

Why organizations fail at decision making is a larger topic. What I’m describing is an ideal we should all strive for.


Hmm.. very fair points.

I guess my next question is: Are we, as outside observers, able to tell which, if any, entities have followed the strategy of science? If we see these two different companies, with two diametrically-opposed outcomes, is it possible for us to either observe or infer which one did it correctly, and which one did not? (By “correctly”, I mean, scientifically).

Because it’s not just a matter of marketplace success, I imagine. Following a scientific strategy does not guarantee positive market outcomes; it only guarantees scientific truth. And the two are not necessarily the same. So we can’t just look at the winner and say, “that one was more scientific”.

At the same time, it really would be nice to know who actually follows the strategy, and who only pays lip service (or, to what degree, since I’m sure it’s not black-and-white, but more nuanced). Because even though I’m not going to go out and build a competing search engine, I really do get academically curious about which of various approaches are better. In this example it was visual search results versus text-only, on the desktop. But the same question is true of most of the products that I see and/or use: How do I as a consumer know that a product has been created for me because of science, marketing, HiPPOing, or for whatever other reason? Is it possible to observe something from the outside, and infer how it was arrived at?


I wish I had a good answer. Usually we don’t get sufficient access to decision-making processes to evaluate them — certainly not for decisions that happen outside our own organizations. And, even if we do have that access, we may lack objectivity.

