One of the highlights of the recent Data 2.0 Summit was a panel featuring:
- Alexander Gray, CTO of Skytree
- Anthony Goldbloom, CEO of Kaggle
- Josh Wills, Director of Data Science at Cloudera
The panel was supposed to focus on “Data Science and Predicting the Future”, but the most contentious topic was whether data, algorithms, or people (that is, the data scientists themselves) were the most important factor in the practice and success of data science.
Yes, we one-upped the debate that my colleague Monica Rogati instigated at this year’s Strata conference. In fact, Josh cited the “better data beats more data beats clever algorithms” argument that Monica made in her own Strata presentation. And, just like at Strata, there was a healthy dose of audience participation.
Of course, I came down on the side of data — which I believe won the debate hands down.
I’m a fan of clever algorithms, which Alexander had to defend given that Skytree’s core value proposition is better machine learning algorithms delivered at scale. But I’m with Peter Norvig et al. on the dominance of data over algorithms.
Favoring data over people was a harder choice. Anthony naturally made the case for people (Kaggle’s claim to fame is assembling many of the world’s best data scientists by organizing competitions). Hopefully my team won’t quit en masse when they read this blog post! But I think they’ll agree with me that, without the incredible data we work with at LinkedIn, they’d be unable to deliver the awesomeness that I’ve come to expect from them.
There’s a saying that we all cook from the same cookbooks, so that it’s the ingredients that make all the difference. To take the metaphor further, you can also try to poach your rival’s chefs. But data is the biggest entry barrier — and the most sustainable competitive advantage.
Of course, we should have the best people apply the best algorithms to work with the best data. But data comes first. The best meal starts with the best ingredients.
7 replies on “Data, Algorithms, and People”
I can see how the output of this debate is beneficial (striving through the strengths and weaknesses of each part). Still, the best data is hardly worth the hard drive it sits on without human intervention…even newbie data analyst intervention at that. The best chef is still going to go out of business with a steady supply of rotten ingredients. At the end of the day, these either/or programmatic arguments are silly. It’s the whole, the sum of the parts that matters…nothing comes first except the drive to create something excellent…hmm, but that comes from people! Well, you know what I mean.
James, of course people are crucial — that’s why there is such a war for talent! But it’s also why great talent follows great data: people are a lot more mobile than proprietary data.
Of course, that’s also an argument in favor of open data as a public good. But a lot of data won’t (and shouldn’t!) be open any time soon, whether it’s web search logs, social network activity data, etc. So people like me will continue to follow the data.
All I’m saying is the framework that surrounds the great data is what actually draws you. You’re at LinkedIn because of the system (everything) that nurtured and grew the data, which birthed from some very smart individuals. At some point, yes, what they’ve created will seem to outweigh the creators themselves in importance, but I’m not so easily convinced. Data doesn’t create anything new, people do, and people want new things.
But I wouldn’t mind getting my hands on some of your data 😛
James, I think we’re in violent agreement. It does take people to recognize the value of data as a first-class citizen in an organization.
As for your desire to get your hands on our data, I will note that two of our data scientists used to work at Juice. No pressure. 🙂
It’s true. You and your blasted seductive data powers 😉
A little violence here and there that ends in holistic agreement is well worth it. Now on to ruling the world…
[…] In the end, we all have our perspectives, based perhaps on what we work on, but I do think that the “better data” perspective is often lost in the rush toward larger datasets with more complex algorithms. For more on this perspective, here and here are two blog posts I found interesting on the subject. Daniel Tunkelang blogged about the same panel here. […]