Strata 2012: Big Data is Bigger than Ever!

I spent the last three days at the O’Reilly Strata Conference, an assembly of over 2500 people focused on data science and its applications. While I’m wary of industry conferences from attending vendor-fests in my past life in the enterprise software world, Strata is an exceptionally good conference. The speakers were a who’s who of data science, including Lucene and Hadoop creator Doug Cutting, search user interface pioneer Marti Hearst, and Google chief economist Hal Varian. You can find the tweet stream for the conference at hashtag #strataconf.


I spent Tuesday in the Deep Data session, billed as a no-holds-barred program for data scientists. My two favorite talks:

  • Claudia Perlich, winner of three KDD cups, talked about using information to pick the right action and to influence people such that they behave in a way that is better for them, better for us, and possibly better for society in general.
  • Monica Rogati, my colleague at LinkedIn and the epitome of a data scientist, delivered a fantastic talk about machine learning models and training data in the real world, extending Peter Norvig’s point about the “unreasonable effectiveness of data” to observe that more data beats clever algorithms but better data beats more data.

But the most fun that day was the Oxford-style debate featuring Drew Conway, Pete Skomoroch, Mike Driscoll, DJ Patil, Amy Heineike, Pete Warden, and Toby Segaran. The question proposed was absurdly Manichean: if you had to hire your first data scientist and could only hire one, would you pick a domain expert or a machine learning expert? After the moderator suppressed some initial attempts to hedge (“both”, “it depends”, etc.), the debaters ripped into the question by taking extreme positions and defending them with gusto. It was a lot of fun, with enthusiastic audience participation and the debaters exploiting their inside knowledge of their opponents’ work histories. In the end, the machine learning side won by a small margin.

I then had the good fortune to grab dinner with Marti Hearst and Hal Varian at Xanh — a wonderful mix of great food and conversation.


The Wednesday morning keynote session offered some gems:

  • Cloudera CEO Mike Olson urged big data practitioners to focus on guns, drugs, and oil.
  • Doctor and data geek Ben Goldacre delivered a mesmerizing and disturbing talk about the suppression of inconvenient medical trial results and analytical tools to discover it.

But the person who stole the show was Google’s Avinash Kaushik, who talked about making love with data to find orgasm-inducing actions to change the world and make more money. Unfortunately this was the one talk that was not recorded, but you can read the summary on Avinash’s Google+ page.

As a speaker, I held “office hours” on Wednesday. It was supposed to be a 40-minute slot for conference attendees to come and ask me questions. But somehow those 40 minutes extended into three hours of conversation about everything from normalized KL divergence to interview problems, and segued into a reception with specialty big-data cocktails. By the time I got back to my apartment, my voice, brain, and liver were spent.


I spent most of Thursday morning in the speaker lounge, recovering from the previous evening and putting the finishing touches on my presentation. But I couldn’t resist attending a two-part session on privacy. Indeed, this session was distinctive enough to merit its own hashtag: #strataprivacy.

The first part featured O’Reilly’s Alex Howard moderating Intelius Chief Privacy Officer Jim Adler and NYU PhD student Solon Barocas on a panel provocatively titled “If Data Wants to Be Free, is Privacy a Prison?” It was a great discussion, and I enjoyed the opportunity to offer my own provocative question through Twitter. Since the panelists were arguing that it was unethical to infer private facts from public data, I asked if they were trying to establish a new form of thoughtcrime.

The second panel, entitled “Pretty Simple Data Privacy”, featured Kaitlin Thaney from Digital Science, Betsy Masiello from Google, and John Wilbanks from the Kauffman Foundation for Entrepreneurship. Given that today was the first day of Google’s new privacy policy, there was no avoiding focus on the associated controversy. I did try to get Betsy to address my charge that Google doesn’t think users own their search history (cf. “Google vs. Bing: A Tweetle Beetle Battle Muddle”), but she said she was unfamiliar with the details of that event. I do wish that someone at Google with more familiarity would respond publicly.

Back to the speaker room after lunch, until my own talk with Samasource’s Claire Hunsaker on “Humans, Machines, and the Dimensions of Microwork”. I’ll post the slides (and there will be a video on the conference site), but the sound bite is that you need to keep crowdsourcing tasks simple, manage the trade-off between task value and difficulty, and watch out for systematic bias.

I wrapped up the conference by hearing William Gunn talk about how Mendeley is disrupting bibliometrics and perhaps the entire academic publishing and reputation ecosystem. I laud his ambition and wish him and Mendeley luck in this quest.

In summary, three days of great talks, conversations, and general enjoyment. My thanks to Strata organizers Edd Dumbill and Alistair Croll for putting together such an outstanding event and for giving me the opportunity to participate.

By Daniel Tunkelang

High-Class Consultant.

21 replies on “Strata 2012: Big Data is Bigger than Ever!”

my colleague at LinkedIn and the epitome of a data scientist, delivered a fantastic talk about machine learning models and training data in the real world, extending Peter Norvig’s point about the “unreasonable effectiveness of data” to observe that more data beats clever algorithms but better data beats more data.

I’d like to insert a placeholder here for a big data argument.. algorithms vs. data vs. better data vs. better interaction design.


Glad you all enjoyed the write-up. I agree that Strata is doing an impressive job of providing the best of both worlds: the substance of an academic conference combined with the groundedness of a practitioner focus.

And yes, Jeremy, we’ll have a separate post to analyze your Manichean perspective on better data vs better interaction design :-).


we’ll have a separate post to analyze your Manichean perspective on better data vs better interaction design 🙂

Pfft.. Manichean. I see your smiley face, but pfft, nevertheless! 😉

But seriously, the duality is usually between more data vs. better algorithms. All I’m saying is that it is a triumvirate: data, algorithms, and process. The last part, the process, is usually ignored. But it is what allows you to connect data to algorithms. It’s the glue that holds it all together. Bad glue = bad outcome.


Jeremy, not sure we actually disagree! It’s clear that you can screw the pooch despite having lots of great data and clever algorithms if you don’t know what you’re doing. That’s why people like us are gainfully employed! Even a service like Kaggle offers a competition of human experts, not machines.


Aw, but I’d really like to disagree. Otherwise, how are we going to have an interesting argument? 🙂

But honestly, I rarely if ever hear folks talk about process. The dichotomy is always between data and algorithm. It’s never about the fluid, oft capricious, infinitely more important question of how and when and especially why various pieces of data are wed to various scraps of algorithm.

In fact, I remember having a discussion with you a year or two ago, when we were debating the core of what HCIR meant. I didn’t see HCIR only as faceted search, or only as visual search, or only as just a way of allowing humans more decision-making capability. Rather, I see HCIR as a system that allows humans input in deciding when and how and where various pieces of data should be fed to various algorithms, and when and how and where various algorithms should be wrapped around various pieces of data.

Most information systems, web search systems especially, feed data to algorithms, or wrap algorithms around data. But they embody a completely rigid and inflexible process. The user is given no way of altering the ultimate procedural goals of the system. Those are set by the system designers. And even if those goals are set in a data-driven way, once that historical data has spoken, you the current system user have no way of disagreeing and altering the process.

I want that process to be alterable not by experts such as you or me, but by the users of the system themselves. That’s HCIR. HCIR is not about allowing users to alter the data, or to alter the algorithm, but instead to alter the relationship between the data and the algorithm. I.e., the process.


Data and algorithms generally feel concrete, while process often feels squishy. To butcher Kelvin’s mantra, if you can’t quantify it, you can’t publish.

Can we bring more concrete evidence to bear? Examples or study designs to test the hypothesis that process is as or more essential than data and algorithms?


“Can we bring more concrete evidence to bear? Examples or study designs to test the hypothesis that process is as or more essential than data and algorithms?”

The literature is scattered with these sorts of things. Though scattered is the correct term, not centralized.

The first thing that pops to mind is this study:

In it, the Google process is compared against Diane’s process. The Google process is “use a small query box, stay simple, and use as few words as possible to describe your need”. Diane’s process is “use a bigger box, and let the users use as many words as needed. Don’t encourage them to use a ‘stay small’ process”.

And Diane’s process won.


[…] CTO and co-founder of the Metamarkets Group. He moderated a fantastic debate at the recent Strata conference about the relative importance of domain expertise or machine learning for data […]


I’d be interested in connecting with you to discuss ways we can mutually promote each other. I offer a global event focused on Predictive Analytics and we are rolling out an industry newsletter. As part of the newsletter, we’d like to promote your blog to our subscriber base. I’d also be interested in reviewing an affiliate program if you are interested and have a strong enough following?

I look forward to hearing back.

Adam R. Kahn
Chief Operating Officer, Rising Media, Inc.
O: 508.644.0639


Adam, glad to have you as a reader. And I’m glad you’ve been attracting industry leaders like my colleague Scott Nicholson to participate in Predictive Analytics World.

Seeing how this is not a commercial blog, I’m not interested in affiliate programs. But thanks for the offer. And you’re certainly welcome to promote my blog in your newsletter if you think it would benefit your readers.

I sometimes write about events here myself, but mostly when I am participating in them personally and have something to contribute to the discussion.

