So You Like Big Data…

The increasing volume of data that we generate as a species is a story so overplayed as to have become trite. Indeed, a vast amount of this data is in the public domain, including data from the full text and common ngrams of books, genome research, the United States census, and much more. There is also open-source software not only to crawl the web, but also to search the data you crawl. So, if you’re an aspiring data scientist and just want to get your hands on data, there’s no excuse: go out and get it!

But perhaps you’d like to make a career out of your jones for big data. Luckily for you, some of the hottest companies around are hiring data scientists!

Of course, those jobs aren’t for everyone. To get an idea of the requisite math and computer science skills, I suggest you read the answers on Quora for “How do I become a data scientist?” I’m also a fan of Hilary Mason’s definition, which was cited in Ryan Kim’s “Wanted: Data Scientists to Turn Information Into Gold”: a data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. You can see Hilary’s full explanation in a blog post she co-authored with Chris Wiggins, entitled “A Taxonomy of Data Science”.

If the qualifications haven’t scared you off, then it’s just a question of where you can best apply your data scientist skills. The good news is that there are a lot of different ways to make a career out of working with big data. Here are some suggestions for what to work on. I apologize in advance for taking a US-centric perspective — if you’re outside the US, I can only hope that the examples have local analogs.

1) Web search.

Google, Yahoo, and Bing all collect an enormous amount of data from people’s web search activity. Google is, of course, the 800-pound gorilla, but don’t dismiss the others — even a single-digit market share is enough to derive extremely valuable insights from user activity. And, since every major search engine makes the bulk of its revenue from advertising, they all present the big-data challenges associated with computational advertising. Search is, in my view, the web’s killer app, so you can’t go wrong working on it. But temper your expectations — despite heroic efforts from various parties, it seems difficult to deliver revolutionary improvements to this field.

2) Social networking.

Here the biggest players are Facebook and Twitter, but you can find a more comprehensive list on Wikipedia. Many consider LinkedIn to be a social network, but I’ll take the liberty of discussing it in its own section. Social networks attract an outsized share of users’ attention: Facebook alone accounts for a quarter of US page views on the web! All of this user activity means a lot of data to crunch, so it’s not surprising that LinkedIn, Facebook, and Twitter are recognized as having the best data science teams. How much you’ll enjoy working at these companies will in part reflect the value (and values) you perceive in their offerings, but they are all playgrounds for data scientists.

3) Electronic commerce.

While ad-supported web search may be the killer app of the web, what opens up people’s wallets is e-commerce. Led by Amazon and eBay, e-commerce sites deserve much of the credit for turning the web from an esoteric research project into a mainstream staple. And, like their offline counterparts, e-commerce sites generate vast amounts of data from how users view and purchase products. This data drives user recommendations, merchandising campaigns, pricing strategy, and much more. If you’d like to pursue data-driven capitalism, then e-commerce may be for you. A word of caution: if you are one of a crowd of merchants selling the same products as everyone else (as opposed to a site like Etsy selling unique products), make sure you have a sustainable competitive advantage. Data science is necessary for success in e-commerce, but it may not be sufficient.

4) Digital content.

Whether it’s books, music, video, or apps, the long-prophesied digital convergence has arrived: almost every newly created piece of content is now distributed in electronic form. Here the biggest players are Amazon, Apple, and Google (particularly its YouTube subsidiary), but there is still a lot of flux as new hardware, software, and business models compete for dominance. Digital content poses two daunting challenges: the volume of published content far exceeds people’s available attention, and digital media products are experience goods that people can only evaluate after consuming them. For both of these reasons, the digital content industry depends on data scientists to help people find and discover what they like. The catch: from its advent, the digital content industry has struggled with unauthorized distribution (aka piracy), and the results of this struggle will determine which business models are viable.

5) Finance.

Money, money, money. Finance has always been a data-intensive business, and advances in technology have only increased the industry’s reliance on data scientists. Algorithmic trading, and high-frequency trading in particular, means that those who can most effectively and efficiently mine financial data can derive enormous financial benefits. Finance isn’t for everyone: the hours are long, the stress is high, and the compensation is highly variable. That said, the financial upside can be quite compelling, and some even enjoy the lifestyle.

6) Public sector.

Given the libertarian leanings of the software industry, the public sector might not seem like an obvious career choice. But some of the largest repositories of data reside there–from public repositories like census data to highly classified repositories restricted to the TLAs. Better understanding of this data can improve public policy, national security, and much more. Not everyone has the temperament to deal with government bureaucracy, but those who do have the opportunity to turn big data into big public good.

7) LinkedIn.

OK, I’m being self-serving, but after all this is my blog! LinkedIn is widely recognized as having one of the top data science teams on the planet. But LinkedIn has more than just talent: it has what Pete Warden of ReadWriteWeb described in “Secrets of the LinkedIn Data Scientists” as “detailed information on millions of people who are motivated to keep their profiles up-to-date, collect a rich network of connections and have a strong desire from their users for more tools to help them in their professional lives.” Indeed, I don’t know of anyone who has a dataset that competes with the combined quantity, quality, and utility of LinkedIn’s data. Moreover, working as a data scientist at LinkedIn means helping make people more professionally successful by connecting them to opportunities, information, and of course other people. It’s a wonderful way to create value, and it doesn’t hurt to do so in the context of a profitable, rapidly growing company.

And LinkedIn recognizes the extraordinary value of data science. Don’t take my word for it: listen to LinkedIn CEO Jeff Weiner’s interview at the 2010 Web 2.0 Summit.

To wrap up, data science is more than just an opportunity to have fun and make the world a better place — it might even be how you make an honest living!