So You Like Big Data…

The increasing volume of data that we generate as a species is a story so overplayed as to have become trite. Indeed, a vast amount of this data is in the public domain, including the full text and common ngrams of books, genome research, the United States census, and much more. There is also open-source software not only to crawl the web, but also to search the data you crawl. So, if you're an aspiring data scientist and just want to get your hands on data, there's no excuse: go out and get it!

But perhaps you'd like to make a career out of your jones for big data. Luckily for you, some of the hottest companies around are hiring data scientists!

Of course, those jobs aren't for everyone. To get a sense of the necessary qualifications, I suggest reading the answers to the Quora question "How do I become a data scientist?", which cover the requisite math and computer science skills. I'm also a fan of Hilary Mason's definition, cited in Ryan Kim's "Wanted: Data Scientists to Turn Information Into Gold": a data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. You can see Hilary's full explanation in a blog post she co-authored with Chris Wiggins, entitled "A Taxonomy of Data Science".

If the qualifications haven’t scared you off, then it’s just a question of where you can best apply your data scientist skills. The good news is that there are a lot of different ways to make a career out of working with big data. Here are some suggestions for what to work on. I apologize in advance for taking a US-centric perspective — if you’re outside the US, I can only hope that the examples have local analogs.

1) Web search.

Google, Yahoo, and Bing all collect an enormous amount of data from people’s web search activity. Google is, of course, the 800-pound gorilla, but don’t dismiss the others — even a single-digit market share is enough to derive extremely valuable insights from user activity. And, since every major search engine makes the bulk of its revenue from advertising, they all present the big-data challenges associated with computational advertising. Search is, in my view, the web’s killer app, so you can’t go wrong working on it. But temper your expectations — despite heroic efforts from various parties, it seems difficult to deliver revolutionary improvements to this field.

2) Social networking.

Here the biggest players are Facebook and Twitter, but you can find a more comprehensive list on Wikipedia. Many consider LinkedIn to be a social network, but I'll take the liberty of discussing it in its own section. Social networks attract an outsized share of users' attention: Facebook alone accounts for a quarter of US page views on the web! All of this user activity means a lot of data to crunch, so it's not surprising that LinkedIn, Facebook, and Twitter are recognized as having the best data science teams. How much you'll enjoy working at these companies will in part reflect the value (and values) you perceive in their offerings, but they are all playgrounds for data scientists.

3) Electronic commerce.

While ad-supported web search may be the killer app of the web, what opens up people’s wallets is e-commerce. Led by Amazon and eBay, e-commerce sites deserve much of the credit for turning the web from an esoteric research project into a mainstream staple. And, like their offline counterparts, e-commerce sites generate vast amounts of data from how users view and purchase products. This data drives user recommendations, merchandising campaigns, pricing strategy, and much more. If you’d like to pursue data-driven capitalism, then e-commerce may be for you. A word of caution: if you are one of a crowd of merchants selling the same products as everyone else (as opposed to a site like Etsy selling unique products), make sure you have a sustainable competitive advantage. Data science is necessary for success in e-commerce, but it may not be sufficient.

4) Digital content.

Whether it's books, music, video, or apps, the long-prophesied digital convergence has arrived: almost every newly created piece of digital content is now distributed in electronic form. Here the biggest players are Amazon, Apple, and Google (particularly its YouTube subsidiary), but there is still a lot of flux as new hardware, software, and business models compete for dominance. Digital content poses two daunting challenges: the volume of published content far exceeds people's available attention, and digital media products are experience goods that people can only evaluate after consuming them. For both of these reasons, the digital content industry depends on data scientists to help people find and discover what they like. The catch: from its advent, the digital content industry has struggled with unauthorized distribution (aka piracy), and the results of this struggle will determine which business models are viable.

5) Finance.

Money, money, money. Working in finance has always been a data-intensive business, but advances in technology have only increased the industry's reliance on data scientists. Algorithmic trading — and high-frequency trading in particular — means that those who can most effectively and efficiently mine financial data can derive enormous financial benefits. Finance isn't for everyone — the hours are long, the stress is high, and the compensation is highly variable. That said, the financial upside can be quite compelling, and some even enjoy the lifestyle.

6) Public sector.

Given the libertarian leanings of the software industry, the public sector might not seem like an obvious career choice. But some of the largest repositories of data reside there, from public repositories like census data to highly classified ones restricted to the TLAs (three-letter agencies). Better understanding of this data can improve public policy, national security, and much more. Not everyone has the temperament to deal with government bureaucracy, but those who do have the opportunity to turn big data into big public good.

7) LinkedIn.

OK, I'm being self-serving, but after all this is my blog! LinkedIn is widely recognized as having one of the top data science teams on the planet. But LinkedIn has more than just talent — it has what Pete Warden of ReadWriteWeb described in "Secrets of the LinkedIn Data Scientists" as "detailed information on millions of people who are motivated to keep their profiles up-to-date, collect a rich network of connections and have a strong desire from their users for more tools to help them in their professional lives." Indeed, I don't know of anyone who has a dataset that competes with the combined quantity, quality, and utility of LinkedIn's data. Moreover, working as a data scientist at LinkedIn means helping make people more professionally successful by connecting them to opportunities, information, and of course other people. It's a wonderful way to create value, and it doesn't hurt to do so in the context of a profitable, rapidly growing company.

And LinkedIn recognizes the extraordinary value of data science. Don't take my word for it — listen to LinkedIn CEO Jeff Weiner's interview at the 2010 Web 2.0 Summit.

To wrap up, data science is more than just an opportunity to have fun and make the world a better place — it might even be how you make an honest living!

By Daniel Tunkelang

High-Class Consultant.

30 replies on “So You Like Big Data…”

Mendeley, a "… for research" and the world's largest database of academic literature, is always on the hunt for Data Scientists.

One thing we need to see CS programs start doing more of is making use of real-world-sized data problems requiring technology such as Hadoop. Too often I see incoming CVs from fresh graduates looking for a junior position who have never handled terabytes of noisy data. Data mining on a gigabyte or 10 GB of clean structured data just isn't reality and doesn't prepare students for work at places such as bitly, foursquare, Twitter, LinkedIn, etc.

Jason Hoyt
Chief Scientist – Mendeley


Just when you think information overload as a subject is done, it pops up again. Last year, Clay Shirky said that information overload has been around since the day there were more books available than a person had time to read, and that each time, new information filters are invented so that people can find what they need.

Lately, I’ve been thinking about two things: i) what happens when billions of feature phone users transition to mobile internet devices (smartphones) and those devices become their main access to the real-time web generating trillions of data items? and ii) are orgs going to become dependent on large Hadoop-style clusters continuously chugging away on real-time data streams before they start making sense of the data?

The problem with information overload today is the exponential nature of it.


Re: Digital content

“For both of these reasons, the digital content industry depends on data scientists to help people find and discover what they like.”

There is an unspoken assumption in the whole “digital content = another form of big data” that you’re not really looking at the digital content itself, but at patterns of activity around that digital content. Maybe those patterns of activity are big data, but the digital content itself is not.

Why do I say that? Take music for example. How many songs are there, total, in existence? I Binged it, and got a page with rough estimates ranging from 2 million to 97 million.

For the sake of argument, let’s say that it’s the larger number: 97 million. If we’re doing content-based music recommendation, that means we’ve done some sort of music analysis on each piece. It could be automated analysis, like Echonest. Or it could be manual, human analysis, like Pandora.

But the results of this analysis leave us with a couple hundred features per song. How much space does it take to represent these features? Given that the features are usually short strings, such as genre, tempo, artist, year, key signature, time signature, harmonic profile, etc., I'll bet it doesn't take more than 10 KB of data per song.

That comes out to what.. about a terabyte of data (somebody check my math)? That's still small data by today's standards, not big data.
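Taking the commenter up on the invitation to check the math, here is a quick back-of-envelope calculation, using the estimates from the discussion above (~97 million songs, ~10 KB of features per song); both numbers are rough guesses, not measurements:

```python
# Back-of-envelope: storage needed for content features on every song in existence.
# Assumptions (from the discussion above): ~97 million songs, ~10 KB per song.
songs = 97_000_000
bytes_per_song = 10 * 1024  # ~10 KB of short string features per song

total_bytes = songs * bytes_per_song
total_tb = total_bytes / 1024**4

print(f"total: {total_bytes:,} bytes (~{total_tb:.2f} TB)")
```

Under these assumptions the total comes out to roughly a terabyte — three orders of magnitude more than a gigabyte, but still modest next to the multi-terabyte behavioral datasets discussed elsewhere in this thread.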

And it’s not like the number of songs digitally recorded per year is increasing exponentially. It might be going up in recent years with the advent of cheap home computer digital studios. But it’s not exponential.

So content-based digital music recommendation is actually a small-data problem, not a big data problem. It only becomes big data if you assume that worldwide aggregated usage statistics are a necessary component of the recommendation algorithm.


Folks thanks for the comments. I apologize for not being comprehensive, but it’s a great excuse for anyone who is hiring data scientists to chime in. 🙂

Jason: given how new Hadoop is, I think that CS programs are actually reacting pretty quickly to include it. And yes, the experience with terabytes of data is obviously helpful. But students can get that through internships, so their academic curricula shouldn’t be the bottleneck.

Dinesh: I don’t think that working with data will ever be reduced to mindless analysis. It’s easy to generate more data simply by looking at the world in finer granularity. That’s why intuition will always be crucial to guide what data is gathered and analyzed.

Jeremy: you’re right that purely content-based recommendation does not involve massive amounts of data, at least for music. But I do think there’s a lot of value in analyzing patterns of activity around digital content — especially for people whose interest in the space is commercial.


But I do think there’s a lot of value in analyzing patterns of activity around digital content — especially for people whose interest in the space is commercial.

So does the data scientist necessarily include the content-based analysis part of digital music, in order to be a digital music data scientist? Or can/should/does the scientist stick strictly to the patterns of behavior?


Content certainly matters, whether it's music or people! I was simply agreeing that the data scale is less daunting if you only consider content and ignore behavior. I'd also say that the distinction between the two isn't always clear.


I don’t think I was very clear. Let me express it a different way:


Pandora is an example of one of the most successful music recommendation services out there. And as far as I understand it, Pandora is _not_ big data driven.

Instead, Pandora pays a small team of "expert" musicians to sit in a room and manually listen to and annotate every song with a vector of 400 features.. everything from time and key signature to the amount of vibrato in the singer's voice. For recommendation, Pandora then just does a diversity-sensitive k-nearest-neighbor (kNN) walk in feature space.
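The core of such a recommender fits in a few lines. The sketch below is purely illustrative: the songs, feature values, and distance function are all invented, and it omits the diversity-sensitivity mentioned above (Pandora's actual features and algorithm are not public):

```python
import math

# Toy catalog: each song gets a small, hand-annotated feature vector
# (e.g. tempo, vibrato, distortion). All songs and values here are invented;
# real annotations would come from trained musicians.
catalog = {
    "Song A": [0.9, 0.1, 0.8],
    "Song B": [0.85, 0.15, 0.75],
    "Song C": [0.2, 0.9, 0.1],
    "Song D": [0.8, 0.2, 0.7],
}

def distance(u, v):
    """Euclidean distance in feature space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recommend(seed, k=2):
    """Return the k nearest neighbors of the seed song in feature space."""
    seed_vec = catalog[seed]
    others = [(distance(seed_vec, vec), name)
              for name, vec in catalog.items() if name != seed]
    return [name for _, name in sorted(others)[:k]]

print(recommend("Song A"))  # the two songs closest to Song A in feature space
```

Note that nothing here grows with the number of listeners: the whole dataset is a fixed feature vector per song, which is the "small data" point being made.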

The number of features is small (400). The number of songs is (when compared to the other terabyte datasets out there) also small.

So what makes Pandora work is not big data, but big expertise. Musicians, who manually listen to each song and have the musicological training to give the correct feature labels.

Granted, in relatively recent history, Pandora has been attempting to add social features to its service, by letting you see what your friends are listening to.

But that is very different from big data analysis for the purposes of recommendation. The analysis is still in “music expert feature space”, i.e. small data.

Now, if that approach works for Pandora, if digital content isn’t big data.. what else might we be able to solve with lesser, but smarter, data as well?

I think that’s what I’m trying to say: Is the way forward really through big data, or is it through smart data?


Pandora certainly benefits from taking an intelligent approach to annotating content up front. I actually see similarities to LinkedIn here, except that Pandora relies more on experts while LinkedIn relies significantly on users being motivated to create their own profiles of record.

But a key part to Pandora’s success is how it reacts to users’ feedback. I am personally impressed with how it learns from my thumbs-up and thumbs-down clicks. I don’t know the algorithmic details — and in particular whether one user’s feedback can affect another user’s experience. I would think it would — learning across users would seem to offer advantages in dealing with sparsity. In that case, I’d say that Pandora is doing big data analysis.
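One way per-user feedback could plausibly work, sketched below, is to maintain a taste vector per user in the same feature space as the songs and nudge it toward thumbed-up songs and away from thumbed-down ones. To be clear, this is a guess at a mechanism, not Pandora's actual algorithm, and all the numbers are invented:

```python
# Hypothetical sketch of per-user feedback on a content-based station.
# This is speculation about a mechanism, not Pandora's actual algorithm.

def update_taste(taste, song_features, thumb, rate=0.1):
    """Move the user's taste vector toward (thumb=+1) or away from (thumb=-1) a song."""
    return [t + rate * thumb * (f - t) for t, f in zip(taste, song_features)]

taste = [0.5, 0.5, 0.5]   # neutral starting point in feature space
upbeat = [0.9, 0.1, 0.8]  # features of a song the user thumbed up
taste = update_taste(taste, upbeat, thumb=+1)
print(taste)  # taste has shifted slightly toward the thumbed-up song
```

Each update here is self-contained to one user, which is exactly why this kind of personal feedback stays "small data"; it would only become a big-data problem if feedback were pooled across the whole user population.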

But I'd love to do more than speculate. Does anyone here know enough about how Pandora works to resolve this debate? I tried Quora to no avail.


You already know I am a big fan of personal, explicit relevance feedback. No gripes from me there. But yes, even if it's personal-only, that's still small data, because each learning instance need only be self-contained to that user.

And yes, I also agree that if there were cross-user (whole population) feedback, that would be big data. Though probably still on the smaller side of big data.. because people probably give fewer thumbs up/down per day than they do search+clicks. Not to mention that there are fewer people using the service, overall. But yes, I would concede that would be big data, if it worked that way.

Still I have to wonder: (1) Does it work that way (also your question), and (2) Even if it did, do you really think that would be better?

Music is so intensely personal. Often if I like a song, the reason I like it might not transfer, socially. I'll admit, there is a Britney Spears song or two that I don't completely gag on, but mainly because they're great to teach dance to. If I started getting recommendations based on my own thumbs up, it would only take into account the musicological properties of the song, and then find me other songs with similar rhythms and chord progressions.. songs that were great to teach west coast swing or cha cha to. But if I started getting socially-based recommendations, what all the other Britney fans thumbs-upped, it might really throw me off.

Music, I think, isn’t like web search in that way. It’s more deeply, intensely, personal.

At least that’s my own personal feeling.

I wonder if there are papers out there. I’ve been too busy to read the past year or two of ISMIR conference proceedings, to see if anyone has published on this.


Hmm.. that Quora link you sent.. the first guy says that they have about 4000 features per song. I need to correct that. I personally attended a small gathering a few years ago where their CTO said it was around 400. And while it has been a few years, I highly doubt that they paid musicians to go back and listen to all those songs again, and create 3600 new annotations for every song.


I do think that, as personal as musical taste is, we can benefit from analysis of each others' tastes. I've personally found that, when I share one niche taste with someone, I'm likely to share more. Perhaps content-based recommendation captures some of those additional elements of my taste, but I sometimes find that recommendations from people with overlapping tastes capture elements that don't necessarily show up in content-based methods. Arguably that just means we need more features, but social mechanisms may be a more practical way to obtain these signals.

Anyway, it's clear that both content-based and user-based methods can contribute signal, regardless of how exactly Pandora works. But it would be nice to know. And of course there are other music recommendation engines out there that take "big data" approaches — in particular, Last.fm seems to use collaborative filtering.
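The collaborative-filtering approach mentioned above can be approximated with simple co-occurrence counting over listening histories. The listeners and artist sets below are invented for illustration; real systems work over millions of histories, which is where the "big data" character comes from:

```python
from collections import Counter
from itertools import combinations

# Invented listening histories: each listener's set of frequently played artists.
histories = [
    {"B-52s", "Devo", "The Cure"},
    {"B-52s", "The Cure", "Talking Heads"},
    {"B-52s", "Devo", "The Go-Gos"},
    {"Chopin", "PJ Harvey"},
]

# Count how often each pair of artists co-occurs in a listener's history.
cooc = Counter()
for history in histories:
    for a, b in combinations(sorted(history), 2):
        cooc[(a, b)] += 1

def similar_to(artist, n=2):
    """Rank other artists by how many listeners they share with `artist`."""
    scores = Counter()
    for (a, b), count in cooc.items():
        if a == artist:
            scores[b] += count
        elif b == artist:
            scores[a] += count
    return [name for name, _ in scores.most_common(n)]

print(similar_to("B-52s"))  # artists most often co-listened with the B-52s
```

Unlike the fixed per-song feature vectors discussed earlier, this co-occurrence table grows with every listener and every play, which is why user-behavior approaches scale into big-data territory while content-based ones need not.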


Uhmm! I know a little about music recommenders using Last.fm data. Take "Chopin" as the example: I know very little about him, but a while back I read that he was popular with people who were not that much into classical music.

This is Last.fm's recommendation for similar artists to Chopin.

The weird learning experience was that CF and, by extension, Last.fm's recommendations are pre-calculated.

The Xyggy Music Recommender demo uses playcount data for 360K artists from ~400K listeners. Xyggy's recommendations for Chopin show a surprising element of variety based on real listener patterns: a mixture of classical and non-classical artists.

Here are the Xyggy recommendations for Chopin & PJ Harvey. Yes, this must appear weird. I mean, who wants to listen to music that is similar to both Chopin and PJ Harvey? Well, maybe someone out there does, and it speaks to Jeremy's point that music is a very personal thing. Where is the "personal" in Pandora or Last.fm?


And of course there are other music recommendation engines out there that take "big data" approaches — in particular, Last.fm seems to use collaborative filtering.

And with no offense to Last.fm… the recommendations there seem very.. safe. Unsurprising. Fat head, no long tail. Like the B-52s? Then you also might like Devo, The Go-Gos, The Cars, etc.

I can't imagine that you can't do as well as that, or better, with small-data Pandora. Heck, just a single feature (creation date, i.e. 80s music) gets you 90% of the way there.

In fact, I just went to Pandora, and created a B-52s station. The first song it played was by the B-52s. The next three songs were from Talking Heads, The Cars, and The Cure.

Hmm.. not that these recommendations are any less safe. But they're just as good.. if not better. (At least in my 80s social circle, you were much more likely to find the B-52s and The Cure together than the B-52s and The Go-Gos.)

And if the Pandora recommendations were created by content feature-vectors only.. well.. why waste your hardware and programming effort on big data?


In our experiments with Last.fm data, combining user-generated data (tags and comments) with content data (listener playcounts only) does improve the recommendations, but I'd say not by much. The demo today uses listener playcounts only. It would require extensive user testing to determine whether the additional cost of processing user-generated data is worth it. This goes back to what I asked previously: when do you reach a point of diminishing returns with user-generated data?

There is definitely big data out there (e.g. Facebook updates, Twitter, scientific data), but I'm wondering whether, in reality, there are simply more pools of data available than before, none of which is big in itself.


@Dinesh: listener playcounts are not "content data". They may be content metadata, or big data "around" content. But what I mean by content data is the actual signals in the music itself.. notes, rhythms, etc.


While writing the note I said to myself, “Jeremy’s not going to like this”!

Completely agree, but listener playcounts are not user-generated data in the conventional sense either.

Please don’t firehose me on this as we are both agreed, honestly!


Dinesh — Does “firehose” mean write a long response? No worries; won’t do it.

Instead, let me call out my own contradictory remarks. I said, “But what I mean by content data is the actual signals in the music itself.. notes, rhythms, etc.”

In actuality, it also includes the creation date. My bad.

Where I was trying to draw a distinction was between aspects of the data that change vs. those that don't. Creation date, chord structure, rhythm, etc. don't change. It's a small feature vector, small data.

But listener playcounts are more like user-generated tags: they change and grow, with more appearing the more people listen to a song. They're the type of metadata that leads to "big" data.

So that's all I'm asking.. do you really need "big" data, when it appears (from Pandora) that you can do just as well (if not better) using "small" data alone?


Sure, listener playcounts change with time whereas content doesn't. But playcounts are the only data we use, and they make for a pretty small-data music recommendation system. That supports your point that big data is not needed to deliver quality services.

I haven't got my thoughts together around all the issues, but I'm wondering what the real fuss about big data is, once you subtract all the hype.


Dinesh: I think some of the hype is that there are certain domains that don't allow any kind of static content analysis like one has in music. All they have is the big data: social actions, millions of sensor signals.

So I think people are excited about trying to extract meaning from that.


Very interesting, Dinesh. Thank you for pointing that out.

Two tidbits from that conversation. One person says:

“As a “data person”, I’m most “turned on” by how I can use the past to impact the future”

..and another says:

“I feel it’s important to point out a point Norvig makes in his talk (warning: gross generalization caveat): with enough data, you can discover patterns and facts using simple counting that you can’t discover in small data using sophisticated statistical and ML approaches.”

I understand that this is why people are excited.. that you can find patterns in large data that you can’t find in small data. But because the patterns that you are able to find have less to do with sophisticated analysis than with counting, I have a hard time seeing how those discovered patterns are in any way “surprising”, i.e. how you are discovering anything that you really didn’t know before.

If it takes so much big data for the signal to rise above the noise, it seems to me that the signal which does rise above the noise is going to be the more generic, head, top, common type of signal. The "Coldplay is popular" type of signal. The fat belly of the curve, and the long tail of the curve, are still going to be hidden or buried along with the noise if all you're doing is counting.

So I’m still scratching my head in wonder.. why so much excitement over discovering that people like the Beatles? I see that people want to use the past to predict the future, but waiting for the big head of the most frequent counts to emerge is about the safest, most unsurprising future prediction out there, so unsurprising that I’m hard pressed to know exactly why you need big data to predict it.
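Jeremy's "counting only surfaces the head" point is easy to demonstrate on a toy skewed playcount stream. Everything below (the artists, the head/tail split, the stream size) is invented purely to illustrate the shape of the argument:

```python
import random
from collections import Counter

random.seed(0)  # make the toy stream reproducible

# Toy skewed popularity distribution: two head artists dominate the sampling
# weights, while 100 long-tail artists each get a tiny slice.
artists = ["Beatles"] * 50 + ["Coldplay"] * 30 + [f"tail_{i}" for i in range(100)]
stream = [random.choice(artists) for _ in range(10_000)]  # simulated playcounts

counts = Counter(stream)
print(counts.most_common(3))  # raw counting surfaces only the head artists

head_share = (counts["Beatles"] + counts["Coldplay"]) / len(stream)
print(f"head share: {head_share:.0%}")
```

Running this, the top of the count table is exactly the two artists we built in as popular, while each tail artist's count stays close to the noise floor — counting rediscovers what the sampling weights already encoded, which is Jeremy's point about unsurprising head signal.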



