The Unreasonable Effectiveness of Data

Over the past week, there’s been lots of commentary about “The Unreasonable Effectiveness of Data”, an article by Googlers Alon Halevy, Peter Norvig, and Fernando Pereira in the most recent issue of IEEE Intelligent Systems.

Here are a few posts that have been appearing in my RSS reader:

I’m intrigued by the amount of attention this paper has attracted–especially the vitriol in Stefano’s post:

What upset me about that paper is not how they say “oh sure, structure is great, but look over here: there is a goldmine in all the sand” (which is something I fully resonate with) but that they phrased it as a fight, deterministic vs. statistical, trying to convince people that adding structure is not the way to go, that it’s basically a global waste of research resources.

And yet, without the <a> tag (that is: machine-readable imposed structure), they wouldn’t be where they are, nor would they be able to speak from such a tall soapbox.

I’m actually sympathetic to the view that it’s usually better to have more data than heavier theoretical machinery. But I’ve seen this view taken to an extreme so absurd as to be worthy of an April Fool’s joke–in Chris Anderson’s Wired article about “The End of Theory”. Moreover, that same article quotes Peter Norvig as saying that “All models are wrong, and increasingly you can succeed without them.” Note: Peter Norvig explains here that he was misquoted.

So perhaps Stefano is right to react so harshly.

By Daniel Tunkelang

High-Class Consultant.

27 replies on “The Unreasonable Effectiveness of Data”

I had a similar gut reaction against the paper. I think I finally cracked when they said that, in order to remove an object from a picture, you need millions of examples of pictures taken from that (or a similar) location. That data-only approach might very well work.. but only for a few very choice locations around the globe.. i.e. in front of the Eiffel tower, Big Ben, the pyramids, etc. But Zipf’s law tells us that for every location with millions of available photos, there will be millions of locations with only a couple of photos.

And so that means that the data-only approach will succeed in, say, 5 of 5,000,000 cases.. or 0.0001% of the time. That’s hardly confidence-inspiring.


To Daniel: thanks for the nice summary and for your analysis. Look at to see that I, like you, disagree with the Chris Anderson article and the quote attributed to me.

To Stefano: I’m not sure why you think we are against structured data. In fact, we talked quite a bit about extracting information from structured tables. Our point was that if there are hundreds of millions of people generating data of some kind (like html tables), it is probably a better strategy to use that data, even if it is noisy, rather than try to put together a coalition of a few researchers who will generate perfect non-noisy data.

To jeremy: you should read the Hays and Efros paper. It is a very good paper and it is not at all what you describe. Yes, they use millions of photos in total. But to remove, say, a car from a street scene, you don’t need millions of photos of that car, or that street. In fact, you don’t need any photos of that car or street — the system generalizes over other street scenes.


Peter, thank you for the clarification. I’m sorry I didn’t fact-check the quote myself. It’s a reminder that my blogging does not necessarily make me a journalist. But at least it’s easy to revise, and I’ve updated the post to point to your explanation of how you were misquoted.

In any case, I’m flattered that you took the time to swing by! I have to admit that my first reaction to seeing a comment attributed to Google’s Director of Research on April 1 was to prepare for an April Fool’s joke! But I don’t think you wrote this one on Google Autopilot.



Thanks for the feedback. I’ll read that paper in closer detail.

I think I get what you’re saying. To relate the idea back to Google’s spelling corrector, you would say that you don’t need millions of examples of every single word in the dictionary, or every single proper name not in the dictionary. All you need are the n-th order Markov chain probabilities of characters across the vocabulary as a whole, and you can use those probabilities to essentially “fill in” what the most likely correct spelling should have been.
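As a rough illustration of the idea in the paragraph above, here is a minimal sketch of a first-order character Markov chain used to pick the more plausible of two spellings. It is not Google’s spelling corrector: the toy vocabulary, the function names, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import defaultdict

def train_char_bigrams(words):
    """Count character-to-character transitions across the whole vocabulary."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        padded = "^" + word + "$"  # "^" marks word start, "$" marks word end
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts

def log_prob(word, counts, num_symbols=27):
    """Add-one smoothed log-probability of a word under the bigram chain.

    num_symbols is an assumption: 26 lowercase letters plus the end marker.
    """
    padded = "^" + word + "$"
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        row = counts.get(prev, {})
        row_total = sum(row.values())
        total += math.log((row.get(nxt, 0) + 1) / (row_total + num_symbols))
    return total

def most_likely(candidates, counts):
    """Return the candidate spelling the chain considers most probable."""
    return max(candidates, key=lambda w: log_prob(w, counts))

# Hypothetical toy vocabulary; a real system would train on far more text.
VOCAB = ["the", "then", "there", "this", "that", "other", "think"]
MODEL = train_char_bigrams(VOCAB)
BEST = most_likely(["teh", "the"], MODEL)
```

The model never needs to have seen a particular word many times; it only needs transition statistics pooled over the vocabulary as a whole, which is the generalization being described.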

I’m all for that kind of thing, and think it’s great. I’ve done similar work, myself, on “term context modeling” using maxent models. I just question the efficacy of relying on it wholly for most problems. Even for most web-scale problems. Take music recommendation for example, something that Google Music China now needs. There you have web-scale users and millions of songs. But if you rely solely on n-th order joint or conditional probabilities to recommend music, you’ll quickly be recommending nothing but the hits, and the long tail (which in sum contains at least as many, if not more, songs than the head) will be underrepresented, if not completely ignored. So you need to develop non-large-data-dependent algorithms in order to do proper recommendations on large, but fragmented, tail-distributed music.

And so something about all this still doesn’t sit quite right with me. I can’t shake the “rich get richer” feeling, which to me is the antithesis of the types of systems that we should be building to organize the world’s information and serve all the world’s people. Maybe that’s my misunderstanding, though. Do you have any “fact-checking” treatises that dispel that myth?


@Peter: For example, see this blogpost by a colleague of mine:

In particular, scroll down to the “Help, I’m stuck in the head” section, which refers to work done by Oscar Celma.

I’m interested in long-tail information, the stuff that you can’t ever seem to find, no matter how many different ways you try and type your query.. because all of the searching and recommendation mechanics are pushing your back into the short head.

How do non-parametric, big-data trained models help us avoid this problem?


s/”pushing your back”/”pushing you back”/g

Ah, if only blogposts were like sed. 😉


Not sure if Peter will come back, but a lot of web search does seem to be about pushing you back into the short head. To take a non-Google example, look at “Random Walks on the Click Graph” by Microsoft researchers Nick Craswell and Martin Szummer. When I saw Craswell present this work, I came away with the impression that it was pushing long-tail queries back into the head in order to improve retrieval performance.


Yeah, doesn’t look like he’s returning. I think I’m going to write a blogpost on this question. Will let you know when it’s up.


On the “rich get richer”: a wide variety of phenomena follow either a Zipf (1/n) or Pareto (n^k) distribution. Pareto did the original work to explain (or at least describe) why the rich had most of the money. My interpretation of the distribution of money, web page popularity, or song popularity, is that it is inevitable that most of the resources go to a few. Look, the reason that most of the energy in a gas or Bose-Einstein condensate resides in a small percentage of the particles is not because of an evil repressive oligarchy keeping the masses down; it is just statistics that there have to be some winners. So we should expect the same from web pages or songs.

So the question comes down to: given that there will inevitably be this concentration in the head, do search engines and collaborative filtering services help you find things in the tail? I think they do, but you have to ask the right question. If you ask a very general question, you’ll get a very popular answer. If you ask a rare or specific question, you’ll get a result that is a weighted combination of the nearest neighbors, weighted by popularity. So the more data you have, the more oddball nearest neighbors you’ll have, but you have to put the query into a part of the space that is not already populated with very popular entries.

For example: recently I watched the Bill Murray movie “Life Aquatic…” and liked the Brazilian acoustic guitar covers of David Bowie songs. iTunes quickly directed me to Seu Jorge, and from there to neighboring similar artists that I enjoyed. In this case the collaborative filtering recommendations were excellent, but I had to help by giving the system a clue as to where to start. If I just start at the iTunes home page, I get Flo Rida, Black Eyed Peas, Miley Cyrus and Christian Gospel albums for $7.99. Mostly I’m not interested in that, although I was, through Seu Jorge, led to a great song by Black Eyed Peas and Sergio Mendes, who I’m old enough to remember from “Brazil 66”, but I had no idea that he was still alive and recording, let alone relevant today. So I think search engines and collaborative filtering are doing just what they should. No force is going to turn a Pareto distribution into a uniform distribution. But the tools we have let you navigate the tail, as long as you aren’t asking queries that are better answered by the head.
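To make the “nearest neighbors, weighted by popularity” idea concrete, here is a small sketch. It is only an illustration under stated assumptions, not how iTunes or any search engine actually works: the catalog entries, the 2-D “taste space” coordinates, and the play counts are all invented.

```python
import math

# Hypothetical catalog: (name, position in a made-up 2-D "taste space", play count).
CATALOG = [
    ("chart_pop_a", (0.0, 0.0), 1000),
    ("chart_pop_b", (0.1, 0.1), 800),
    ("bossa_covers", (5.0, 5.0), 12),
    ("samba_classics", (5.2, 4.9), 30),
    ("folk_roots", (-4.0, 3.0), 8),
]

def recommend(query, catalog, k=2):
    """Take the k items closest to the query, then order them by popularity."""
    nearest = sorted(catalog, key=lambda item: math.dist(query, item[1]))[:k]
    ranked = sorted(nearest, key=lambda item: item[2], reverse=True)
    return [name for name, _, _ in ranked]

# A general query lands in the crowded, popular part of the space.
GENERAL = recommend((0.05, 0.0), CATALOG)
# A specific query lands in a sparse region, so its neighbors are niche items.
SPECIFIC = recommend((5.0, 5.1), CATALOG)
```

In this toy setup the general query returns only the chart items, while the specific query returns the two niche neighbors, the more popular of the two first. Where the query lands matters more than how popular the head is.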


@Peter: Thanks for the response 🙂

Yes, I absolutely agree with you.. Pareto and Zipf are the way they are because that’s the way the world works, not because of any evil repressive oligarchy. And I think you’re spot on when you say “given that there will inevitably be this concentration in the head, do search engines and collaborative filtering services help you find things in the tail?” That is indeed the question that we need to ask.

But in answer to this question, you write: “I think they do, but you have to ask the right question.”

This is where I start to disagree. This places the burden, the onus, on the user, on being able to ask the right question. How do I ask the right question, if I don’t know the right words to use? What if all I know is the general, broad statement or query formulation? That, imho, is something that the search engine needs to be able to help the user with — helping discover the right vocabulary. Or at least give the user some way of exploring and discovering that vocabulary other than 10 blue-linked URLs. Look at Nick Belkin’s work. He talks a lot about this.

Let me use a more concrete example. It’s a real-world, natural phenomenon that different communities invent some of the same mathematical structures, algorithms, etc. without knowledge of each other. The machine learning community comes up with some idea, and is completely unaware that the same idea already exists in the physics community under a completely different name. Oh, and it also exists in the computer science community. And pure math community. Also with different names in each community. But it’s the same basic idea, every single time.

So if the search engine just helps you follow collaborative linking data into the specific, localized neighborhoods, then you will never be able to find all the information you need from across the different communities. You will never be able to make that jump. It’s well known that cross-discipline citations are rare. That’s why these things have different names to begin with. The data just isn’t there, for you to follow, and it never will be. And that is due to the very same natural processes (not evil oligarchies 🙂 ) that create the Zipfian and Pareto-type distributions in the first place. The tail will always be relatively disconnected from itself.

So my point is that the job of the search engine is to help us discover those connections, to jump out of the data-only pathways. That’s why it’s called search. We can’t just pagerank our way into discovering these connections between ideas in various, non-connected communities. We have to use other techniques that aren’t wholly reliant on global data only. I am not saying that we have to flatten Zipfian distributions into uniformity. But we do our users and ourselves a disservice when we don’t even try, and only provide tools that lock them into local minima.

We have to get better at this. And by “we” needing to get better, that’s really not a thinly veiled reference to Google only — I mean myself as a researcher, too. Though of course I mean Google as well. 😉


[…] Consider that social relationships are sometimes one-way. We often have less time for others than they have for us. This is especially the case for widely trusted experts on particular matters. This is the general asymmetrical social stuff of celebrity, which is surely an archaic notion, inherent in even the simplest of villages. As society scales, moreover, consider the natural—or, potentially, the morally optimal—distribution of those asymmetries of attention. […]


In short, “search” and receiving recommendations not only aren’t exploration, but are often the antithesis of exploration. “Better” often means pushing us into paths of least resistance–and thus least diversity. Perhaps not quite the Google’s = McDonald’s metaphor from my talk, but in the same spirit, no?


In short, “search” and receiving recommendations not only aren’t exploration, but are often the antithesis of exploration.

Let me ask a clarifying question, Daniel: You’re saying that this is where the current state of web recommendation and web search is? You’re being descriptive of the current state of affairs, not descriptive of what search and recommendation are, as general concepts, right?


Absolutely. I embrace exploratory search and social navigation. That’s what I want “search” and “recommendation” to be–user-centered, user-controlled, transparent, guided exploration. And those are in many ways the antithesis of the current state of affairs for search and recommendation.


Jeremy, I think we’re in agreement. The trick is to have a better conversation with the user. If they want the head, a two-word query is sufficient conversation; if they want the tail it will take more. I showed how it can be done if the user takes the initiative; you are right that we, the community, should have better solutions for mixed initiative and for a passive user. You mention some good examples; there are lots of cases of some branch of science re-inventing a solution from another branch. Some have the excuse that there were no search engines way back then, but mostly the problem is matching relationships between objects that are homomorphic but have completely different vocabulary, and that’s something we as a field currently can’t do, but it is a great goal to strive towards.


I.. uh.. err.. mm.. but..

Ok, well, then I think on the main points we do indeed agree. I guess what I am trying to get across, and I don’t know how successfully, is the (observable) feeling that I have that simply adding more words to the query is not enough to push (or at least does not necessarily push) the user out past the head, past the fat belly, into the long tail.

I hear what you’re saying, with your “Life Aquatic”, Brazilian guitar example. I guess I’m just not convinced that the recommendations that you received through that process really were long tail recommendations. Go back up to the link I posted in comment #5 above, to the section I point out in that link. In that, Oscar Celma finds that of the 10k or so musical artists, ~50% of the recommendations went to the ~100 head, another 50% to the few hundred in the belly, and 0% to the actual long tail. Thousands of artists, most artists, had zero recommendations. I think you might be experiencing a “fat belly” recommendation with those Brazilian acoustic guitar artists. Not true long tail.

I have another example that I’d like to go into a bit more detail about, involving search and Google Maps. But instead of getting too much more into it in the comments section here, I’ll write a longer post about it over the next few days.

Again, the short of it is that I’m still not fully convinced that even adding more terms to a query gets the user to the long tail, because of the way Google algorithms are designed.. well any IR algorithm for that matter, not just Google’s. I will write a longer post about that, though, later this week:


For what it’s worth, I never got anything out of iTunes Genius–I actually disabled it even when I still used iTunes.

But I find Pandora to be quite good at picking out obscure songs for me. And it is the most transparent recommendation engine I’ve found. Right now I’m listening to a track selected because it features folk roots, a subtle use of vocal harmony, mild rhythmic syncopation, extensive vamping, and acoustic sonority. OK, that requires a little knowledge of music to grok, but it works for me.


Well, Pandora works because it *doesn’t* rely on pagerank style popularity recommendations. It relies on the actual characteristics or properties of the song itself, on a song by song basis.

This is part of the whole point that I am after. In Pandora’s case, you don’t need millions of songs in order to do better recommendation. You cannot aggregate pagerank, popularity, n-grams, etc. Each song stands alone. Each song has its own intrinsic value.

So it naturally makes sense that Pandora does a better job than iTunes at finding obscure music. Because it’s not recommending the music to you based on popularity. It’s recommending it based on music.
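A minimal sketch of that kind of attribute-based recommendation follows. It is not Pandora’s actual method: the catalog, the track names, and the choice of Jaccard overlap as the similarity measure are assumptions for illustration, with attribute labels borrowed from the comment above. Note that popularity appears nowhere in the scoring.

```python
# Hypothetical catalog mapping each track to a set of musical attributes.
TRACKS = {
    "seed_track": {"folk roots", "vocal harmony", "acoustic sonority", "vamping"},
    "obscure_a": {"folk roots", "vocal harmony", "acoustic sonority"},
    "hit_b": {"electronic beats", "vocal harmony"},
    "obscure_c": {"folk roots", "rhythmic syncopation", "acoustic sonority", "vamping"},
}

def jaccard(a, b):
    """Shared attributes divided by all attributes that appear in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_by_attributes(seed, tracks):
    """Rank every other track by attribute overlap with the seed track."""
    seed_attrs = tracks[seed]
    scored = [
        (name, jaccard(seed_attrs, attrs))
        for name, attrs in tracks.items()
        if name != seed
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

RANKING = recommend_by_attributes("seed_track", TRACKS)
```

Because each track is scored only on its own attributes, an obscure track with the right characteristics outranks a hit that shares little with the seed.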

I wish there were some exploratory/interactive/transparent way of doing the same thing with Google. I wish there were some way of saying, “ok, give me the results to my query, but *don’t* filter them by most popular pagerank results first. Let me sort through or explore them in some other manner.”

I again don’t think that it’s possible to get to certain results just by adding more keywords to the query, any more than you can use iTunes to get those same obscure Pandora recommendations by adding more artist names or whatever to your iTunes query. Big data-based, popularity based methods just don’t work that way. You have to have some way of being able to turn those methods off, and view the data in a more “neutral” manner.

Am I sorta making sense?


Makes sense to me. But, in fairness to Google, Pandora doesn’t have to deal with adversarial data. Google does, and its main defense is to favor more authoritative pages. That bias probably does raise the average quality of results, but at the expense of imposing a regression to the mean.

I think it’s a surmountable problem. Part of the solution could well be that, by de-emphasizing their own ranking in favor of giving more control to users, search engines could diffuse the effectiveness of spamming / bad-faith SEO, and then wouldn’t have to skew their results to combat it.


Great discussion. I’m sorry I discovered it over a year after the fact. (I’m part of the temporal long tail I guess.)

From my understanding, Pandora’s recommendations are similarity-based: “I’m interested in X. Find me things that are similar to X.” The trick then is to get the proper measure of similarity — a problem Pandora appears to have solved for music.

Is there any effort to come up with a similar thing in the general search engine world? E.g., where I could say, “Find me all the web pages that have a high similarity to *this* one.” (You’d probably want to filter out pages that are too similar, given that they’re likely to be copies or mirrors of the original.)
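One way to picture such a query is the sketch below: cosine similarity over word counts, with an upper cutoff to drop near-copies. Everything here is an assumption for illustration (the page texts, the function names, and the 0.3 and 0.95 thresholds), and real page similarity would need far richer features than raw word counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similar_pages(target, pages, low=0.3, high=0.95):
    """Return pages similar to the target, skipping likely copies or mirrors."""
    target_vec = Counter(pages[target].split())
    results = []
    for name, text in pages.items():
        if name == target:
            continue
        score = cosine(target_vec, Counter(text.split()))
        if low <= score < high:  # at or above `high` is treated as a mirror
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical pages, reduced to a few words each.
PAGES = {
    "original": "zipf law long tail search recommendation",
    "mirror": "zipf law long tail search recommendation",
    "related": "long tail search and music recommendation",
    "unrelated": "recipe for tomato soup",
}
MATCHES = similar_pages("original", PAGES)
```

With these toy pages only the related page survives: the mirror is excluded by the upper cutoff and the unrelated page by the lower one.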

