A couple of weeks ago, I participated in a summit that Greylock Partners organized for its portfolio companies at LinkedIn to discuss the power of data. Invited participants represented some of the most interesting “big data” companies in Silicon Valley, including Google, Facebook, Pandora, Cloudera, and Zynga. Discussion took place under the Chatham House Rule, so I’m not at liberty to share much detail. But I can say that there were energetic conversations about metrics, tools, and (of course) hiring.
Fusion Tables allow the general public to upload, visualize, and share structured data. They are particularly useful for journalists who want to distill compelling stories from data — indeed, The Guardian’s Simon Rogers has used Fusion Tables to visualize and interpret everything from nuclear power plant accidents to Wikileaks.
After his presentation, I asked Alon for his thoughts on why we haven’t seen an encyclopedic structured data repository comparable in scope and scale to Wikipedia. Alon offered that structured data is brittle — its value tends to depend more on context than does the unstructured content that populates Wikipedia. I agree in part — for example, consider this map of Brooklyn bus stops that were slated for elimination last summer. Such data is useful in a narrow context, but hardly encyclopedic.
But what about Freebase and DBpedia? Freebase is an open repository of structured data associated with about 20 million topics. DBpedia describes itself as “a community effort to extract structured information from Wikipedia and to make this information available on the Web.” While these tools have seen some use by developers (especially in the semantic web community), they have not achieved mainstream adoption. Perhaps data marketplaces like Factual and Infochimps will be successful as for-profit businesses, but the question remains why we don’t have a Wikipedia-scale success story for public structured data.
Wikipedia represents a collective attempt to achieve precision at the level of individual entries. Contributors and editors correct mistakes and argue over the details of content and tone. But coverage is a much lower priority. When in doubt, the Wikipedia collective assumes that information is not notable enough to justify inclusion. Thus Wikipedia errs on the side of precision rather than recall when it comes to meeting the information needs of its users.
This arrangement works well for a typical web user who seeks out information by using Google web search as an interface to discover Wikipedia articles. But structured data is about sets, not just individuals. It does me no good to see aggregate statistics about a set of entities if the set is erratically populated (e.g., Wikipedia’s list of companies established in 1999 or Freebase’s list of those founded after 2000).
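To make the concern concrete, here is a minimal sketch of how an aggregate computed over an incompletely populated set drifts from the true value. All company names and figures below are made up purely for illustration:

```python
# Hypothetical illustration: aggregate statistics over an erratically
# populated set can be badly biased. Every name and number is invented.

# The "true" set of companies established in some year, with a numeric
# attribute (say, employee counts) — entirely fabricated data.
true_companies = {
    "AlphaSoft": 120,
    "BetaWorks": 45,
    "GammaNet": 300,
    "DeltaData": 80,
    "EpsilonAI": 15,
}

# Suppose a public repository happened to capture only three of the five.
repository = {name: true_companies[name]
              for name in ["AlphaSoft", "GammaNet", "DeltaData"]}

def mean(values):
    """Arithmetic mean of a collection of numbers."""
    values = list(values)
    return sum(values) / len(values)

true_mean = mean(true_companies.values())   # 560 / 5 = 112.0
observed_mean = mean(repository.values())   # 500 / 3 ≈ 166.7
```

Because the missing entries are not missing at random (small companies are less likely to be recorded), the observed mean overshoots the true mean by roughly 50% — an error no amount of per-entry precision can repair.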
In the June 2009 SIGIR Forum, University of Melbourne researchers Justin Zobel, Alistair Moffat, and Laurence Park argued “against recall”, concluding that they could find “no justification for implicit or explicit use of recall as a measure of search satisfaction.” I posted a rebuttal entitled “In Defense of Recall”, arguing that recall is much more useful as a measure for set retrieval than for ranked retrieval. Revisiting this argument two years later, I can see that it holds even more strongly if we are interested in structured data where we want to reason about aggregate properties of sets.
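As a rough illustration of the set-retrieval distinction, here is a minimal precision/recall sketch; the company identifiers are hypothetical placeholders:

```python
# Minimal sketch of set-retrieval precision and recall, assuming both
# the retrieved set and the relevant (ground-truth) set are known.

def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)  # correctly retrieved items
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 8 relevant companies, 5 retrieved, 4 of them correct.
relevant = {f"co{i}" for i in range(8)}
retrieved = {"co0", "co1", "co2", "co3", "coX"}

p, r = precision_recall(retrieved, relevant)
# precision = 4/5: the entries that exist are mostly accurate
# recall    = 4/8: half the set is simply missing
```

In the Wikipedia-style regime, precision is high while recall lags — harmless for looking up one article, but fatal for reasoning about the set as a whole.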
Back when we both worked at Endeca, my colleague Rob Gonzalez described structured data repositories as a public good that no one is ever willing to pay for. I’m an optimist by nature, but in this case I fear he has a point. It takes a lot of work to build something useful, and no one seems to have solved the challenge of incentivizing people to contribute that work, whether for economic or altruistic motives.
Or perhaps we’ll just have to wait for the holy grail of information extraction algorithms to structure the world’s information for us? Ironically, that’s not even included on Wikipedia’s list of AI-complete problems.