The Noisy Channel

 

In Search Of Structure

May 15th, 2011 · 13 Comments · General

A couple of weeks ago, I participated in a summit that Greylock Partners organized for its portfolio companies at LinkedIn to discuss the power of data. Invited participants represented some of the most interesting “big data” companies in Silicon Valley, including Google, Facebook, Pandora, Cloudera, and Zynga. Discussion took place under the Chatham House Rule, so I’m not at liberty to share much detail. But I can say that there were energetic conversations about metrics, tools, and (of course) hiring.

One of the participants was Google researcher Alon Halevy, who generously shared his presentation on Fusion Tables with me with permission to re-share it here.

Fusion Tables allow the general public to upload, visualize, and share structured data. They are particularly useful for journalists who want to distill compelling stories from data — indeed, The Guardian’s Simon Rogers has used Fusion Tables to visualize and interpret everything from nuclear power plant accidents to Wikileaks.

After his presentation, I asked Alon why we haven’t seen an encyclopedic structured data repository comparable in scope and scale to Wikipedia. Alon offered that structured data is brittle — its value tends to depend more on context than does the unstructured content that populates Wikipedia. I agree in part — for example, consider this map of Brooklyn bus stops that were slated for elimination last summer. Such data is useful in a narrow context, but hardly encyclopedic.

But what about Freebase and DBpedia? Freebase is an open repository of structured data associated with about 20 million topics. DBpedia describes itself as “a community effort to extract structured information from Wikipedia and to make this information available on the Web.” While these tools have seen some use by developers (especially in the semantic web community), they have not achieved mainstream adoption. Perhaps data marketplaces like Factual and Infochimps will be successful as for-profit businesses, but the question remains why we don’t have a Wikipedia-scale success story for public structured data.

I think the problem is easiest to frame in information retrieval terms. Wikipedia is all about precision, but not so much about recall. Let me elaborate.

Wikipedia represents a collective attempt to achieve precision at the level of individual entries. Contributors and editors correct mistakes and argue over the details of content and tone. But coverage is a much lower priority. When in doubt, the Wikipedia collective assumes that information is not notable enough to justify inclusion. Thus Wikipedia errs on the side of precision rather than recall when it comes to meeting the information needs of its users.

This arrangement works well for a typical web user who seeks out information by using Google web search as an interface to discover Wikipedia articles. But structured data is about sets, not just individuals. It does me no good to see aggregate statistics about a set of entities if the set is erratically populated (e.g., Wikipedia’s list of companies established in 1999 or Freebase’s list of those founded after 2000).
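To make the distinction concrete, here is a minimal Python sketch of set-based precision and recall, along with what happens to an aggregate statistic when a set is accurate but incomplete. Everything in it is an assumption for illustration: the company names, the employee counts, and the helper names `precision_recall` and `mean_size` are made up, and the numbers are not drawn from Wikipedia or Freebase.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against the true set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: every company founded in some year,
# mapped to a made-up employee count.
TRUE_COMPANIES = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 400}

# Hypothetical repository: only the better-known companies were entered.
# Every entry is correct, but most of the set is missing.
LISTED = {"D", "E"}


def mean_size(names):
    """Average employee count over the given company names."""
    return sum(TRUE_COMPANIES[n] for n in names) / len(names)


precision, recall = precision_recall(LISTED, TRUE_COMPANIES)
listed_mean = mean_size(LISTED)
true_mean = mean_size(TRUE_COMPANIES)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"mean size over listed set: {listed_mean:.1f}")
print(f"mean size over full set:   {true_mean:.1f}")
```

In this toy case every listed entry is right, so precision is perfect, yet the average computed over the listed set is more than double the true average because the omissions are not random. That is the sense in which an erratically populated set undermines aggregate statistics even when each individual entry is correct.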

In the June 2009 SIGIR Forum, University of Melbourne researchers Justin Zobel, Alistair Moffat, and Laurence Park argued “against recall”, concluding that they could find “no justification for implicit or explicit use of recall as a measure of search satisfaction.” I posted a rebuttal entitled “In Defense of Recall”, arguing that recall is much more useful as a measure for set retrieval than for ranked retrieval. Revisiting this argument two years later, I can see that it holds even more strongly if we are interested in structured data where we want to reason about aggregate properties of sets.

Back when we both worked at Endeca, my colleague Rob Gonzalez described structured data repositories as a public good that no one is ever willing to pay for. I’m an optimist by nature, but in this case I fear he has a point. It takes a lot of work to build something useful, and no one seems to have addressed the challenge of incenting people to contribute this work for either economic or altruistic motives.

Or perhaps we’ll just have to wait for the holy grail of information extraction algorithms to structure the world’s information for us? Ironically, that’s not even included on Wikipedia’s list of AI-complete problems.

13 responses so far ↓

  • 1 Terry Jones // May 15, 2011 at 1:02 pm

    Hi Daniel! You forgot Fluidinfo :-)

    There are many similarities between Wikipedia and Fluidinfo, like having an object for everything and allowing anyone (any app) to add any data to any object.

    3 differences you need if you want to build a wikipedia for data are 1) A permissions system (that prevents apps from overwriting each other’s data), 2) Typed data, 3) a query language. If you added those to Wikipedia, you’d essentially have Fluidinfo.

    Terry

  • 2 Daniel Tunkelang // May 15, 2011 at 1:08 pm

    Terry, no slight intended! And I like Fluidinfo’s model. But that only reinforces my belief that it will take more than a better architecture to create a Wikipedia for structured data that achieves comparable scale, scope, and adoption.

    Ian Soboroff asks on Twitter whether it’s because writing Wikipedia articles is fun, but typing in structured data is a data entry job. Perhaps. But uploading data sets could be fun too. Is it just a matter of creating a better, funner interface?

  • 3 Dinesh Vadhia // May 15, 2011 at 1:55 pm

    Noticed that Halevy’s presentation includes a table of the Presidents of the US and so does Wikipedia at http://en.wikipedia.org/wiki/Presidents_of_the_United_States – a mixture of unstructured and structured data (semi-structured data anyone?).

    Rather than a separate Wikipedia for structured data, maybe what is needed is a method to include and/or link to structured data.

  • 4 John McCubbin // May 16, 2011 at 3:06 am

    Perhaps the US Government IT spending Dashboard http://it.usaspending.gov/ is an example of how this might develop. Public accountability could be a key driver for a Wikipedia for “Big Data”.

  • 5 Daniel Tunkelang // May 16, 2011 at 7:55 am

    Dinesh: I agree — the ideal here is to support semistructured data, not just structured or unstructured. Indeed, I’d be thrilled if Wikipedia supported explicit structure — it’s wacky that everyone is working to reverse engineer this structure from infoboxes, NLP, etc.

    John: I admire your optimism. But the recent defunding of Data.gov makes me less optimistic about the US government driving its own data transparency, let alone setting an example for everyone else.

  • 6 jeremy // May 16, 2011 at 11:49 am

    In the June 2009 SIGIR Forum, University of Melbourne researchers Justin Zobel, Alistair Moffat, and Laurence Park argued “against recall”, concluding that they could find “no justification for implicit or explicit use of recall as a measure of search satisfaction.” I posted a rebuttal entitled “In Defense of Recall”, arguing that recall is much more useful as a measure for set retrieval than for ranked retrieval. Revisiting this argument two years later, I can see that it holds even more strongly if we are interested in structured data where we want to reason about aggregate properties of sets.

    There is yet another area in which recall-oriented search is important: Music playlisting.

    The IR system’s goal is to search an entire collection of music and put together a sub-collection of songs that play well together. Perhaps aggregated around a particular seed song.

    Precision isn’t important. Recall is important. The properties of the set as a whole are important. The fluidity, cohesion, and overall interestingness of the set are important.

    See: http://musicmachinery.com/2011/05/14/how-good-is-googles-instant-mix/

    Now, I would still argue with you that the ranking is important, that you still want to treat a music playlist like an ordered set, rather than like an unordered set. It really matters to the overall flow and experience which songs follow which others. But at the end of the day, your information need is only satisfied once you’ve heard five dozen songs (recall), not one song (precision).

  • 7 Dinesh Vadhia // May 16, 2011 at 2:05 pm

    @daniel
    Good to see your continued focus on the importance of ‘recall’ (e.g., this and previous posts, the HCIR Challenge). The dominance of web search, which is precision-rich and recall-poor, has relegated the value of recall to … well, the bottom!

  • 8 Terry Jones // May 17, 2011 at 5:51 am

    Hi Daniel – no offence taken! :-)

    I like your question and I’ve spent time thinking about it too. But things are too busy for me to be able to type up some thoughts. Right now I need an army of clones….

    Hope to see you around sometime soon. Going to Gluecon?

  • 9 Daniel Tunkelang // May 17, 2011 at 8:05 am

    Jeremy: Agreed on all points. For me the even more interesting aspect of playlisting is getting the interaction right. Even though my interaction with Pandora is minimal (thumbs up, thumbs down, skip — with the occasional adding of an artist as a seed), it has become my sole way of listening to and discovering music.

    Dinesh: Thanks!

    Terry: Keep me posted on the cloning project. Unfortunately can’t send myself or a clone to Gluecon. Have fun there next week!

  • 10 MarkH // May 18, 2011 at 6:23 am

    Enforcing a structure on any data store narrows the variety of content that can be represented in it, limiting it to a particular domain and also making it difficult to adapt the structure.
    Unstructured info places no such limitations and invites varied contributions from unskilled users with little prior organisation. These feel like good reasons for the success of the relatively unstructured Wikipedia vs anything else structured attempting the same breadth of subject matter.
    Successful structured repositories built from public contributions (e.g. delicious, music recommendations) seem to share these characteristics:
    * Single problem domain with stable data structure
    * Low-friction data capture (e.g. click to “like”)
    * A pay-back for user contributions (free bookmarking/recommendations)

  • 11 Jeff D // May 18, 2011 at 7:44 pm

    Outside of government circles and ecommerce, there is an issue of lack of incentive to publish your structured data.

    Structured data is mystifying to end users and it takes a lot of work to build compelling applications around it.

    I have yet to see a good system for letting a user create structured data, say from an interesting webpage.

    The semantic web community faces this problem, but there has been some progress in the Linked Data community.

    http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets

    One of the most exciting things going on is the Facebook Open Graph API. If only there were an open alternative…

    I think the greatest potential for wider adoption and publication lies in search engines using microformat/RDFa data, like Google RecipeView. It gets companies to put the data out there.

    Much of the really valuable structured data, people tend to keep to themselves. There’s a bunch at LinkedIn!

  • 12 Daniel Tunkelang // May 18, 2011 at 8:43 pm

    I’m not sure there is any more incentive to publish unstructured data on Wikipedia than structured data. But I agree that it poses a higher barrier to users to create it, even with tools like Fusion Tables.

    And yes, there are a bunch of folks with valuable private repositories of structured data. But encyclopedias were once private too — Wikipedia is a very recent phenomenon. So I at least hold out hope that we’ll see more public repositories of structured data in the future.

  • 13 CIKM 2011 Industry Event: John Giannandrea on Freebase – A Rosetta Stone for Entities // Nov 15, 2011 at 12:27 am

    […] concerns about Freebase’s robustness as a structured knowledge base (see my post on “In Search Of Structure“), I’m excited to see Google investing in structured representations of knowledge. To […]
