
CIKM 2011 Industry Event: John Giannandrea on Freebase – A Rosetta Stone for Entities

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The second speaker in the program was Metaweb co-founder John Giannandrea. Google acquired Metaweb last year and has kept its promise to maintain Freebase as a free and open database for the world (including for rival search engine Bing — though I’m not sure if Bing is still using Freebase). John’s talk was entitled “Freebase – A Rosetta Stone for Entities”. I am thankful to Jeff Dalton for live-blogging a summary.

John started by introducing Freebase as a representation of structured objects corresponding to real-world entities and connected by a directed graph of relationships. In other words, a semantic web. While it isn’t quite web-scale, Freebase is a large and growing knowledge base consisting of 25 million entities and 500 million connections — and doubling annually. The core concept in Freebase is a type, and an entity can have many types. For example, Arnold Schwarzenegger is a politician and an actor. John emphasized the messiness of the real world. For example, most actors are people, but what about the dog who played Lassie? It’s important to support exceptions.
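
To make the data model concrete, here is a toy sketch in Python (my own illustration, not Freebase’s actual data model) of entities that carry multiple types and are linked by directed, labeled edges:

    # Toy illustration: entities carry multiple types, and relationships are
    # directed, labeled edges. The names and types here are made up.
    entities = {
        "arnold_schwarzenegger": {"types": ["person", "politician", "actor"]},
        "lassie_dog": {"types": ["dog", "actor"]},  # the dog who played Lassie
        "the_terminator": {"types": ["film"]},
    }

    edges = [
        ("arnold_schwarzenegger", "starred_in", "the_terminator"),
    ]

    # "actor" does not have to imply "person" -- exceptions are allowed.
    actors = [name for name, e in entities.items() if "actor" in e["types"]]
    print(actors)  # ['arnold_schwarzenegger', 'lassie_dog']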

The main technical challenge for Freebase is reconciliation — that is, determining how similar a set of data is to existing Freebase topics. John pointed out how critical it is for Freebase to avoid duplication of content, since the utility of Freebase depends on unique nodes in its graph corresponding to unique objects in the world. Freebase obtains many of its entities by reconciling large, open-source knowledge bases — including Wikipedia, WordNet, the Library of Congress Authorities, and metadata from the Stanford Library. Freebase uses a variety of tools to implement reconciliation, including Google Refine (formerly known as Freebase Gridworks) and Matchmaker, a tool for gathering human judgments. While reconciliation is a hard technical problem, it is made possible by making inferences across the web of relationships that link entities to one another.
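
As a rough illustration of the matching problem (a toy heuristic of my own, not Freebase’s actual method), a candidate record can be scored against an existing topic by combining name similarity with the overlap between the relationships already known for each:

    from difflib import SequenceMatcher

    def reconcile_score(candidate, topic):
        # Toy heuristic: blend string similarity of names with Jaccard overlap
        # of known relationships. Real reconciliation is far more involved.
        name_sim = SequenceMatcher(None, candidate["name"].lower(),
                                   topic["name"].lower()).ratio()
        c_links, t_links = set(candidate["links"]), set(topic["links"])
        overlap = len(c_links & t_links) / max(1, len(c_links | t_links))
        return 0.6 * name_sim + 0.4 * overlap

    candidate = {"name": "A. Schwarzenegger",
                 "links": ["starred_in:the_terminator", "governor_of:california"]}
    topic = {"name": "Arnold Schwarzenegger",
             "links": ["starred_in:the_terminator", "profession:politician"]}

    # A score well above that of an unrelated pair suggests the same entity.
    print(round(reconcile_score(candidate, topic), 2))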

John then presented Freebase as a Rosetta Stone for entities on the web. Since an entity is simply a collection of keys (one of which is its name), Freebase’s job is to reverse engineer the key-value store that is distributed among the entity’s web references, e.g., the structured databases backing web sites and encoding keys in URL parameters. He noted that Freebase itself is schema-less (it is a graph database), and that even the concept of a type is itself an entity (“Type type is the only type that is an instance of itself”). Google makes Freebase available through an API and the Metaweb Query Language (MQL).
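
To give a flavor of MQL: it is a JSON query-by-example language in which you supply the values you know and leave blanks for the values you want filled in. Here is a minimal sketch in Python; the mqlread endpoint and the /people/person property names are as I recall them from the documentation, so treat them as illustrative rather than guaranteed:

    import json
    import urllib.parse
    import urllib.request

    # Query-by-example: we know the type and name, and ask MQL to fill in
    # the professions (e.g., actor and politician for Schwarzenegger).
    query = [{
        "type": "/people/person",
        "name": "Arnold Schwarzenegger",
        "profession": [],
    }]

    # The endpoint and property names are from memory; check the API docs.
    url = ("https://www.googleapis.com/freebase/v1/mqlread?" +
           urllib.parse.urlencode({"query": json.dumps(query)}))

    with urllib.request.urlopen(url) as response:
        print(json.load(response)["result"])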

Freebase does have its challenges. The requirement to keep out duplicates is an onerous one, as they discovered when importing a portion of the Open Library catalog. Maintaining quality calls for significant manual curation, and quality varies across the knowledge base. John asserted that Freebase provides 99% accuracy at the 95th percentile, though it’s not clear to me what that means (update: see Bill’s comment below).

While I still have concerns about Freebase’s robustness as a structured knowledge base (see my post on “In Search Of Structure”), I’m excited to see Google investing in structured representations of knowledge. To hear more about Google’s efforts in this space, check out the Strata New York panel I moderated on Entities, Relationships, and Semantics — the panelists included Andrew Hogue, who leads Google’s structured data and information extraction group and managed me during my year at Google New York.

By Daniel Tunkelang

High-Class Consultant.

13 replies on “CIKM 2011 Industry Event: John Giannandrea on Freebase – A Rosetta Stone for Entities”

Hey Daniel,

This blog post is an interesting continuation of our discussion about the process we (Bee) currently call Blosm. It is indeed very helpful, thanks. Freebase’s “Reconciling” is an important piece of technology. We call that process “suturing” within Blosm. If you don’t mind, I’d like to use Freebase as another data point to compare and contrast with our process. The main difference stems from us building special-purpose knowledge-bases instead of one central DB.

We build a dedicated Freebase-like system (“knowledge-base”) tailored for the client’s topics, be they products for an e-commerce site or companies for a hedge fund. Features like Google Refine’s “reconcile to Freebase” put Freebase at the center of the universe, and thus limit the utility to entities that are already in Freebase. Instead we offer “reconcile to everything on the web (and usually lots of data feeds)”.

This flips the model, putting the client/seed data at the center. The technology necessary to quickly/affordably build a knowledge-base, using many more sources that are each much less manicured, is quite different. The resulting data is also very different, being much more special purpose in both content and structure.

Thanks,

-Carl

Daniel – Thanks for the summary. Your reference to “an open library catalog” should actually be “(a portion of) the Open Library catalog” (i.e., Internet Archive’s openlibrary.org).

Carl – Google Refine will reconcile against anything you want, including the OpenCorporates database, any SPARQL endpoint, etc. (Yes, I know your point was really just to pitch your product, but I didn’t want the wrong impression left about Refine.)

Hi Tom,

I’d love to see how to get Google Refine to reconcile to arbitrary data sources. I apologize if I misrepresented that functionality. I use Google Refine myself, and like it, but I don’t see how to get it to join to data sources that were not specifically designed to do so (i.e., that support SPARQL or a “standard source” interface).

I do think being able to do so is something important and distinct. I look forward to finding out how to do so. I’ll dig around some more.

Thanks,

-Carl

Here are some links to reconciliation services which work with Google Refine:

OpenCorporates – 31 million corporate entities http://opencorporates.com

DERI Galway RDF extension – any SPARQL endpoint or RDF dump http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/

VIVO national scientific collaboration platform – http://sourceforge.net/apps/mediawiki/vivo/index.php?title=Extending_Google_Refine_for_VIVO

Talis Kasabi – any database published on the Kasabi platform http://kasabi.com/doc/api/reconciliation

To write a reconciliation service of your own, start here http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi
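
For a sense of what that involves: a reconciliation service is essentially an HTTP endpoint that accepts a batch of queries as JSON and returns ranked candidate matches for each. A rough Python sketch of the request/response handling, based on that wiki page (lookup_candidates is a placeholder for your own search, and the exact field names should be checked against the spec):

    import json

    def lookup_candidates(name):
        # Placeholder for a search over your own data; returns scored candidates.
        return [{"id": "/example/" + name.lower().replace(" ", "_"),
                 "name": name, "types": ["/example/thing"], "score": 0.99}]

    def reconcile(queries_json):
        # Refine sends a "queries" parameter: a JSON object mapping query ids
        # to objects like {"query": <name>, ...}. Return candidates per id.
        queries = json.loads(queries_json)
        results = {}
        for qid, q in queries.items():
            results[qid] = {"result": [
                {"id": c["id"],               # identifier in your knowledge base
                 "name": c["name"],
                 "type": c["types"],
                 "score": c["score"],
                 "match": c["score"] > 0.95}  # the auto-match threshold is your call
                for c in lookup_candidates(q["query"])
            ]}
        return json.dumps(results)

    print(reconcile(json.dumps({"q0": {"query": "OpenCorporates"}})))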

Daniel, when the guys at Metaweb explained the “99% accuracy at the 95th percentile” concept to me (before they were bought by Google), they said it meant that for 95 of every 100 reconciliations between entities, they are 99% accurate (only 1% false positives or false negatives).

Daniel – Freebase’s coverage varies greatly from one domain to the next, but I’m curious what population you’d use as your benchmark(s) to measure recall against. Obviously there’s no corpus of all human knowledge, or even of all films, books, or people.

Subjectively, coverage is great for small, well-known domains like countries and U.S. politicians, then drops to very good for films (probably even stronger than IMDB, particularly for non-English or older films), and continues on down to passable for books (much better than Wikipedia, but not as good as WorldCat), and so on. Sprinkled throughout are high-value nuggets where someone has taken an interest in curation, but the real power is the breadth and connectivity.

I’d explain the accuracy measure a little differently than Bill did. My understanding is that the 95% is the confidence level used to select the sample size. Given the total population of reconciled items, a large enough random sample is selected to give you 95% confidence in your testing. That sample is then 100% human-verified, with multiple verifiers voting per item. If more than 1% of the sampled items fail human verification, the entire batch is rejected and sent back for more work.
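
To illustrate the arithmetic behind that kind of check (a back-of-the-envelope sketch using the standard sample-size formula for a proportion, with made-up numbers, not Metaweb’s actual tooling):

    import math

    def sample_size(population, z=1.96, margin=0.01, p=0.5):
        # Standard sample size for estimating a proportion, with a finite
        # population correction; z=1.96 corresponds to roughly 95% confidence
        # and margin=0.01 to a 1% error tolerance. Illustration only.
        n0 = (z ** 2) * p * (1 - p) / margin ** 2
        return math.ceil(n0 / (1 + (n0 - 1) / population))

    def batch_passes(num_failed, num_sampled, threshold=0.01):
        # Reject the batch if more than 1% of the verified sample fails.
        return num_failed / num_sampled <= threshold

    n = sample_size(1_000_000)                            # about 9,500 items to verify
    print(n, batch_passes(num_failed=80, num_sampled=n))  # passes: under the 1% bar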

Tom, I understand that there’s no perfect way to measure recall. What I’d hope is that, for some of the sets identified by Freebase (e.g., http://www.freebase.com/view/base/dubai/views/companies_with_headquarters_in_dubai), there is a way to generate a set with very high recall (as close to 100% as possible) and low but not infinitesimal precision (e.g., at least 1%). Then we could pick a subset from that set, filter it manually, and see how much of the filtered subset is covered by Freebase. That would give us a reasonable sense of recall.
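
In code, the estimate I have in mind looks roughly like this (a sketch with placeholder callables for the manual filtering and the Freebase lookup):

    import random

    def estimated_recall(candidates, manually_verify, in_freebase, sample_size=500):
        # candidates: a high-recall, low-precision list of potential members.
        # manually_verify(item): True if a human confirms the item belongs.
        # in_freebase(item): True if Freebase already covers the entity.
        # Recall is estimated as the covered fraction of the verified sample.
        sample = random.sample(candidates, min(sample_size, len(candidates)))
        true_members = [x for x in sample if manually_verify(x)]
        covered = [x for x in true_members if in_freebase(x)]
        return len(covered) / len(true_members) if true_members else float("nan")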

Comments are closed.