The Noisy Channel

 

Free as in Freebase

August 29th, 2009 · 10 Comments · General

It’s been a while since I’ve blogged about Freebase, the semantic web database maintained by Metaweb. But I recently had the chance to meet Freebasers Robert Cook and Jamie Taylor and hear them present to the New York Semantic Web Meetup on “Content, Identifiers and Freebase” (slides embedded above).

It was a fun and informative presentation. Perhaps the most surprising revelation about Freebase was that all of their data fits in RAM on a 32G box (yes, some of you caught me live-tweeting that during the presentation). Their biggest challenge is collecting good data that lends itself to the reconciliation needed to make Freebase useful as a data repository. Despite the lack of a near-term revenue model, the Freebasers are bullish about their approach: strong identifiers, strong semantics, open data. On the last point, almost all of Freebase is available under the Creative Commons Attribution License (CC-BY)–which, as far as I can tell, makes anyone free to develop a mirror of Freebase. Indeed, many people are using this data, including Google and Bing.

You might wonder whether Freebase is a business or a non-profit foundation–and the question did come up. The answer is that Freebase eventually expects to make money by providing services, e.g., helping advertisers. They see their graph store as a competitive advantage–but they freely admit that this advantage will erode over time. Indeed, the surprisingly small size of their graph makes me wonder how much speed and scalability matter, compared to the challenge of data scarcity.

I’d like to see Freebase succeed. I’m particularly a fan of the work David Huynh has done there on interfaces for semantic web browsing. Clearly their investors are true believers–Metaweb has raised a total of $57M in funding. I don’t quite get it, but I’m happy we can all benefit from the results.

10 responses so far ↓

  • 1 Bob Carpenter // Aug 29, 2009 at 7:14 pm

    I just talked to Jamie and Robert recently, and was very impressed that Freebase was addressing the shared semantic problem through integration. Clearly they’ve already built something useful.

    As to profitability or even revenue, I was wondering the same thing about Google during a visit back in the late 1990s — awesome search, crazy company (funhouse office, food better than most restaurants, programmers complaining about not getting massages quick enough), and no revenue stream in sight (at least to me).

  • 2 ram // Aug 29, 2009 at 7:15 pm

    Thanks, Daniel, for posting a blog entry on Freebase. I hope it gains momentum like Wikipedia and manages to overcome the challenge of collecting semantically enhanced data sets.

  • 3 Daniel Tunkelang // Aug 29, 2009 at 7:21 pm

    Bob, I think it’s worth noting that, had Google not been, um, inspired by Goto.com’s revenue model of auctioning sponsored listings, it’s not at all clear that Google would have succeeded. Remember: Google didn’t introduce AdWords until 2000. It’s not at all clear that predicting Google’s success back in the late 1990s would have been rational. Hindsight is 20/20.

  • 4 Daniel Lemire // Aug 31, 2009 at 9:52 am

    Perhaps the most surprising revelation about Freebase was that all of their data fits in RAM on a 32G box (…). Their biggest challenge is collecting good data

    Just in case one of your readers decides that we no longer need fancy database indexes or good engineers because RAM is so cheap… Know that it will take over 5 seconds to read the content of the memory, sequentially.

    People have been designing RAM-based databases for indexing data such as XML for a while. The performance is not automagically acceptable.

  • 5 Daniel Tunkelang // Aug 31, 2009 at 10:30 am

    I didn’t mean to imply otherwise. But I do think it’s an eye-opener for folks who are used to measuring open web content in petabytes. Trust me, I know from experience that putting data in memory isn’t a silver bullet when you need to do anything interesting with it.

  • 6 Daniel Lemire // Aug 31, 2009 at 10:56 am

    As a basis of comparison for the 32GB figure… According to Jim Gray, you can store everything you read in a year in 25 MB.

    Reference:
    http://www.daniel-lemire.com/blog/archives/2006/10/26/what-is-infinite-storage/

  • 7 Daniel Tunkelang // Aug 31, 2009 at 11:02 am

    I suspect the best comparable is the very modest size of Wikipedia–since Freebase, at least from my perspective, is trying to be a Wikipedia of structured data.

    http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

  • 8 Daniel Lemire // Aug 31, 2009 at 12:00 pm

    The uncompressed size of a Wikipedia dump is over one terabyte. There are solid state drives with that capacity, though they are expensive. However, in *compressed* form, Wikipedia can fit within 4GB and thus could fit in my one-year-old laptop’s RAM.

    Now that Wikipedia has decided to accept video clips, however, expect the size of the Wikipedia dumps to go up significantly in the next few years.

    Reference:
    http://download.wikipedia.org/backup-index.html

  • 9 Daniel Tunkelang // Aug 31, 2009 at 12:04 pm

    In any case, comparison is tricky, since Freebase is a structured repository that isn’t storing large blocks of text–let alone images and video.

  • 10 Relational databases: are they obsolete? // Sep 16, 2009 at 4:25 pm

    [...] as performance is concerned, Stonebraker is obviously right: we are undergoing major changes. As pointed out by Daniel Tunkelang, you can store a lot of data in 32GB of RAM. Solid-state drives can be used to wipe out some IO [...]
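
For anyone who wants to check the arithmetic behind comment 4 (over 5 seconds to read 32GB of memory sequentially), here is a rough back-of-the-envelope sketch. The 6 GB/s sustained memory bandwidth is an assumption for illustration only; nobody in the thread states a bandwidth figure, and real numbers vary a lot by hardware. The function name is likewise hypothetical.

```python
def sequential_scan_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Estimate the time to read `data_gb` gigabytes once, front to back,
    at a sustained bandwidth of `bandwidth_gb_per_s` gigabytes per second."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_gb / bandwidth_gb_per_s


# 32 GB is the Freebase figure quoted in the post.
FREEBASE_DATA_GB = 32.0

# ASSUMPTION: 6 GB/s sustained read bandwidth, chosen only for illustration.
ASSUMED_BANDWIDTH_GB_PER_S = 6.0

if __name__ == "__main__":
    seconds = sequential_scan_seconds(FREEBASE_DATA_GB, ASSUMED_BANDWIDTH_GB_PER_S)
    print(f"Full sequential scan: about {seconds:.1f} s")
```

Under that assumed bandwidth, a single full pass takes a bit over five seconds, which is consistent with the point made in the comments: fitting in RAM does not remove the need for indexes if queries would otherwise scan everything.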
