The Noisy Channel

 

Structured Search Is On The Table

October 13th, 2009 · 15 Comments · General

Freebase. Wolfram Alpha. Google Squared. I hesitate to declare a trend, but there does seem to be a growing interest in more structured approaches to information seeking.

The latest entry is Factual, launched today by Gil Elbaz. Elbaz is no slouch: in 1998, he and Adam Weissman co-founded Applied Semantics (originally known as Oingo) and built a word sense disambiguation engine based on WordNet. In 2003, they sold the company to Google for $102M, where it became the basis of their very lucrative AdSense offering.

According to Factual’s website:

Factual is a platform where anyone can share and mash open data on any subject.  For example, you might find a comprehensive directory of restaurants along with dozens of searchable attributes, a huge database of published books, or a list of every video game and their cheat codes.  We provide smart tools to help the community build and maintain a trusted source of structured data.

Factual’s key product, the Factual Table, provides a unique way to view and work with structured data.  Information in Factual Tables comes from the wisdom of the community and from our powerful data mining tools, and the result is rich, dynamic, and transparent data.

You can read more detailed coverage in Search Engine Land, TechCrunch, ReadWriteWeb, GigaOM, and VentureBeat.

To me, Factual sounds like a hybrid between Freebase and Many Eyes. And, like both, it’s free (as in free beer). Free cuts both ways: the Factual site states clearly that “There is currently no way for us to help you monetize these tables.” As with many companies at this stage, the business model is TBD.

I have mixed feelings. I like the increasing interest by startups in structured search. It’s a step in the right direction, since structure is a key enabler for interaction. But we already have one Freebase (and even Google Base), and it’s not clear that we need yet another company to enable crowd-sourced submission of structured data. Perhaps what we need is a way to incent the sort of behavior that has made Wikipedia so successful. As my colleague Rob Gonzalez (who is rumored to have a blog in the works) is always happy to point out, structured data repositories are a public good that no one is ever willing to pay for. The current best hope seems to be the Linked Data initiative, which sounds great in theory–though I think the jury is still out on whether it will succeed in practice.

My ambivalence aside, I am excited that some of the greatest minds in computer science are focused on bringing more structure to the information seeking process. Even if some of these efforts prove to be false starts, we’re going in the right direction. Structured search is on the table.

15 responses so far ↓

  • 1 christopher // Oct 13, 2009 at 10:23 pm

    Hi Daniel,

    Take a look at what the @infochimps are doing. They have a super open data repository and a monetization strategy.

  • 2 Daniel Tunkelang // Oct 13, 2009 at 10:35 pm

    Infochimps is “the world’s largest open platform for data”. Freebase is “an open database of the world’s information”. Factual is “a platform where anyone can share and mash open data on any subject”. Will the real Slim Shady please stand up? :-)

    But point taken–I’ll check it out.

  • 3 Christopher // Oct 13, 2009 at 10:41 pm

    LOL,

    Yes, many ways to say the same thing. I’m following the InfoChimps closely, they and Freebase seem to be a significant step ahead of the others.

  • 4 Laurent // Oct 14, 2009 at 1:40 pm

    At http://www.twillage.com we also do structured search. We try to detect event-related tweets from twitter, extract placenames, dates and then let people search for what’s happening in a specific town.
    e.g. http://www.twillage.com/search?q=concert+san+francisco

  • 5 Daniel Tunkelang // Oct 14, 2009 at 1:58 pm

    Interesting–definitely better precision than http://search.twitter.com/search?q=&ands=concert&near=San+Francisco&within=15&units=mi. Do you implement event detection through human curation or by analyzing the text of tweets?

  • 6 Laurent // Oct 14, 2009 at 2:12 pm

    Automatically, but it’s simply keyword-based. We currently monitor about 100 keywords using Twitter’s streaming API (track), use our own regular expressions to detect dates/times, and then use another service for placename extraction.

  • 7 Daniel Tunkelang // Oct 14, 2009 at 3:22 pm

    Nicely done. Not clear that you can do much more with 140-character tweets.

  • 8 Laurent // Oct 14, 2009 at 3:29 pm

    If the tweet doesn’t contain a date, time, or placename, but does contain a link, we could follow it and try to figure out if the landing page describes an event and get the place, date, and time from there. Maybe you could help ;)

  • 9 Daniel Tunkelang // Oct 14, 2009 at 3:43 pm

    Have you tried experimenting with Open Calais? It’s not perfect, but it’s free if you don’t bang on it too hard.

  • 10 Laurent // Oct 14, 2009 at 8:05 pm

    I’ve tried their online demo at http://viewer.opencalais.com/ but it doesn’t pick up the locations (most often just state level, not city) or dates and times. However, it’s good at identifying people and sometimes main topics (e.g. environment).
    For twillage.com, I’d like to be able to query “conferences in Europe in December” or “happy hours near Palo Alto this friday”.
    What about you?

  • 11 Daniel Tunkelang // Oct 14, 2009 at 8:50 pm

    Laurent, my experience is comparable, and others I’ve talked to say similar things. The way I’ve improved on document-level annotation has been by leveraging corpus analysis, e.g., in the “Supporting Exploratory Search for the ACM Digital Library” work presented at HCIR 2008. We had similar success applying corpus analysis to improve on document-level named entity detection, which, in my experience, offers a similar challenge of low recall.

  • 12 Christopher // Oct 14, 2009 at 10:05 pm

    @Laurent,

    This is an example of where using more syntactic/semantic data as context rather than just co-located n-grams can really improve recall.

    I also agree with Daniel that using corpus analysis to improve on document-level named entity detection is a great tool in the arsenal.

  • 13 The latest search and social news for geeks - Ass hats, Crap hats, SEO Sucks and more | Search Engine Journal // Oct 20, 2009 at 10:20 am

    […] Structured Search Is On The Table – the Noisy Channel […]

  • 14 Pankaj Mehra // Nov 23, 2009 at 8:21 pm

    The query Oracle acquisitions on Google Squared brings back a whopping two answers! Some things are worth paying for so you can get good quality answers. Room for a half-decent paid service here.

  • 15 Daniel Tunkelang // Nov 23, 2009 at 9:41 pm

    Not that Google Squared is fully baked, but I get a lot more than two answers when I query oracle acquisitions.

    BEA Systems, Demantra, Retek, TimesTen, Portal Software, Oblix, LODESTAR Corporation, Stellent, and 360Commerce look right to me. Network computer and Oracle Corporations are false positives–and Sun Microsystems, which actually isn’t an Oracle acquisition yet, triggered the latter. So there’s work to do, for sure. But, as the FAQ says, “Google Squared is still in an experimental phase at this point, so you might encounter some hiccups while using it.”

    There are paid services in this domain, e.g., those listed in Dow Jones’s Taxonomy Warehouse. But it seems that a number of folks are competing in the world of free. That means a lot of price pressure–if you’re going to charge, you really have to knock it out of the park, and hope that free doesn’t catch up on quality.
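As an aside, the keyword-plus-regex pipeline Laurent describes in comments 6 and 8 could be sketched roughly as follows. This is only an illustrative assumption about how such a system might work, not twillage’s actual implementation; the keyword list and patterns are made up for the example.

```python
import re

# Illustrative event keywords one might track via a streaming API.
EVENT_KEYWORDS = {"concert", "conference", "meetup", "happy hour"}

# Hypothetical regexes for times ("8pm", "7:30 am") and coarse dates.
TIME_PATTERN = re.compile(r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b", re.IGNORECASE)
DATE_PATTERN = re.compile(
    r"\b(today|tonight|tomorrow|monday|tuesday|wednesday|thursday|friday|"
    r"saturday|sunday)\b",
    re.IGNORECASE,
)

def extract_event(tweet: str):
    """Return (keyword, date, time) if the tweet looks event-related, else None."""
    lowered = tweet.lower()
    keyword = next((k for k in EVENT_KEYWORDS if k in lowered), None)
    if keyword is None:
        return None
    date = DATE_PATTERN.search(tweet)
    time = TIME_PATTERN.search(tweet)
    return (keyword, date.group(1) if date else None,
            time.group(1) if time else None)

print(extract_event("Concert at the Fillmore tomorrow, doors 8pm!"))
# → ('concert', 'tomorrow', '8pm')
```

A real system would hand placename extraction to a separate service, as Laurent notes, since regexes alone do poorly on locations.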
