
The Twouble with Twitter Search

There has been a flurry of reports about Twitter search–whether about Twitter’s plans to improve their search functionality or about alternative ways to search Twitter. But Danny Sullivan makes a great point in a recent post about Google:

Ironically, Google gets a taste of its own medicine with Twitter. It still can’t access the “firehose” of information that Twitter has, in order to build a decent real-time search service. If it can’t strike a deal, expect to hear the company start pushing on how “real-time data” should be open.

Of course, that logic applies not only to Google, but also to anyone with aspirations to build a better mousetrap for Twitter search. As things stand, applications can’t do much better than post-processing native Twitter search results–which makes it hard to offer any noticeable improvement on them. If Twitter offered full Boolean set retrieval (e.g., if a search for star trek returned the set of all tweets containing both words), then applications could implement lots of interesting algorithms and interfaces on top of their API. I’d love to work on exploratory search applications myself! But the trickle that Twitter returns is hardly enough.
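To make concrete what “Boolean set retrieval” means here: the core operation is just intersecting posting lists from an inverted index. This is a minimal in-memory sketch of the idea (entirely hypothetical; Twitter’s API offers nothing like this):

```python
# Hypothetical sketch of Boolean AND retrieval over a small tweet corpus.
# A real system would use a disk-backed inverted index, but the logic is the same.
from collections import defaultdict


def build_index(tweets):
    """Map each lowercased term to the set of tweet ids containing it."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for term in text.lower().split():
            index[term].add(tweet_id)
    return index


def boolean_and(index, terms):
    """Return the ids of all tweets containing every query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


tweets = {
    1: "Star Trek premiere tonight",
    2: "I love star charts",
    3: "trek through the star field",
}
index = build_index(tweets)
print(sorted(boolean_and(index, ["star", "trek"])))  # [1, 3]
```

A query for star trek returns the full set of matching tweets, not just the most recent few, which is what would let third-party applications rank, cluster, or visualize the results themselves.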

I believe this limitation is by design–that Twitter knows the value of such access and isn’t about to give it away. I just hope Twitter will figure out a way to provide this access for a price, and that an ecology of information access providers develops around it. Of course, if Google or Microsoft buys Twitter first, that probably won’t happen.

By Daniel Tunkelang

High-Class Consultant.

10 replies on “The Twouble with Twitter Search”

Hannes, I’m aware of the operators–you might notice that I use some of them to generate my Twitter feed (bottom right). The problem I’m trying to highlight is not the lack of operators. Rather, it is that Twitter only returns the most recent results.

That’s extremely limiting for popular queries (presumably the main fodder for “real-time” search), e.g., as I’m writing, all of the results for http://search.twitter.com/search?q=star+trek are from less than a minute ago and hardly give me a coherent, holistic picture. It’s also limiting for queries that would show an interesting arc over time, e.g., http://search.twitter.com/search?q=google+books+settlement.

I could see cost and business model reasons for Twitter not to give this access away. I just hope they find a way to make this access available, under terms and conditions that lead people to make use of it.


I believe this limitation is by design–that Twitter knows the value of such access and isn’t about to give it away.

Simpler explanation: they’re too busy to implement new features!

Have you tried using max date restrictions? You can go back further though it’s still limited.

They even have a blog post saying they wish they could serve up more, but are limited by available hardware and are working on getting more. There’s no business conspiracy. Just a little company with too much to do.
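The “max date restriction” idea amounts to capping the search by date and paging backward from there. A rough sketch of constructing such a query, assuming the search endpoint accepts an `until` date parameter (an assumption; the exact parameter name and behavior may differ):

```python
# Hypothetical sketch: building a search URL with a date cap, so that
# results can be paged back past the most recent tweets. The `until`
# parameter is an assumption about the endpoint, not a documented fact.
from urllib.parse import urlencode


def search_url(query, until=None):
    """Build a search URL, optionally capped at a maximum date."""
    params = {"q": query}
    if until:
        params["until"] = until  # e.g. "2009-04-01"
    return "http://search.twitter.com/search.json?" + urlencode(params)


print(search_url("star trek", until="2009-04-01"))
# http://search.twitter.com/search.json?q=star+trek&until=2009-04-01
```

Repeating the query with successively earlier caps would walk backward through the archive, though, as noted above, the service still limits how far back it will go.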


Brendan, conspiracy is a strong word. But I’m sure that a number of folks, including Google, would happily fund the availability of the firehose with both money and labor if Twitter gave them the chance–and I’m equally sure that Twitter knows this.


Twitter had a decent way for others to get access to the real-time information, but that method didn’t scale well. A firehose is just what it sounds like: we’re talking a million or more tweets per day now, and we can expect 10 million tweets a day in a year or two. Jabber could handle it, but they could also post the messages to a consumer’s server 100, 1,000, or 10,000 messages at a time. Real-time delivery matters to within a minute or so, but people notice missing tweets, so the top priority should be completeness.
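The batched-delivery idea above is simple to sketch: rather than pushing each tweet individually over a Jabber/XMPP connection, the firehose would group tweets into fixed-size batches before posting them to a consumer’s server. A minimal illustration of the batching step (hypothetical; not how Twitter actually delivered the firehose):

```python
# Sketch of fixed-size batching over a tweet stream. Every tweet ends up
# in exactly one batch, which serves the completeness goal: a dropped
# batch is easy to detect and re-request, unlike a silently dropped tweet.
def batches(stream, size):
    """Yield lists of up to `size` items from an iterable stream."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


tweets = (f"tweet-{i}" for i in range(2500))
print([len(b) for b in batches(tweets, 1000)])  # [1000, 1000, 500]
```

At 10 million tweets a day, batches of 10,000 would mean roughly 1,000 deliveries per day per consumer, well within a minute or so of latency per batch.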


I think that full access to the Twitter archives (e.g., tweets that are at least a day old) would be valuable in its own right, though I realize that what excites people is the “real-time” aspect of Twitter.

Are you saying that the scaling challenge is because of the desire for sub-minute latency or because of the challenge of servicing a large number of consuming applications?


Comments are closed.