Categories
General

Twitter is Not a Search Engine

Now that Michael Arrington is back from his time off as a blogger, he’s in full force, proclaiming that “It’s Time To Start Thinking Of Twitter As A Search Engine”. There isn’t much to distill from his post, other than that Twitter accumulates lots of social content and “all of it is discoverable at search.twitter.com”.

There are at least two problems with this glib analysis.

First, while it’s true that lots of information gets into Twitter, it’s not clear how much of this information is valuable and how much of it is unique.  Is BackType a search engine? What about Dogpile SearchSpy? Just because you accumulate information socially in “real time” and offer up some sliver of analysis on it doesn’t make you a search engine. There’s got to be a notion of fulfilling information needs. (Update: In fairness, Twitter does help fulfill information needs–though I still maintain that it isn’t a search engine. See discussion at Paul Ogilvie’s blog.)

Second, the search that search.twitter.com supports is minimal–reverse date ordering of Boolean queries. No relevance ranking, user-specified sorting, query refinement, etc. I talked about it in a previous post. If Twitter wants to be taken seriously as a search engine (and I’m not sure that they do), then they need to up their game in the search functionality they offer to users. Right now, their search functionality is textbook–and we’re talking 1970s textbook if not earlier. Not that there’s anything wrong with that–it shows me that they don’t see search as their primary offering.

I think a search engine–especially an exploratory search engine–on Twitter’s content would be fantastic! But that’s not what Twitter offers today, and I think it’s stealing a few bases to anoint Twitter a search engine just because it sounds like a nice idea.

Let’s accept and appreciate Twitter for what it is: a social network for conversation. And let’s hope that they or others build rich search functionality on top of the content it is encouraging its users to produce.

Categories
Uncategorized

If Anyone Wants a Likaholix Invite…

I just joined Likaholix to see what all the buzz was about. According to their About page:

Likaholix is a fun and easy way to share and discuss your likes and discover new ones with people you know. You can like anything from a great book you have read to your favorite food to some art work that you love.

We have found that recommendations from friends, whose tastes, you trust are usually much better than most reviews on the web. Most people, when they are out on social occasions with friends, find themselves exchanging notes and discussing things that they like. We hope to bring the same experience online with Likaholix. Likaholix serves as both a self-expression and a recommendation tool. We provide personalized recommendations based on the people, topics and items you like.

It’s a nice idea, but I have to say I’m underwhelmed by the experience. Still, I’d be more than happy to share my 10 invites: first come, first served. I believe that every new user gets 10 invites, so clearly they’re hoping for exponential growth through viral marketing. Hey, can’t hurt to try.

Categories
General

Check Out TunkRank.com!

A couple of months ago, I put out a challenge to implement an influence measure for Twitter that acquired the personally gratifying (if unmellifluous) name TunkRank.

Well, Jason Adams was up to the challenge and has posted his implementation at http://tunkrank.com/. Check it out! I understand that he’s still ironing out bugs and perhaps even implementing features. He’s even set up a Twitter user for the project: http://twitter.com/tunkrank.

To others considering their own implementations: please don’t be discouraged! I believe this is a ripe area for exploration, and hopefully one that lends itself to fun hackery. Perhaps even an open source project.

Still, Jason deserves the glory for getting there first. Let’s make it worth his while by stress-testing his site and suggesting ways to improve it!

Categories
Uncategorized

The Taxonomy Folksonomy Cookbook

Check out Daniela Barbosa’s beautiful (and free!) ebook, Taxonomy Folksonomy Cookbook! She introduces the subjects of taxonomies and folksonomies, and encourages you to use both in your recipes. It does include a pitch for Dow Jones, Daniela’s employer, but it’s low key. Regular readers know that I’m no fan of advertorials, and I’ll vouch for the legitimacy of the content.

Via Ron Miller.

Categories
General

Twitter’s “Real-Time Search” Ain’t That Hard

Google CEO Eric Schmidt’s dismissal of Twitter as a poor man’s email was petty and comical, but no more amusing than the blogosphere’s obsession with the wonders of “real-time search”. The dominant narrative in the echo chamber seems to be that Google is in danger of being usurped because it is writing off this segment of the information seeking market.

I actually see some merit to this narrative: search engines in general, and Google in particular, could do a lot to improve their alerting tools. But I want to get something straight: from a technical perspective, real-time search, at least as Twitter implements it, is not that hard.

Let me try to explain this in terms that hopefully do not require technical background.

Twitter’s search interface offers a simple search box. If users do not use any operators, then the results for search are those tweets containing all of the words the user enters. In logical terms, it is as if the terms were combined with a logical AND (e.g., information seeking). In fact, Twitter supports a few Boolean logic operators, so that it is possible to combine terms with OR (e.g., tunkelang OR dtunkelang) and the minus sign (-) for negation (e.g., dtunkelang -published). Twitter also supports quotes as a way of requiring two or more words to occur as a phrase (e.g., “noisy channel”). Finally, Twitter supports some other filtering on its advanced search page.
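
To make those matching rules concrete, here is a minimal sketch in Python. It is not Twitter's implementation: the `Query` class, its field names, and the tokenizer are all assumptions made for illustration, and the query is given already parsed rather than in Twitter's own syntax.

```python
import re
from dataclasses import dataclass, field


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


@dataclass
class Query:
    """A hypothetical, already-parsed query."""
    required: list = field(default_factory=list)  # implicit AND: all must occur
    any_of: list = field(default_factory=list)    # OR: at least one must occur
    excluded: list = field(default_factory=list)  # minus sign: none may occur
    phrases: list = field(default_factory=list)   # quotes: words must be adjacent


def contains_phrase(tokens, phrase):
    """True if the words of `phrase` occur consecutively in `tokens`."""
    words = tokenize(phrase)
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def matches(query, text):
    """Apply the AND / OR / negation / phrase rules to a single tweet."""
    tokens = tokenize(text)
    vocab = set(tokens)
    if not all(term.lower() in vocab for term in query.required):
        return False
    if query.any_of and not any(term.lower() in vocab for term in query.any_of):
        return False
    if any(term.lower() in vocab for term in query.excluded):
        return False
    return all(contains_phrase(tokens, phrase) for phrase in query.phrases)
```

For example, `matches(Query(required=["dtunkelang"], excluded=["published"]), tweet)` plays the role of the query dtunkelang -published.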

But what is important is that Twitter only supports one sort, reverse order by date. This vastly simplifies the requirements for Twitter’s inverted index. For those of you unfamiliar with an inverted index, it is much like an index at the back of a book (remember those relics? don’t forget to buy mine!) that associates each word with a list of the documents (in this case, tweets) in which it occurs.

Since Twitter users can only see search results sorted by date, the inverted index presumably maintains its lists in date order. Doing so makes it trivial to add new content, since all additions are at the end of the lists. Moreover, as the index grows, there’s a natural way to partition it into smaller chunks: time-slicing. The problem is, as computer scientists say, embarrassingly parallel.
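
Here is a minimal sketch of such an index, again in Python. I am guessing at the design (hence "presumably" above), so treat `TweetIndex` and `search_slices` as hypothetical names and the whole thing as an illustration rather than a description of Twitter's system. It handles only the implicit-AND case.

```python
from collections import defaultdict


class TweetIndex:
    """Toy inverted index whose posting lists stay in arrival (date) order."""

    def __init__(self):
        self.tweets = []                   # tweet id == position == arrival order
        self.postings = defaultdict(list)  # word -> tweet ids, oldest first

    def add(self, text):
        """Index a new tweet; every posting list only ever grows at its end."""
        tweet_id = len(self.tweets)
        self.tweets.append(text)
        for word in set(text.lower().split()):
            self.postings[word].append(tweet_id)
        return tweet_id

    def search(self, query):
        """Return tweets containing all query words, newest first."""
        words = query.lower().split()
        if not words:
            return []
        lists = sorted((self.postings.get(w, []) for w in words), key=len)
        result = set(lists[0])             # start from the rarest word
        for posting_list in lists[1:]:
            result &= set(posting_list)
        return [self.tweets[i] for i in sorted(result, reverse=True)]


def search_slices(slices, query, limit):
    """Time-slicing: `slices` holds one index per period, oldest period first.

    Newer slices are searched first, and older ones are skipped entirely
    once enough results have been collected.
    """
    results = []
    for index in reversed(slices):
        results.extend(index.search(query))
        if len(results) >= limit:
            break
    return results[:limit]
```

Because results are only ever shown newest first, a query for the first page of results rarely needs to touch the older slices at all, and each slice can live on its own machine.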

I’m not trying to suggest that real-time search–or alerting, as it used to be called in ancient pre-Twitter times–isn’t valuable. But, if there is an entry barrier, it is surely not a technical one. Rather, it’s a human one: Twitter’s great achievement, much like Wikipedia’s, is one of human computation: its users supply the content that makes it valuable. Twitter may be much smaller than Facebook, but its single-minded focus on micro-blogging has made it incredibly efficient at what it does, the noise from follower-whores notwithstanding. Twitter’s strength comes from the loyalty of its users.

But this strength is also a vulnerability. As Twitter looks into ways to monetize the attention of its users, it has to be extraordinarily careful not to alienate them. Loyalty is a two-way street.

Categories
Uncategorized

IEEE Computer Special Issue on Information Seeking Support Systems

Check out this month’s issue of IEEE Computer, which offers a special issue on Information Seeking Support Systems. You can read the editors’ introduction (the editors are Gary Marchionini and Ryen White) for free here. I can’t find the rest of the special issue online yet, but I’ll let you know when I do. If anyone here has more details, please let the rest of us know!

Categories
General

Community = Copy Protection

In a post entitled “Want To Know Why Newspapers Are Going Out Of Business? Because Adding Value Never Seems To Be An Option”, Mike Masnick writes:

As we’ve pointed out repeatedly, there are a bunch of sites out there that copy all our content. Not just the headlines and the ledes, but all of the content. Some are pure spam sites. Some are aggregation sites. Some are trying (and failing) to prove the point that we’d get upset if someone copied our stuff. But, that’s not what happens — because this site has much more than just the content. It has the community. It has the Insight Community, where we actually help the community make money. Some of our community members made five figures in 2008. What newspaper has done that for their community? Our community has great ongoing discussions all the time. These other sites can’t replicate that. All they can do is end up sending us more traffic.

I don’t always agree with Masnick (see this debate as an example) and I feel strongly that wholesale copying is unethical, not to mention that it violates fair use. I doubt Masnick disagrees on either count. But he’s right that preventing copying through technical and legal means is, for the most part, a futile battle: at most, you can go after high-profile, blatant offenders.

But community can’t be copied. Even if you mirrored all of this blog’s content and put someone else’s name on it, the comment threads would still live here. You could copy those too, but only the readers who came here could participate in the conversation, and I believe that would still draw most of you.

I’m not encouraging anyone to test this theory–I’d really rather not have rogue versions of this blog proliferating in the hands of unscrupulous spammers. But I do think that Masnick is onto something: the only real copy protection is making your value proposition inherently uncopiable. Building a community where readers participate is a great way to create such value.

Categories
General

How Does Your Organization Use SharePoint?

For those of you unfamiliar with Microsoft SharePoint, it is a collection of products and technologies that include browser-based collaboration and a document-management platform. At least that’s what the Wikipedia entry says; I gave up after a few minutes of searching for an official definition. But official definitions notwithstanding, many organizations are using SharePoint, which raises the question of how they are using it.

George Dearing called my attention to a study by AIIM and Oracle, reported in CMS Wire as “Study Finds SharePoint Primarily Used for File Sharing”, that tries to answer this question.

Their key findings:

  • 83% of survey respondents currently use, or are planning to use, SharePoint
  • 47% of current SharePoint users use it primarily for file sharing
  • few (no number given) use it for complex business processes, records management or digital asset management

Given the increasingly tight integration of SharePoint and FAST, Microsoft’s enterprise search offering, my interest in how people use SharePoint is a bit more than idle curiosity. 🙂

At Endeca, we use Alfresco as a content management system, but of course we use our own dog food when it comes to search. I’m always curious to learn more about what it’s like behind other people’s firewalls, and I hope some of you will indulge me.

Categories
General

Ranked Set Retrieval

I haven’t posted any ramblings about information retrieval theory in a while. Some of you might be grateful for this lull, but this post is for those of you who miss such thoughts. Everyone else: you’ve been warned!

Here’s what I’ve been thinking about. At one extreme, we have set retrieval, which, given a query, divides a corpus into two subsets corresponding to those documents the system believes to be relevant and those it does not–a binary split. At the other extreme, we have ranked retrieval, which orders documents according to their estimated likelihood of relevance. Given the poor reputation of extremism, I want to explore the space between these extremes.

In both extreme cases, the system returns an ordered sequence of subsets of the corpus, and I propose we consider this as a general framework, which we might call ranked set retrieval. In the first case, the system returns two sets; in the second case, it returns as many singleton sets as there are documents in the corpus. In practice, of course, even ranked retrieval systems tend to dismiss some subset of the corpus as irrelevant, which we can model in our ranked set retrieval framework by appending that subset to the end of the ranked sequence of singletons.
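
As a minimal sketch of this framework in Python, a response can be modeled as a list of sets, best first. The function names are my own, and the sketch assumes that every scored document belongs to the corpus and that scores are distinct.

```python
def set_retrieval(corpus, relevant):
    """Binary split: the documents judged relevant, then everything else."""
    relevant = set(relevant) & set(corpus)
    return [relevant, set(corpus) - relevant]


def ranked_retrieval(corpus, scores):
    """One singleton per scored document, best first, then the dismissed rest.

    `scores` maps a document to its estimated relevance. Documents with no
    score are treated as dismissed and appended as one final subset.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    response = [{doc} for doc in ranked]
    dismissed = set(corpus) - set(ranked)
    if dismissed:
        response.append(dismissed)
    return response
```

Both functions return the same kind of object, an ordered sequence of disjoint subsets that together cover the corpus, which is what lets us compare them.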

Now that we can consider set retrieval and ranked retrieval in the same framework, we can ask interesting questions and reason about how they should inform the evaluation criteria for information retrieval systems.

For example, when is set retrieval a more appropriate response to a query than ranked retrieval? An easy–though only partial–answer there is evident from symmetry: set retrieval is more appropriate in cases where our estimates of relevance are themselves binary, and where we thus have no principled basis for a finer-grained partition. Hence, given such binary relevance assessments, our retrieval algorithm should recognize that our optimal response is to return two subsets. Conversely, the more fine-grained our estimates of relevance, the greater a basis we have for returning more subsets and including those documents estimated to be more relevant in earlier subsets. At the extreme, the relevance estimates for all documents may be so well separated that the optimal response is, in fact, to return a sequence of singleton sets as per conventional ranked retrieval.
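
One way to sketch this in Python is to group documents whose relevance estimates we cannot tell apart. Rounding to a fixed number of decimal places is purely an assumption here, a stand-in for whatever notion of "no principled basis for a finer-grained partition" a real system would use, and `ranked_sets` is a hypothetical name.

```python
from itertools import groupby


def ranked_sets(scores, precision=2):
    """Group documents whose rounded relevance estimates coincide, best first.

    `scores` maps a document to its estimated relevance; `precision` is the
    number of decimal places below which estimates count as indistinguishable.
    """
    def bucket(item):
        return round(item[1], precision)

    # Sort by descending bucket so equal buckets are adjacent for groupby.
    ordered = sorted(scores.items(), key=lambda item: -bucket(item))
    return [{doc for doc, _ in group} for _, group in groupby(ordered, key=bucket)]
```

With binary estimates this yields exactly two subsets, and with well-separated estimates it yields a sequence of singletons, matching the two extremes described above.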

Of course, the interesting cases are in between, i.e., where the optimal response to a query is a collection of subsets corresponding to varying ranges of relevance assessment. Or perhaps we should go beyond bucketing by relevance estimates, and instead optimize for the probability that one of the offered subsets has a high utility reflecting a combination of precision and recall. We could then order the subsets by their utility. In fact, a utility measure for such an approach could be recursive–since each subset is really a subquery or query refinement that can then be partitioned into ranked subsets. Indeed, such a recursive approach closely models the behavior we see with information retrieval systems that support interaction.
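
To illustrate just the ordering step, here is a minimal Python sketch that uses F1 as the utility. F1 is only one possible combination of precision and recall, and I am assuming it here for concreteness. The sketch also assumes we have relevance judgments in hand, as we would in an evaluation setting, and it does not attempt the recursive version.

```python
def f1_utility(subset, relevant):
    """Harmonic mean of a subset's precision and recall (an assumed utility)."""
    hits = len(subset & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(subset)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def order_by_utility(subsets, relevant):
    """Order the offered subsets from highest to lowest utility."""
    return sorted(subsets, key=lambda s: f1_utility(s, relevant), reverse=True)
```

Under this measure a small, mostly relevant subset can outrank a larger one that contains the same relevant documents diluted by noise.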

Why does this subject concern me so much? It’s not just that I’d like to see robust evaluation measures for faceted search and clustering–I’d like to see measures that are able to compare them against ranked retrieval in a common framework, without having to depend on user studies.

Perhaps I’m naively rediscovering paths already explored by folks like Yi Zhang and Jonathan Koren. Their notion of “expected utility based evaluation” does strike a chord. But I don’t see them or anyone else taking the next step and using such an approach to compare the apples and oranges of set and ranked retrieval methods. It’s a missed opportunity, and maybe even a way to bring IR respectability to approaches designed for interactive and exploratory search. If IR can’t come to HCIR, perhaps HCIR can come to IR.

Categories
General

Jeremiah Owyang Defends “Sponsored Conversations”

In a post today entitled “How To Make Sponsored Conversations Work”, Forrester analyst Jeremiah Owyang explains how sponsored conversations–whether through blogs, Twitter, or some other online social medium–can be done right.

He excerpts the following requirements from a report prepared by his fellow Forrester analyst Sean Corcoran:

“1) sponsorship transparency and 2) blogger authenticity.

Sponsorship transparency means that both the marketer and the blogger must make it absolutely clear to the reader community that they are reading paid content – think of Google Adwords “Sponsored Links.” Blogger authenticity means that the blogger should have complete freedom to write in their own voice – even if the content they write about the brand is negative.”

He then goes on to cite Seagate, Panasonic, Symantec, and Wal-Mart as successful examples of companies sponsoring conversations according to these principles.

I have mixed reactions. I like the idea of sponsors as long-term advertisers for blogs and aggregators, e.g., the way that several companies sponsor posts on Techmeme. I’m a lot less keen on the idea of paying a blogger who normally writes unpaid content to bestow his or her reputation on commissioned posts. That crosses the line between advertising and editorial, at least for me. And I can’t imagine how any of this would work on a micro-blogging medium like Twitter.

I find that the bloggers I like reading and the tweeters I follow are people who communicate their passion as text with minimal loss in translation. Maybe I’m just projecting–I know that I’d never want to find my readers questioning whether I’m writing what I really feel.

In any case, my gut reaction–much like Steve Hodson’s–to sponsored conversations is to see them as advertorials. I can see how they might play a key role in an ad-supported revenue model, and they have the potential to be much more interesting than other ads. But I don’t think that independent bloggers should be writing them. As Steve points out, you’ve got to ask yourself: how much is your integrity worth?