Categories
General

Is Global the New Local?

I was just reading a nice article by Mike Elgan in Computerworld entitled “Why global is the new ‘local‘”.

He starts off by talking about the transformations happening in radio:

“Local” radio stations are going national, and even international. That sounds like an opportunity for the stations — they can now reach a larger potential audience for advertisers. But in reality, it’s a problem. The whole radio business model is built around pandering to local community groups, small businesses, area schools and, above all, local listeners. So how do you pander to the old audience without alienating the new one?

He then goes on to explain how the same problem applies to newspapers:

Now you can get local news anywhere. Look, for example, at Lodi, Calif., a medium-size city of about 63,000 people. (You may recall the town from a 1969 Creedence Clearwater Revival song.)

Search Google News for “Lodi” and there it is: more than 4,000 news stories, organized roughly by importance. Getting Lodi news on Google is faster, cheaper, more comprehensive and, well, better than the local Lodi paper. You can get Lodi news even if you’re in Timbuktu. And, of course, you can get county, state, national and international news everywhere. Even if you’re stuck in Lodi.

And here is the money shot:

What’s really going on is that the Internet is punishing inefficiency.

His analysis strikes me as brutally accurate. As much as I criticize the ad-supported model in general and Google’s role in devaluing online content in particular, I think that Elgan does a great job of explaining what may be one of the news industry’s biggest contributions to its own malaise. Indeed, for all of the hype about hyperlocal news, I suspect that the winners in this market will be news providers or aggregators that don’t focus on local news but rather let users find whatever they want.

In an unsuccessful City Council run, Tip O’Neill received the famous advice from his father that “All politics is local.” That was surely true in the 1930s, but the world has changed a bit in the seven decades since.

Fittingly, Elgan concludes his article:

Nothing is local anymore. And it’s a huge opportunity. The new mantra should be: Cover local events exclusively, but for a global audience.

Twitter is Not a Search Engine

Now that Michael Arrington is back from his time off as a blogger, he’s in full force, proclaiming that “It’s Time To Start Thinking Of Twitter As A Search Engine“. There isn’t much to distill from his post, other than that Twitter accumulates lots of social content and “all of it is discoverable at search.twitter.com“.

There are at least two problems with this glib analysis.

First, while it’s true that lots of information gets into Twitter, it’s not clear how much of this information is valuable and how much of it is unique.  Is BackType a search engine? What about Dogpile SearchSpy? Just because you accumulate information socially in “real time” and offer up some sliver of analysis on it doesn’t make you a search engine. There’s got to be a notion of fulfilling information needs. (Update: In fairness, Twitter does help fulfill information needs–though I still maintain that it isn’t a search engine. See discussion at Paul Ogilvie’s blog.)

Second, the search that search.twitter.com supports is minimal–reverse date ordering of Boolean queries. No relevance ranking, user-specified sorting, query refinement, etc. I talked about it in a previous post. If Twitter wants to be taken seriously as a search engine (and I’m not sure that they do), then they need to up their game in the search functionality they offer to users. Right now, their search functionality is textbook–and we’re talking 1970s textbook, if not earlier. Not that there’s anything wrong with that–it shows me that they don’t see search as their primary offering.

I think a search engine–especially an exploratory search engine–on Twitter’s content would be fantastic! But that’s not what Twitter offers today, and I think it’s stealing a few bases to anoint Twitter a search engine just because it sounds like a nice idea.

Let’s accept and appreciate Twitter for what it is: a social network for conversation. And let’s hope that Twitter or others build rich search functionality on top of the content it is encouraging its users to produce.

Check Out TunkRank.com!

A couple of months ago, I put out a challenge to implement an influence measure for Twitter that acquired the personally gratifying (if unmellifluous) name TunkRank.

Well, Jason Adams was up to the challenge and has posted his implementation at http://tunkrank.com/. Check it out! I understand that he’s still ironing out bugs and perhaps even implementing features. He’s even set up a Twitter user for the project: http://twitter.com/tunkrank.

To others considering their own implementations: please don’t be discouraged! I believe this is a ripe area for exploration, and hopefully one that lends itself to fun hackery. Perhaps even an open source project.
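For anyone tempted by that challenge, here is a minimal sketch of the kind of iterative computation I had in mind in the original challenge post–each user’s influence is the sum, over their followers, of the attention those followers can spare, assuming a constant retweet probability p. The graph, parameter values, and function names here are my own toy illustration, not Jason’s implementation.

```python
def tunkrank(followers, following_count, p=0.05, iterations=50):
    """Iteratively compute TunkRank-style influence scores.

    followers: dict mapping user -> list of users who follow them
    following_count: dict mapping user -> number of accounts they follow
    p: assumed probability that a reader retweets what they read
    """
    users = set(followers) | set(following_count)
    influence = {u: 1.0 for u in users}
    for _ in range(iterations):
        new_influence = {}
        for user in users:
            # Each follower f divides their attention among everyone they follow.
            new_influence[user] = sum(
                (1.0 + p * influence[f]) / following_count[f]
                for f in followers.get(user, [])
            )
        influence = new_influence
    return influence

# A tiny made-up follower graph: b and c follow a; c also follows b.
followers = {"a": ["b", "c"], "b": ["c"]}
following_count = {"b": 1, "c": 2}  # a follows no one
scores = tunkrank(followers, following_count)
print(max(scores, key=scores.get))  # a has the most influence in this toy graph
```

Like PageRank, the scores are computed by fixed-point iteration; with a small p, a few dozen iterations are plenty for a graph this size.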

Still, Jason deserves the glory for getting there first. Let’s make it worth his while by stress-testing his site and suggesting ways to improve it!

Twitter’s “Real-Time Search” Ain’t That Hard

Google CEO Eric Schmidt’s dismissal of Twitter as a poor man’s email was petty and comical, but no more amusing than the blogosphere’s obsession with the wonders of “real-time search“. The dominant narrative in the echo chamber seems to be that Google is in danger of being usurped because it is writing off this segment of the information-seeking market.

I actually see some merit to this narrative: search engines in general, and Google in particular, could do a lot to improve their alerting tools. But I want to get something straight: from a technical perspective, real-time search, at least as Twitter implements it, is not that hard.

Let me try to explain this in terms that hopefully do not require a technical background.

Twitter’s search interface offers a simple search box. If users do not use any operators, then the search results are those tweets containing all of the words the user enters. In logical terms, it is as if the terms were combined with a logical AND (e.g., information seeking). In fact, Twitter supports a few Boolean operators, so that it is possible to combine terms with OR (e.g., tunkelang OR dtunkelang) and to exclude a term with the minus sign (e.g., dtunkelang -published). Twitter also supports quotes as a way of requiring two or more words to occur as a phrase (e.g., “noisy channel”). Finally, Twitter supports some other filtering on its advanced search page.
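To make the query semantics concrete, here is a minimal sketch of that kind of Boolean matching. The tokenizer and the query representation (required terms, excluded terms, phrases, OR groups) are my own simplifications for illustration, not Twitter’s actual implementation.

```python
import re

def tokenize(text):
    """Lowercase a tweet and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def matches(tweet, required=(), excluded=(), phrases=(), any_of=()):
    """Check one tweet against a simplified Boolean query:
    required terms (implicit AND), excluded terms (minus sign),
    quoted phrases, and OR groups."""
    tokens = tokenize(tweet)
    token_set = set(tokens)
    if any(term not in token_set for term in required):
        return False
    if any(term in token_set for term in excluded):
        return False
    for phrase in phrases:
        words = tokenize(phrase)
        n = len(words)
        # The phrase must occur as a contiguous run of tokens.
        if not any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            return False
    for group in any_of:
        # At least one alternative from each OR group must appear.
        if not any(term in token_set for term in group):
            return False
    return True

tweets = [
    "Exploring information seeking behavior",
    "noisy channel models in IR",
    "dtunkelang published a new post",
]
print([t for t in tweets if matches(t, required=("information", "seeking"))])
print([t for t in tweets if matches(t, phrases=("noisy channel",))])
```

Note that nothing here ranks anything–a tweet either satisfies the query or it doesn’t, which is exactly the point of the next paragraph.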

But what is important is that Twitter only supports one sort: reverse date order. This vastly simplifies the requirements for Twitter’s inverted index. For those of you unfamiliar with an inverted index, it is much like an index at the back of a book (remember those relics? don’t forget to buy mine!) that associates each word with a list of the documents (in this case, tweets) in which it occurs.

Since Twitter users can only see search results sorted by date, the inverted index presumably maintains its lists in date order. Doing so makes it trivial to add new content, since all additions are at the end of the lists. Moreover, as the index grows, there’s a natural way to partition it into smaller chunks: time-slicing. The problem is, as computer scientists say, embarrassingly parallel.
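Here is a toy version of such an index, just to show why date-ordered posting lists make this easy: new tweets are simple appends, and an AND query is a set intersection followed by reversing the already date-ordered ids. The class and its interface are my own illustration, not Twitter’s architecture.

```python
from collections import defaultdict

class DateOrderedIndex:
    """Toy inverted index for tweets, with posting lists kept in
    arrival (date) order so that adding new content is a simple append."""

    def __init__(self):
        self.postings = defaultdict(list)  # word -> [tweet_id, ...], oldest first
        self.tweets = []                   # tweet_id -> text

    def add(self, text):
        tweet_id = len(self.tweets)
        self.tweets.append(text)
        for word in set(text.lower().split()):
            self.postings[word].append(tweet_id)  # always appended at the end
        return tweet_id

    def search(self, *words):
        """AND query, newest results first: intersect the posting lists,
        then reverse the (already date-ordered) ids."""
        lists = [set(self.postings[w.lower()]) for w in words]
        if not lists:
            return []
        hits = set.intersection(*lists)
        return [self.tweets[i] for i in sorted(hits, reverse=True)]

index = DateOrderedIndex()
index.add("real-time search is not that hard")
index.add("alerting in real-time")
index.add("search functionality matters")
print(index.search("real-time"))  # newest first
```

A production system could keep one such index per time slice and consult the newest slices first–that’s the embarrassingly parallel partitioning described above.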

I’m not trying to suggest that real-time search–or alerting, as it used to be called in ancient pre-Twitter times–isn’t valuable. But, if there is an entry barrier, it is surely not a technical one. Rather, it’s a human one: Twitter’s great achievement, much like Wikipedia’s, is one of human computation: its users supply the content that makes it valuable. Twitter may be much smaller than Facebook, but its single-minded focus on micro-blogging has made it incredibly efficient at what it does, the noise from follower-whores notwithstanding. Twitter’s strength comes from the loyalty of its users.

But this strength is also a vulnerability. As Twitter looks into ways to monetize the attention of its users, it has to be extraordinarily careful not to alienate them. Loyalty is a two-way street.

Community = Copy Protection

In a post entitled “Want To Know Why Newspapers Are Going Out Of Business? Because Adding Value Never Seems To Be An Option“, Mike Masnick writes:

As we’ve pointed out repeatedly, there are a bunch of sites out there that copy all our content. Not just the headlines and the ledes, but all of the content. Some are pure spam sites. Some are aggregation sites. Some are trying (and failing) to prove the point that we’d get upset if someone copied our stuff. But, that’s not what happens — because this site has much more than just the content. It has the community. It has the Insight Community, where we actually help the community make money. Some of our community members made five figures in 2008. What newspaper has done that for their community? Our community has great ongoing discussions all the time. These other sites can’t replicate that. All they can do is end up sending us more traffic.

I don’t always agree with Masnick (see this debate as an example), and I feel strongly that wholesale copying is unethical, not to mention well outside the bounds of fair use. I doubt Masnick disagrees on either count. But he’s right that preventing copying through technical and legal means is, for the most part, a futile battle: at most, you can go after high-profile, blatant offenders.

But community can’t be copied. Even if you mirrored all of this blog’s content and put someone else’s name on it, the comment threads would still live here. You could copy those too, but only the readers who came here could participate in the conversation, and I believe that would still draw most of you.

I’m not encouraging anyone to test this theory–I’d really rather not have rogue versions of this blog proliferating in the hands of unscrupulous spammers. But I do think that Masnick is onto something: the only real copy protection is making your value proposition inherently uncopiable. Building a community where readers participate is a great way to create such value.

How Does Your Organization Use SharePoint?

For those of you unfamiliar with Microsoft SharePoint, it is a collection of products and technologies that includes browser-based collaboration functions and a document-management platform. At least that’s what the Wikipedia entry says; I gave up after a few minutes of searching for an official definition. But official definitions notwithstanding, many organizations are using SharePoint, which raises the question of how they are using it.

George Dearing called my attention to a study by AIIM and Oracle, reported in CMS Wire as “Study Finds SharePoint Primarily Used for File Sharing“, that tries to answer this question.

Their key findings:

  • 83% of survey respondents currently use, or are planning to use, SharePoint
  • 47% of current SharePoint users use it primarily for file sharing
  • few (no number given) use it for complex business processes, records management, or digital asset management

Given the increasingly tight integration of SharePoint and FAST, Microsoft’s enterprise search offering, my interest in how people use SharePoint is a bit more than idle curiosity. 🙂

At Endeca, we use Alfresco as a content management system, but of course we eat our own dog food when it comes to search. I’m always curious to learn more about what it’s like behind other people’s firewalls, and I hope some of you will indulge me.

Ranked Set Retrieval

I haven’t posted any ramblings about information retrieval theory in a while. Some of you might be grateful for this lull, but this post is for those of you who miss such thoughts. Everyone else: you’ve been warned!

Here’s what I’ve been thinking about. At one extreme, we have set retrieval, which, given a query, divides a corpus into two subsets corresponding to those documents the system believes to be relevant and those it does not–a binary split. At the other extreme, we have ranked retrieval, which orders documents according to their estimated likelihood of relevance. Given the poor reputation of extremism, I want to explore the space between these extremes.

In both extreme cases, the system returns an ordered sequence of subsets of the corpus, and I propose we consider this as a general framework, which we might call ranked set retrieval. In the first case, the system returns two sets; in the second case, it returns as many singleton sets as there are documents in the corpus. In practice, of course, even ranked retrieval systems tend to dismiss some subset of the corpus as irrelevant, which we can model in our ranked set retrieval framework by appending that subset to the end of the ranked sequence of singletons.
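The framework is simple enough to write down. Here is a sketch in which a system response is just an ordered list of sets, with set retrieval and ranked retrieval as the two extreme instances; the function names and the threshold parameter are my own illustrative choices.

```python
def set_retrieval(scores, threshold=0.5):
    """Binary split: one set of documents deemed relevant, then the rest."""
    relevant = {d for d, s in scores.items() if s >= threshold}
    return [relevant, set(scores) - relevant]

def ranked_retrieval(scores, threshold=0.0):
    """Singleton sets in score order, with sub-threshold documents
    lumped into a final 'irrelevant' set."""
    ranked = sorted((d for d, s in scores.items() if s >= threshold),
                    key=lambda d: -scores[d])
    response = [{d} for d in ranked]
    tail = {d for d, s in scores.items() if s < threshold}
    if tail:
        response.append(tail)
    return response

scores = {"d1": 0.9, "d2": 0.7, "d3": 0.2}
print(set_retrieval(scores))                    # two sets
print(ranked_retrieval(scores, threshold=0.5))  # singletons plus a tail set
```

Everything in between the two extremes is an ordered sequence of sets of intermediate sizes–that’s the space I want to explore.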

Now that we can consider set retrieval and ranked retrieval in the same framework, we can ask interesting questions and reason about how they should inform the evaluation criteria for information retrieval systems.

For example, when is set retrieval a more appropriate response to a query than ranked retrieval? An easy–though only partial–answer there is evident from symmetry: set retrieval is more appropriate in cases where our estimates of relevance are themselves binary, and where we thus have no principled basis for a finer-grained partition. Hence, given such binary relevance assessments, our retrieval algorithm should recognize that our optimal response is to return two subsets. Conversely, the more fine-grained our estimates of relevance, the greater a basis we have for returning more subsets and including those documents estimated to be more relevant in earlier subsets. At the extreme, the relevance estimates for all documents may be so well separated that the optimal response is, in fact, to return a sequence of singleton sets as per conventional ranked retrieval.

Of course, the interesting cases are in between, i.e., where the optimal response to a query is a collection of subsets corresponding to varying ranges of relevance assessment. Or perhaps we should go beyond bucketing by relevance estimates, and instead optimize for the probability that one of the offered subsets has a high utility reflecting a combination of precision and recall. We could then order the subsets by their utility. In fact, a utility measure for such an approach could be recursive, since each subset is really a subquery or query refinement that can itself be partitioned into ranked subsets. Indeed, such a recursive approach closely models the behavior we see in information retrieval systems that support interaction.
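One plausible way to make that utility idea concrete: score each candidate subset against a set of relevant documents and order the subsets by score. The F-measure here is my own illustrative stand-in for whatever combination of precision and recall one prefers.

```python
def f_measure(subset, relevant):
    """Utility of a subset as the F1 combination of precision and recall."""
    if not subset or not relevant:
        return 0.0
    hits = len(subset & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(subset)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def rank_subsets(subsets, relevant):
    """Order candidate subsets by their estimated utility to the user."""
    return sorted(subsets, key=lambda s: -f_measure(s, relevant))

relevant = {"d1", "d2"}
candidates = [{"d1"}, {"d1", "d2", "d3"}, {"d3"}]
best = rank_subsets(candidates, relevant)[0]
print(best)
```

Note that a high-recall subset with one extra document can beat a perfect-precision singleton–which is exactly the kind of trade-off a purely ranked evaluation measure can’t express.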

Why does this subject concern me so much? It’s not just that I’d like to see robust evaluation measures for faceted search and clustering–I’d like to see measures that are able to compare them against ranked retrieval in a common framework, without having to depend on user studies.

Perhaps I’m naively rediscovering paths already explored by folks like Yi Zhang and Jonathan Koren. Their notion of “expected utility based evaluation” does strike a chord. But I don’t see them or anyone else taking the next step and using such an approach to compare the apples and oranges of set and ranked retrieval methods. It’s a missed opportunity, and maybe even a way to bring IR respectability to approaches designed for interactive and exploratory search. If IR can’t come to HCIR, perhaps HCIR can come to IR.

Jeremiah Owyang Defends “Sponsored Conversations”

In a post today entitled “How To Make Sponsored Conversations Work“, Forrester analyst Jeremiah Owyang explains how sponsored conversations–whether through blogs, Twitter, or some other online social medium–can be done right.

He excerpts the following requirements from a report prepared by his fellow Forrester analyst Sean Corcoran:

“1) sponsorship transparency and 2) blogger authenticity.

Sponsorship transparency means that both the marketer and the blogger must make it absolutely clear to the reader community that they are reading paid content – think of Google Adwords “Sponsored Links.” Blogger authenticity means that the blogger should have complete freedom to write in their own voice – even if the content they write about the brand is negative.”

He then goes on to cite Seagate, Panasonic, Symantec, and Wal-Mart as successful examples of companies sponsoring conversations according to these principles.

I have mixed reactions. I like the idea of sponsors as long-term advertisers for blogs and aggregators, e.g., the way that several companies sponsor posts on Techmeme. I’m a lot less keen on the idea of paying a blogger who normally writes unpaid content to bestow his or her reputation on commissioned posts. That crosses the line between advertising and editorial, at least for me. And I can’t imagine how any of this would work on a micro-blogging medium like Twitter.

I find that the bloggers I like reading and the tweeters I follow are people who communicate their passion in text with minimal loss in translation. Maybe I’m just projecting–I know that I’d never want to find my readers questioning whether I’m writing what I really feel.

In any case, my gut reaction–much like Steve Hodson’s–to sponsored conversations is to see them as advertorials. I can see how they might play a key role in an ad-supported revenue model, and they have the potential to be much more interesting than other ads. But I don’t think that independent bloggers should be writing them. As Steve points out, you’ve got to ask yourself: how much is your integrity worth?

Dunbar Lives!

The other day, I talked about the “real” Twitter: the sparse subgraph of meaningful social relationships buried in the far denser follower graph. Well, it turns out that Facebook’s own “in-house sociologist” Cameron Marlow has documented a similar phenomenon on Facebook:

The average male Facebook user with 120 friends:

  • Leaves comments on 7 friends’ photos, status updates, or wall
  • Messages or chats with 4 friends

The average female Facebook user with 120 friends:

  • Leaves comments on 10 friends’ photos, status updates, or wall
  • Messages or chats with 6 friends

The average male Facebook user with 500 friends:

  • Leaves comments on 17 friends’ photos, status updates, or wall
  • Messages or chats with 10 friends

The average female Facebook user with 500 friends:

  • Leaves comments on 26 friends’ photos, status updates, or wall
  • Messages or chats with 16 friends

Students of sociology have long been familiar with Dunbar’s number, which Wikipedia defines as “the theoretical cognitive limit to the number of people with whom one can maintain stable social relationships”. Others have proposed different limits, but everyone seems to agree that the number is less than 300–something that you might not know from looking at the follower / connection statistics of online social networks.

Of course, this cognitive limit reflects attention scarcity. Wouldn’t it be nice if online social networks reflected it too? I’m trying!

It’s OK To Tweet

The other day, Owen Thomas at Valleywag smirked about the audience at Times Open that “sat and Twittered instead of listening to the speaker”. To which I say, take a look at our tweets and you’ll see that people were listening intently.

I’m glad that Congress isn’t reading Valleywag: CNN reports that members of Congress twittered through Obama’s big speech:

Members of Congress twittered their way through President Obama’s nationally televised speech Tuesday night, providing a first-of-its-kind running commentary that took users of the social networking site inside the packed House chamber.

I hope this mainstream use of Twitter inspires audiences to play a more active role not only as listeners but also contributors to the conversations that good speeches are designed to inspire.

Of course, there remains the question of establishing social norms for live audiences who are torn between looking at the speaker and typing. Ironically, I remember being yelled at in class for *not* taking notes! Perhaps the people who most need coaching are the speakers who have to face live-tweeting audiences. Here’s some advice on the subject from speaking expert Olivia Mitchell.