Categories
General

I’m No Google Fan Boy, But…

I may not be a Google fan boy (start with this post if you’re new here), but the recent column in the Guardian (which, by the way, is one of my favorite Endeca customers) entitled “Google is just an amoral menace” is over the top. In fairness to the Guardian, the column is an opinion piece written by Henry Porter of The Observer, and is hardly representative of the fare I expect from the United Kingdom’s leading liberal voice.

What does Porter tell us? Before he vilifies Google, he goes after Scribd, a popular document sharing website. I’m partial to SlideShare myself, but Scribd is significantly more popular. Porter excoriates Scribd for not doing enough to combat unauthorized reproduction of copyrighted materials:

The point is that even if Scribd removes books, it still allows individuals to advertise services for delivering pirated books by email, which must make it the enemy of every writer and publisher in the world. In effect it has turned copyright law on its head: instead of asking publishers for permission, it requires them to object if and when they become aware of a breach.

I understand how publishers resent file-sharing sites that facilitate digital piracy. Clearly Porter doesn’t feel that laws like the World Intellectual Property Organization Copyright Treaty (implemented in the United States as the DMCA) go far enough–he objects to the “safe harbor” provisions that indemnify an ISP, as long as the ISP responds promptly to infringement allegations. He would like ISPs to be responsible for not publishing unauthorized reproductions in the first place, not just for removing them when publishers complain. He’d probably get along well with the Italian prosecutors who want to throw some Google executives in jail because of a YouTube video.

Indeed, I generally prefer opt-in provisions to opt-out–for example, I’m among the skeptics of Google’s book search settlement. And no, I’m not a Microsoft fan boy either!

But there’s a difference between being a publisher and being an ISP. The safe harbor provision for ISPs is there because ISPs are supposed to be common carriers that provide service to the general public without discrimination. Telephone companies are not liable for slander; snail mail and email providers are not liable for illegal activity conducted through the offline or online post; etc.

Yes, there is contributory copyright infringement. But contributory infringement means that the service provider actually knew or should have known of the infringing activity. It is a doctrine of reactive, not proactive, enforcement.

The point of the safe harbor provision is to ensure that there will be common carriers. Remove it, and there would be a chilling effect on ISPs. You might as well shut down the internet. The only middle ground I can espouse is to eliminate anonymity–but that would have a chilling effect where it matters most: on dissidents in repressive regimes. I do think we overuse and abuse online anonymity, but it has its place.

Back to Porter and Google. He tells us:

Google presents a far greater threat to the livelihood of individuals and the future of commercial institutions important to the community. One case emerged last week when a letter from Billy Bragg, Robin Gibb and other songwriters was published in the Times explaining that Google was playing very rough with those who appeared on its subsidiary, YouTube. When the Performing Rights Society demanded more money for music videos streamed from the website, Google reacted by refusing to pay the requested 0.22p per play and took down the videos of the artists concerned.

Huh? Google walked away from commercial terms it found unfavorable, and that makes Google a bully? I actually grant that Google exercises monopolistic power in some arenas, such as the book search settlement, or in its negotiations with advertisers, but in this case the performers are just whining that Google won’t buy at the price they demand. Unless I’m missing some critical part of the story, it’s the artists who should be mocked for their sense of entitlement. I know that Billy Bragg is a left-wing activist, and perhaps he sees Google as some sort of fascist overlord. But Google surely does not have a monopoly on the distribution of music or music videos, and it’s absurd for artists to feel entitled to have Google distribute their wares and pay them a price that the market is unlikely to bear. Unless the idea is to fix the price of music for the general public–in which case, who is being the fascist?

Porter does make some points that I agree with. His characterization that Google is “a parasite that creates nothing, merely offering little aggregation, lists and the ordering of information generated by people who have invested their capital, skill and time” is a caricature, but not entirely off base. What he’s missing, of course, is that this “creating nothing” is a significant technical feat. But I agree that Google’s relationship to content creators is often parasitic.

And his point about the newspaper industry is spot-on:

One of the chief casualties of the web revolution is the newspaper business, which now finds itself laden with debt (not Google’s fault) and having to give its content free to the search engine in order to survive. Newspapers can of course remove their content but then their own advertising revenues and profiles decline. In effect they are being held captive and tormented by their executioner, who has the gall to insist that the relationship is mutually beneficial. Were newspapers to combine to take on Google they would be almost certainly in breach of competition law.

Of course, I blame the newspapers a bit more for getting themselves into this mess–they didn’t have to give their content away for free. But now that they have, they’re trapped in a catch-22: sustaining the relationship devalues their content, while ending the relationship only works if the industry acts in concert.

In summary, Google has its faults, and it’s important to hold those faults up to the light. But Google is not an “amoral menace”, and attacks like these only reinforce the perception that Google critics are intransigent Luddites. Criticism is most effective when it is informed and even-handed.

Note: Ian Betteridge offers more measured (and briefer) analysis at Technovia in “Some quick thoughts about Google versus the newspapers”.

API for TunkRank Scores

I hope that most readers here have had a chance to try out TunkRank. TunkRank is an application Jason Adams built, in response to a challenge to implement a measure that takes a PageRank-like approach to measuring influence on Twitter.

To my delight:

  • TunkRank has become an influential user on Twitter, with 47 followers, a Twitter Grader score in the 80th percentile, and a TunkRank score in the 83rd percentile.
  • The TunkRank page has a Google PageRank of 4–impressive for such a new site! For perspective, this blog has a PageRank of 5.
  • TunkRank has become more than a stand-alone site. It now offers an API so that people can use TunkRank scores in their own applications. Note that the raw TunkRank score (which is what the API gives you) is meaningful even without the percentiles, since it models the expected number of users who will view a tweet by that user.
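
For readers curious about the mechanics, the computation behind the score can be sketched in a few lines. This is an illustrative implementation of the recursive definition from the original challenge–Influence(X) = the sum, over each follower F of X, of (1 + p · Influence(F)) / |Following(F)|, where p is the probability of a retweet–and not Jason’s actual code:

```python
# Illustrative TunkRank-style computation via fixed-point iteration.
# followers: dict mapping user -> set of users who follow that user
# following_count: dict mapping user -> number of accounts that user follows
# p: probability that a reader retweets (i.e., passes along) what they see

def tunkrank(followers, following_count, p=0.5, iterations=50):
    users = list(followers)
    influence = {u: 1.0 for u in users}  # arbitrary starting point
    for _ in range(iterations):
        new = {}
        for u in users:
            # Each follower F contributes attention diluted by how many
            # accounts F follows, plus a share of F's own influence.
            new[u] = sum(
                (1.0 + p * influence[f]) / following_count[f]
                for f in followers[u]
                if following_count[f] > 0
            )
        influence = new
    return influence
```

Note how the dilution by |Following(F)| is what penalizes followers who follow everyone: their attention is spread too thin to be worth much.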

I’ve observed anecdotally that, when two users have similar numbers of followers, TunkRank favors the user who follows fewer users. That is particularly interesting, since the TunkRank measure only looks at the users who follow you, not the users whom you follow.

This hypothesis is consistent with my claim that users who follow a lot of other users generally participate in a culture of reciprocity (or, to put it less gently, an attention Ponzi scheme) that leads to their obtaining followers who themselves follow a lot of other users. A user’s follower-to-following ratio signals how likely that user is to reciprocate if you follow him or her.

I suspect that the expectation of reciprocation correlates negatively with a user’s TunkRank (and, in my view, influence), and that the best test of this hypothesis is to see whether, holding follower count constant, the follower-to-following ratio correlates positively with TunkRank.
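
The test could be as simple as a within-bucket correlation check: take a band of users with roughly the same follower count and correlate their follower-to-following ratios with their TunkRank scores. Here is a rough sketch with made-up numbers (real scores would come from the TunkRank API); the hypothesis predicts a positive coefficient:

```python
# Sketch of the proposed test. All figures below are hypothetical,
# for illustration only; real data would come from the TunkRank API.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (followers, following, tunkrank) tuples, all with the same follower
# count so that follower count is held constant
users = [(1000, 2000, 3.1), (1000, 500, 8.2), (1000, 100, 12.5), (1000, 1500, 4.0)]
ratios = [flw / fng for flw, fng, _ in users]
scores = [tr for _, _, tr in users]
r = pearson(ratios, scores)  # hypothesis predicts r > 0
```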

In any case, I’m excited about the progress, and again congratulate Jason for making this a reality.

Usability Begins At Home

As some of you may know, I have a multi-faceted identity: I have an awesome day job; I’m a prolific blogger (really?); I’m writing a book; and I have a wonderful family. Of course, it’s easy to forget that all of these facets co-exist, and that not everything stays put within its original context.

Well, the other night, I was talking to my wife (who works in the softwear industry) about the chapter in my faceted search book that addresses user interface design challenges. Specifically, one of the concerns is how to most effectively implement a search box on a site that uses faceted search. There are, of course, a variety of decisions about default behavior and options for configuring alternative behavior. But I did make a strong admonition: including more than one search box on the site will confuse users.

The following day, she sent me an email:

Hey smarty pants.

Am I mistaken, or do I see TWO search boxes on The Noisy Channel?

Ouch. After I’d gotten over the shock that my wife not only pays attention to my rambling but also reads my blog, it took me a moment to realize what she meant–the PostRank widget has its own search box. And it actually has different search behavior than the search box that is built into WordPress. For example, a search for don’t know using the WordPress-supplied search box returns over six pages of results, while entering these words into the widget (at least prior to this post) leads to “No posts found.” Moreover, while the result ranking from the WordPress search is strictly by recency, the result ranking on the widget is based on engagement metrics (see the recent discussion on Twitter).

Since I can’t afford to run a formal usability test, I offer the question to you, my readers: are the two search boxes confusing? Have you ever even noticed them before? And now that you’re aware of them, would you recommend I make any changes to the site? Please bear in mind that I probably cannot customize either widget–I’m just a lazy blogger and use the best parts I can get off the shelf.

Also, my wife has graciously offered a discount on her employer’s wares (wears?) to Noisy Channel readers. Go to the Carole Hochman site and use noisy as a promo code to get 40% off of any non-sale items. The discount expires on May 10th. For the calendar-impaired, that’s Mother’s Day. 🙂

Digg Getting Faceted Search?

I’ve never really dug Digg (though I remember Dig Dug from the 80s), but I’m delighted to hear that they’ve announced plans to “fix search” by, among other things, adding faceted search to the site. Here’s a screen shot, by way of TechCrunch.

Upcoming Digg Search (via TechCrunch)

I’m still not keen on Digg’s social filtering model, which strikes me as particularly open to gaming. But giving the users more control over their exploration of the content will surely make the site more valuable, and possibly even discourage gaming by diluting its effectiveness.

The Digg blog doesn’t say when these “big changes to Digg search” are coming, but hopefully their release of a screen shot means they’re close. As always, I’m excited to see faceted search becoming more broadly accepted as a way to interact with information.

Twoogle?

Lots of buzz tonight about a TechCrunch-reported rumor that Google is in serious talks to acquire Twitter. I can certainly see why Google would rather buy Twitter now, before the company gets any more expensive. They paid a pretty penny for YouTube and perhaps have learned to move more quickly on popular technology they can’t compete with themselves.

Why would Twitter sell now if its value can only go up? Well, they could implode: for all of Twitter’s accelerating traffic and relentless buzz, the company still seems far from figuring out how to monetize any of it. And there’s the much larger Facebook, looming over Twitter and copying some of its successful ideas. I don’t know how risk-averse Twitter is, but I’m sure the decision makers there recognize the high variance on the company’s set of potential outcomes.

I’m not placing bets on the outcome, or even on the veracity of the rumor. Google and Twitter should be talking, so I don’t see why the two companies wouldn’t be. I suspect the more interesting question is whether they can find a mutually agreeable valuation. At least Google, unlike Facebook, won’t have to negotiate the value of its own stock.

Jeff Jarvis Comes Clean

The other day, I attended a launch party where Jeff Jarvis talked about his now best-selling book, “What Would Google Do?” I came back and assembled my reactions in a post entitled “What Would Google Do? / What Does Google Do?”. One of my strongest objections to Jarvis’s shtick is that he extols transparency as a particularly “Googley” attribute.

But now, in an interview with Steve Rubel from Micro Persuasion, Jarvis seems to accept that this really isn’t the case:

Mr. Rubel: You also talk a lot about transparency. Google, however, isn’t the most transparent company. What does the ad industry need to change here?
Mr. Jarvis: Google is not perfect. It expects us all to be transparent — so we can be found in search, so we can benefit from our Googlejuice. But Google is not sufficiently transparent about its ad splits or its Google News sources. So, as our parents would say, this may be a case of doing what Google says more than what it does.

Amen! I’m not saying Google is evil, or trying to incite a round of accusing Google of not living up to its ideals. But I am glad to see Jarvis coming clean on the point I found most objectionable in his presentation. Now I’m at least open to finding out if I should actually get the book and read it.

Twitter Discovery Engine?

I wandered over to Techmeme a bit late today and was greeted with a blaring headline from Twitter co-founder Biz Stone: “The Discovery Engine Is Coming”. Twitter, search, and discovery–that’s enough to earn a click-through from me! But I’m not quite sure what they have in mind.

They say that their integrated search will look something like this 2008 sketch:

Search integration sketch from early July 2008

Hmm. That still feels like their current search to me, only with a different layout. Am I missing something here? When I hear “discovery”, I expect something a bit more exploratory. What I’m reading leads me to believe that they’re working on tightening up the interface, but that they’re not making any significant changes to the functionality. I don’t mean to sound ungrateful about improvements to a service I’m not even paying for, but I am a bit disappointed.

Is there more to this? After all, the blogosphere seems really excited about it, and I’d like to be too. Can anyone help me out?

Warming Up To Cuil

It wasn’t so long ago that Cuil launched and I offered a pretty cold appraisal. Of course, I wasn’t the only one to be underwhelmed after they under-delivered on their pre-launch hype. You’d think start-ups would have figured out by now how to better manage expectations. After playing with the site for a couple of days last summer, I wrote Cuil off as dead on arrival.

But, to my surprise, Cuil did not just roll over and die. In fact, it has improved a lot while I–and most of the world–ignored it. And, just the other day, Stephen Arnold wrote a post entitled “Cuil.com Gets Better” that inspired me to take another look, even before seeing today’s post on Search Engine Land about their new timelines feature.

Initial reaction: the folks at Cuil have definitely stepped up their game. But, compared to when they launched a year ago, I’d say the bar has gone up too. I’d compare Cuil to Kosmix, as they seem to have similar interface goals. Both have warts; I’ll need to play with both for a while to decide which I like better.

Here are two queries that I tried, inspired by Arnold’s post at the Beyond Search blog:

From an architecture point of view, it may not be fair to compare Cuil, which does its own crawling and indexing, to Kosmix, which federates results from other services. But they seem to be aiming for similar experiences, which is all that users care about. One advantage Cuil does have is that, because it uses its own index, it is much, much faster than Kosmix.

In any case, I’m happy to see more attention going to exploratory search, and I am glad that a few companies are bold enough to try to make it work on the web. Perhaps Microsoft will try too, with Jan Pedersen and Stefan Weitz tilting Kumo-emblazoned banners at the grand challenges cited by their co-workers. These are early efforts, but they are promising.

And, to anyone whose technology I’ve criticized harshly in the past, please see posts like these as a sign that I do notice improvements, and that I am more than willing to reconsider my assessments in the face of new evidence.

The Unreasonable Effectiveness of Data

Over the past week, there’s been lots of commentary about “The Unreasonable Effectiveness of Data“, an article by Googlers Alon Halevy, Peter Norvig, and Fernando Pereira in the most recent issue of IEEE Intelligent Systems.

Here are a few posts that have been appearing in my RSS reader:

I’m intrigued by the amount of attention this paper has attracted–especially the vitriol in this post of Stefano’s:

What upset me about that paper is not how they say “oh sure, structure is great, but look overhere: there is a goldmine in all the sand” (which is something I fully resonate with) but they phrased it as a fight, deterministic vs. statistical, trying to convince people that adding structure it not the way to go, it’s basically a global waste of research resources.

And yet, without the <a> tag (that is: machine-readable imposed structure), they wouldn’t be where they are, not they would be able to speak from such a tall soapbox.

I’m actually sympathetic to the view that it’s usually better to have more data than heavier theoretical machinery. But I’ve seen this view taken to an extreme so absurd as to be worthy of an April Fool’s joke–in Chris Anderson’s Wired article about “The End of Theory“. Moreover, that same article quotes Peter Norvig as saying that “All models are wrong, and increasingly you can succeed without them.” Note: Peter Norvig explains here that he was misquoted.

So perhaps Stefano is right to react so harshly.

Wolfram Alpha: First-Hand Impressions

You’d think that, after my less than flattering coverage of the Wolfram Alpha pre-launch hype, they would have blocked my IP address even after the public launch. I certainly thought so.

But I was wrong. Last week, someone from their business development group contacted me to arrange a preview of the system. He spent an hour with me today, discussing the system and showing me how it worked. I couldn’t type into the search box myself, but he did let me suggest queries and then entered them for me. He also encouraged me to speak freely about what I’d seen–they’re actually concerned that the hype and its propagation through the echo chamber are setting inappropriate expectations for the product, and eager for cooler heads to prevail or at least temper the exaggeration.

While I stand by my critique of the marketing, I am persuaded that Wolfram Alpha has built something interesting. It is unfortunate that it’s invited so much comparison to Powerset and is being hyped as a potential Google killer.

As Nova Spivack and others correctly pointed out, Wolfram Alpha isn’t really competing with Google, but rather with Wikipedia and Freebase. Wolfram Alpha offers two core competencies. The first is a vast amount of curated knowledge: they claim to have roughly 20 trillion “facts” as raw inputs, not the results of any kind of inference process. The second is the capability to relate those facts programmatically by exploiting their formal structure and to generate inferences from them.

Yes, Wolfram Alpha also has a natural language component, but that aspect of their offering is, by their own admission, its weakest point. Having a natural language interface is a necessary evil to provide a way for human beings to access this structured data. But unfortunately it’s the aspect that people have fixated on most–and will likely continue to do so.

In my view, a more productive way to think of Wolfram Alpha is as a highly structured content repository that offers an API in order to get at the content programmatically. I saw example queries like (population of china) + (population of japan) / (population of USA). By itself, that’s a parlor trick that feels a bit like Google Calculator. But the ability to include expressions like is_a (china, country) and population (capital_city (china)) in an application (e.g., Excel) would create real value. Note: this is my fanciful syntax, not theirs.
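
To make the idea concrete, here is what consuming such a service from one’s own code might look like. Everything below–the function names, the fact table, and the values in it–is my own invention for illustration, not Wolfram Alpha’s actual API or data:

```python
# Hypothetical sketch of programmatic access to a curated-facts service.
# A real client would call a remote API; here a local table stands in for it.
# All fact values are illustrative placeholders, not authoritative figures.

FACTS = {
    ("population", "china"): 1_330_000_000,
    ("population", "japan"): 127_000_000,
    ("population", "usa"): 304_000_000,
    ("capital_city", "china"): "beijing",
    ("population", "beijing"): 17_000_000,
    ("is_a", "china"): "country",
}

def query(relation, entity):
    """Look up one curated fact, e.g. query('population', 'china')."""
    return FACTS[(relation, entity.lower())]

# (population of china + population of japan) / (population of USA)
ratio = (query("population", "china") + query("population", "japan")) \
        / query("population", "usa")

# Composition of relations: population(capital_city(china))
beijing_pop = query("population", query("capital_city", "china"))
```

The interesting part is the last line: once facts are typed and composable, an application like a spreadsheet can chain lookups without any natural language in the loop.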

In May, Wolfram Alpha plans to open up their web site to the public, which means that people will have access to the natural language interface to their content. While the launch will surely generate a lot of press, I suspect that most people will focus on the natural language interface, or on the inability to handle subjective information needs.

That’s a pity. The technology has challenges, but I think the questions people should be asking are if, when and how they will be able to integrate it with business applications. Wolfram Alpha may have value as a stand-alone alternative to Wikipedia for objective data, but its real potential is as a service to use in other applications.