Categories
General

Google Similar Images: A Glitch?

After my post yesterday about Google’s new image similarity search, a colleague sent me this link:

Mmm, blackberries!

Oops! But, more importantly, this example strongly suggests that Google isn’t just using image content to perform the similarity search. In fact, I’d infer from it that text content can easily overwhelm the contribution of visual similarity. Very interesting…

Categories
Uncategorized

Advertorials Preferred To Ads?

From “Brand Mentions Preferred over Ads” in eMarketer:

Compared with banner ads, pop-up ads, e-mail offers and sponsored links, articles that include brand information were most likely to lead US Internet users to read—and act.

In addition to making a product so compelling it demands coverage, this requires a more natural, PR-focused strategy of getting the word out. Or in some cases, tailoring ads so they look like articles.

I’m no champion of advertising, but at least ads are (usually) honest about their nature. And I am very much in favor of marketing campaigns that aim to earn “natural” name dropping in articles. But it seems eMarketer is either advocating a strategy of pushing advertorials, or neutrally reporting that the strategy is effective. Either way, I find it disturbing. I know we’re a long way from a utopia of transparency, but I thought we were past the infomercials-pretending-to-be-news stage.

Categories
General

Google: News Timeline

The other day, when I was blogging about Google’s news cluster timelines, I lamented their lack of a unified approach towards visualizing news over time. Their launch of Google News Timeline thus leaves me with mixed feelings: it’s a cool interface, but it still doesn’t unify their approach to this space.

First, the good: the interface is attractive and responsive. It works very nicely on structured data (like music releases), and strikes me as a nice incremental improvement on the applications I’ve seen that use David Huynh’s SIMILE Timeline widget. It also lets you make queries based on a variety of data sources:

News Sources: News results (including article snippets, images and videos) from the past 30 days or so are from Google News. Older news results are from Google News Archive Search.

Magazines and Newspapers: You can search for magazines and newspapers that have been digitized and are available through Google News Archive Search and Google Book Search. Images of the front covers of these publications are displayed on the timeline, based on their original publication date.

Blogs: You can view blog post results on the timeline by selecting “Blogs” from the data source menu and typing the name of the blog in the query field.

Baseball Scores: Baseball scores from Retrosheet are displayed on the timeline by selecting “Sports Scores” from the menu bar and entering the name of a team.

Wikipedia Events, Births, and Deaths: You can add events, births, and deaths from Wikipedia by selecting “Wikipedia” from the menu bar and entering the category you’d like displayed on the timeline.

Media from Freebase: You can view information from Freebase about various types of media, including books, music and movies. For example, you can display albums of a particular artist or movies featuring a specific actor.

This variety seems like an embarrassment of riches–and yet I can’t produce the timelines that I (and, I’d think, many people) want. For example, I’d like a timeline of the acquisition activity around Sun–starting from the reports about a month ago that IBM planned to acquire Sun, through today’s news that Oracle is to be the lucky suitor. I can find the relevant set of stories using Newssift, but no timeline visualization (at least not yet). Meanwhile, Google gives me a cool interface and lots of options for formulating queries, but not the flexibility to pick the set of documents I want.

I think it’s telling that the best timelines come from searching on structured data. Not only is this data cleaner, but access to it is based on set retrieval, unlike the ranked retrieval pervasive on Google.com. Perhaps that’s why Google struggles to provide a unified approach: there’s a mismatch between their ranked retrieval algorithms and interfaces designed for set retrieval. Or maybe I just need to wait for a few more beta releases, and it will all come together.
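To make the set-versus-ranked distinction concrete, here is a minimal sketch in Python (my own toy illustration, emphatically not Google’s implementation): set retrieval returns every document matching a predicate, which is what a timeline needs, while ranked retrieval returns only the top-scoring few.

```python
from datetime import date

# Hypothetical toy corpus of (title, published) pairs; the headlines
# and dates are illustrative stand-ins for the Sun coverage.
corpus = [
    ("IBM in talks to acquire Sun", date(2009, 3, 18)),
    ("IBM-Sun negotiations break down", date(2009, 4, 6)),
    ("Oracle agrees to acquire Sun", date(2009, 4, 20)),
]

def set_retrieval(docs, predicate):
    """Return ALL matching documents in chronological order: the
    access pattern a timeline interface wants."""
    return sorted((d for d in docs if predicate(d)), key=lambda d: d[1])

def ranked_retrieval(docs, score, k=2):
    """Return only the top-k documents by relevance score: the model
    behind a conventional search results page."""
    return sorted(docs, key=score, reverse=True)[:k]

# A timeline needs the whole story arc...
for title, day in set_retrieval(corpus, lambda d: "Sun" in d[0]):
    print(day, title)

# ...while ranked retrieval (scoring by recency here) silently drops
# part of it.
for title, day in ranked_retrieval(corpus, lambda d: d[1].toordinal()):
    print(day, title)
```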

Categories
General

Google: Find Similar Images

I spend a fair amount of time criticizing Google for not embracing exploratory search. But today’s launch of a similar images feature is all about exploration, and I’m impressed with what I see.

One of the applications is clarification of ambiguous queries. Rather than using clichéd examples like jaguar or apple, let’s try some original ones:

According to a TechCrunch interview with Google engineering director Radhika Malpani, the approach is based on indexing visual similarity, perhaps along the lines of this well-reported WWW 2008 paper co-authored by Googler Shumeet Baluja.
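For the curious, here is a sketch of what such an approach might look like: a PageRank-style random walk over a graph whose edge weights are pairwise visual similarities, so that images resembling many other results float to the top. This is my reading of that paper’s idea, reconstructed in a few lines of Python, and certainly not Google’s production code.

```python
import numpy as np

def visual_rank(similarity, damping=0.85, iters=50):
    """similarity: an (n, n) symmetric matrix of pairwise visual
    similarities between n images (computing it is the hard part,
    and is elided here)."""
    n = similarity.shape[0]
    # Column-normalize so each image splits its "vote" among the
    # images it resembles.
    col_sums = similarity.sum(axis=0)
    transition = similarity / np.where(col_sums == 0, 1, col_sums)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):  # standard PageRank power iteration
        rank = (1 - damping) / n + damping * (transition @ rank)
    return rank

# Toy example: images 0 and 1 resemble each other; image 2 is an outlier.
sim = np.array([[0.0, 0.9, 0.1],
                [0.9, 0.0, 0.1],
                [0.1, 0.1, 0.0]])
print(visual_rank(sim))  # the similar pair scores highest
```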

The feature is neat, and I think similarity search is a great fit for image search. Regular readers may have read previous posts here about Modista, a startup specializing in exploratory visual search.

Still, I do have three criticisms. First, I find that many searches don’t return enough diversity to make similarity search helpful, e.g., blackberry returns no images of the fruit. Second, the images returned aren’t organized–which seems like a lost opportunity if Google knows enough to cluster them based on pairwise visual similarity (a sketch of what I mean follows). Third, similarity is too fine-grained: I find that similar images are often near-duplicates of the starting image.
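On that second criticism, the raw material for organizing results seems to be there already. Here is a minimal sketch of grouping results once pairwise similarities are in hand (scikit-learn for convenience; the toy similarity matrix is made up):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # `metric=` needs sklearn >= 1.2

# Made-up pairwise visual similarities among four result images:
# say, two phone shots and two fruit shots for the query "blackberry".
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
dist = 1.0 - sim  # convert similarity to distance

labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(dist)
print(labels)  # e.g., [0 0 1 1]: phones in one group, fruit in the other
```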

Nonetheless, this is a solid launch, and I’m delighted to see Google do anything related to exploratory search.

Categories
Uncategorized

Sorry About The Typos!

I discovered belatedly that the Microsoft Word template my publisher had given me was suppressing spelling correction. My bad: I should have realized something was up as I was putting the draft together. In any case, if you are one of the people reviewing the manuscript, please return to the download page and grab the much cleaner copy there. My apologies to those who already suffered through the typos.

Categories
General

Mathew Ingram: Google Helps Newspapers

Mathew Ingram at The Nieman Journalism Lab wrote a post today entitled “Google helps newspapers — period”, intended at least in part to rebut two of my recent posts on the subject: “Is Differentiated Content Enough To Save Newspapers?” and my earlier “Yes, Virginia, Google Does Devalue Everything It Touches”.

He cites a proposal I made regarding the use of robots.txt. I’ll excerpt that proposal fully here:

I’m curious what would happen if a critical mass of publishers used robots.txt to stop being crawled–and publicly announced that they were doing so. In the short term, they’d lose a significant amount of traffic–and that short-term hit in the current economic climate might amount to fiscal suicide. But in the long term it may be the only way for publishers to prove their own brand value, something they may have to do in order to bring Google and their other bêtes noires to the negotiating table.

Ingram responds:

If there were a finite market for news and information, then the search engine could be accused of devaluing it — but that’s not how information works. In fact, oceans of interchangeable news make certain kinds of content even more valuable, not less.

I have to admit that I don’t follow this argument. First, I’m not even sure what it means for the market for news and information not to be finite. Attention is certainly finite. I’ll assume that what Ingram meant is that the market isn’t fully tapped, and hence that it is possible to add to the total value, rather than simply to redistribute it.

He then says:

if a newspaper or media outlet finds its business model severely impacted by the fact that Google excerpts a single paragraph of a news story, then it deserves to fail

But I’m not arguing that Google is doing harm to the news sites through the use of excerpts–perhaps this argument is directed at someone else. Indeed, if newspapers wanted out, they could get out by using robots.txt. Instead, they not only allow Google in, but invest in SEO to get excerpted as often as possible. The current business model for online newspapers depends heavily on Google as a source of traffic.
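For the record, the opt-out mechanism is trivial; a few lines of robots.txt suffice. A minimal sketch (Googlebot is Google’s documented crawler name, though a publisher would want to verify exactly which user agents cover Google News before betting the business on this):

```
# Withdraw the entire site from Google's crawler.
User-agent: Googlebot
Disallow: /

# Leave the door open for every other crawler.
User-agent: *
Disallow:
```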

But that’s also the problem. Here I disagree with Ingram and echo what Nick Carr has said: Google has become a powerful middleman for online content, much like Wal-Mart for physical goods. That’s great if you’re a consumer who likes a year’s supply of pickles for less than $3; not so great if you’re the premium pickle vendor caught in a catch-22: sell on Wal-Mart’s terms or forgo the nation’s leading grocery seller as a distribution channel. You can read about the Wal-Mart / Vlasic story here. Is the physical goods market a finite market in a way that the market for news and information is not?

Ingram continues:

if you are adding more value through context and analysis, then there are many more ways to monetize that than by slapping simple banner or text ads on it — which seems to be the only thing that Daniel and others can imagine newspapers doing.

Actually, my imagination is hardly limited to ad-supported models. As regular readers here know, I’d like to live in a world where people pay for digital content just as they pay for other goods and services they value. But I live in the real world, in which I don’t see any viable alternatives to the ad-supported model showing up soon.

Ingram concludes:

But if you are actually adding value, wouldn’t you like as many people to find out about it as possible? Cutting yourself off from the world’s largest search engine is like cutting off your nose to spite your face.

That argument strikes me as equivalent to saying that, if you’re trying to sell a house, you should price it at a dollar to attract as many buyers as possible. Or that suppliers should do whatever is necessary for Wal-Mart to sell their goods at high volume. Not all supplier relationships lead to sustainable business models.

The problem, as I see it, is that when readers “find out about” a news article through Google, they read it in a hit-and-run fashion that doesn’t give the newspaper a chance to build a relationship with them. From the reader’s perspective, the article may as well have been published by Google. That’s great for Google’s brand equity, but not so great for the newspaper’s. I’m realistic enough to know that no newspaper can afford to individually cut off Google, or search engines in general; it’s a prisoner’s dilemma. But I am curious to see what would happen if a critical mass of publishers did so in concert. Yes, that would cost both them and Google money. But it’s not irrational as a negotiating tactic if it leads Google to consider a distribution of rents that is still worth Google’s while but more favorable to publishers than the present one.

Ingram works in the newspaper industry, and he’s clearly put a lot of thought into this topic. Nonetheless, I remain unconvinced by his arguments. Perhaps we can find our way to a common ground. Whoever is right will hopefully convince the other that he is wrong!

Categories
Uncategorized

All Set On Reviewers!

On Friday, shortly after sending a copy of my Faceted Search manuscript to my publisher, I put out a call for volunteer reviewers. The response has been overwhelming, and I am touched to see this validation that my online “friends” and “connections” truly live up to the meaning of those words, volunteering their offline cycles to help out. It’s a wonderful reminder of the power of community.

Given the diminishing returns on feedback, I feel it would be irresponsible for me to continue soliciting reviewers to take time away from their day jobs to help me. I am already working to incorporate the feedback I’ve received, and I am on track for a June 15th publication date.

Thanks again to all who have reviewed the draft, are in the process of doing so, or were about to volunteer before reading this. Of course, you will all be the first to hear when the book is available!

Categories
General

Is Differentiated Content Enough To Save Newspapers?

The Guardian headline sums it up: “Big newspaper sites ‘erode value of news’, says Sly Bailey“. Sly Bailey is the chief executive of Trinity Mirror, one of the UK’s largest newspaper publishers. Here’s what she has to say:

A consumer is now as likely to discover newspaper content on Google, visit our sites, then flit away before even discovering that it was the Daily Mirror or the Telegraph that created the content in the first place.

Or worse, they may visit an aggregator like Google News, browse a digital deli of expensive-to-produce news from around the world, and then click on an ad served up to them by Google. For which we get no return. By the absurd relentless chasing of unique user figures we are flag-waving our way out of business.

So far, so good: she’s making the devaluation argument we’ve hopefully all seen by now, and one I agree with.

It’s where Bailey goes next that intrigues me:

She called for a change to the accepted norms, arguing that publishers could “reverse the erosion of value in news content” by rejecting a relentless quest for high user numbers, in favour of a move away from “generalised packages of news” to instead concentrate on content with “unique and intrinsic value”.

On one hand, of course it’s necessary for publishers to offer unique value, regardless of how they can monetize it, or else they commoditize themselves by default. On the other hand, it may not be sufficient. Without effective monetization, publishers create value but don’t capture it. That’s fine if you are Wikipedia (which certainly offers content with “unique and intrinsic value”) and manage to get by on donations. But it doesn’t work so well if you are an online newspaper whose efforts serve more to line Google’s pockets than your own.

Let me make this last point more concrete. Say that you’re an online newspaper, and you invest in developing unique content. Google will happily index your content (assuming you allow it to), and thus you create value on the web. You can even monetize some of that value, by delivering ads to people who visit your site. But Google delivers at least as many ads to those same people, with much less effort. Moreover, as long as Google is the gateway to your content (which is the status quo), you’re unlikely to change that distribution of rents, or to build reader loyalty.

What you really want as an online publisher is for people to seek out your content, not just to stumble into it through search engines and aggregators. I’m curious what would happen if a critical mass of publishers used robots.txt to stop being crawled–and publicly announced that they were doing so. In the short term, they’d lose a significant amount of traffic–and that short-term hit in the current economic climate might amount to fiscal suicide. But in the long term it may be the only way for publishers to prove their own brand value, something they may have to do in order to bring Google and their other bêtes noires to the negotiating table.

There are alternative strategies, such as requiring registration or putting up pay walls. But those have the disadvantage that they break the broader link economy (though I may be using the phrase slightly differently from Jeff Jarvis), which on the whole is quite different from the relationship between publishers and search engines / aggregators. I at least believe that The Guardian obtains more brand credit from someone clicking through from this post than from someone seeing it in a sea of search results or aggregated news articles. I recognize that the distinction isn’t always black and white, e.g., aggregators like Techmeme concentrate heavily on a small set of sites that readers come to recognize over time. In general, however, I’d say there is a difference between following a deliberate citation and clicking through a link produced without any human intentionality.

I realize I may come across as a romantic, emphasizing the human element, but the distinction I’m after isn’t sentimental. Rather, it’s the idea that the long-term value of a publisher depends on readers knowing and caring who the publisher is. Publishers need to break through the commodified experience of search engines that, by design, dilute the differentiation among brands. In any case, the current path for many publishers looks like tragedy without the romance. Those that aim for long-term survival will have to take some chances to overcome this inertia.

Categories
General

Plan to Attend SIGIR ’09!

SIGIR 2009 will be held in Boston, Massachusetts, at the Sheraton Boston Hotel and at Northeastern University. The conference will run from July 19-23, 2009. Even if you are not a researcher and shy away from academic conferences, I urge you to take a look at the Industry Track. I may be a bit biased as its organizer, but I think you’ll agree that you’ll be hard pressed to find that much industry search expertise packed into a single day’s agenda at any other event, let alone one that is so reasonably priced.

Registration information isn’t quite finalized, but here is the info from the current registration page:

Conference registration

Registration for the conference will be available in mid-April. To help you plan, the expected advance registration fees for ACM members are below. The actual fees may change, but we do not expect they will be higher.

  • $710 for the main three-day conference, including the conference banquet ($450 for students)
  • $175 for tutorials
  • $150 for workshops
  • A special one-day registration will be available for Wednesday’s Industry Track; the fee for that day is not yet set. No student rate is currently anticipated for Industry Track-only registration.
  • Normal (after mid-May) and late/on-site (after early July) registration fees will be higher.
  • Attendees who are not members of the ACM are charged higher registration fees.

Travel grants and volunteer possibilities for students will be announced later.

Hotel registration

The main conference, including the Industry Track, will be held at the Sheraton Boston Hotel. Tutorials and workshops will be held on the Northeastern University campus.

For information about accommodations and how to reserve rooms, please look here.

Categories
Uncategorized

AmazonFail = TaxonomyFail?

By now, #amazonfail seems like old news (yesterday’s detwitus?), though apparently Amazon’s PR folks are still doing damage control.

But what intrigued me was something in Clay Shirky’s nostra culpa post comparing the collective outrage against Amazon to the Tawana Brawley incident. While the post as a whole did not move me (perhaps because I don’t have any guilt to atone for), I did see a valuable nugget:

The problems they have with labeling and handling contested categories is a problem with all categorization systems since the world began. Metadata is worldview; sorting is a political act. Amazon would love to avoid those problems if they could – who needs the tsouris? — but they can’t. No one gets cataloging “right” in any perfect sense, and no algorithm returns the “correct” results. We know that, because we see it every day, in every large-scale system we use. No set of labels or algorithms solves anything once and for all; any working system for showing data to the user is a bag of optimizations and tradeoffs that are a lot worse than some Platonic ideal, but a lot better than nothing.

Indeed, perhaps the problem is that Amazon relies too much on algorithmic cleverness when it should be taking a more transparent HCIR approach. Perhaps not what Shirky was after, but it’s consistent with all of the versions I’ve heard of what went wrong.