
Off To DC

I’m heading to Washington, DC tomorrow morning, a couple of days before the HCIR ’09 workshop. I’m not sure I’ll have any opportunities to blog while I’m in the nation’s capital, but of course I’ll post a write-up about the workshop when I’m back! Meanwhile, if you need your blog fix, I encourage you to check out some of the blogs I read.


Books! Books! Books!

When my daughter was born almost two years ago, I wondered if she’d grow up reading books. After all, I do most of my reading online, and increasingly find myself reading short articles rather than whole books. Needless to say, she’s loved books so far, even if she’s shredded a few.

But the bigger surprise for me is that books–specifically e-books–have become such a hot industry. When I briefly worked for a consulting firm after grad school in 1999, my first assignment was to evaluate the e-book market. The readers then consisted of the Rocket eBook and the SoftBook Reader. As it turned out, I correctly predicted at the time that the e-book market wasn’t ready for prime time.

But fast forward to the present. Amazon has given the e-book market some credibility: Citigroup estimates that Amazon sold 500K Kindles in 2008, and Forrester predicts it will sell 1.8M units this year.

But the news of the last few days (and even the last 24 hours!) shows that the e-book market is only starting to open up:

  • In May, Sony, whose e-reader sales have lagged behind the Kindle, announced a partnership with Google to make copyright-free books available for free.
  • Google just announced a service called Editions that it plans to launch in 2010 (by which time it will presumably have finalized the Google Books Settlement Agreement).
  • The Internet Archive just announced the Bookserver project as “a growing open architecture for vending and lending digital books over the Internet”.
  • Spring Design just announced Alex, an e-book reader based on Google’s Android operating system.
  • Barnes & Noble is expected to announce an e-reader that competes directly with the Kindle; leaked photos have already generated lots of buzz.

I grew up on books, and I’m excited to see that, a decade after the initial market failures, e-books (like touchscreens) are a mainstream reality. I still worry about who will buy them, especially considering that the marginal cost of distributing a typical e-book is even less than that of distributing a 5-minute song. A quick scan of a popular file-sharing site reveals that the PDF version of the bestseller The Lost Symbol takes up less than 3MB.

Still, I’ll take a moment to celebrate the progress of technology. I’ve always known that reading was cool, but now we have the gadgets to prove it!


Who Will Buy?

As some of you know, I’m a karaoke junkie. But it’s my wife who has the classier repertoire, including “Who Will Buy?” from the musical Oliver!:

Who will buy this wonderful morning?
Such a sky you never did see!
Who will tie it up with a ribbon
And put it in a box for me?

Of course, the trope that the best things in life are free predates musical theater, let alone the web. But recent years have witnessed dramatic changes in our price sensitivities in every genre of digital (or digitizable) content, and I’m curious (sometimes morbidly so) about where it goes from here.

I won’t make you suffer through a rant about the malaise of the music and news industries–those topics, important as they are, have been overplayed in the blogosphere. If you need a refresher, I suggest Lawrence Lessig and the Nieman Journalism Lab as some of the more rational voices contributing to the discussion.

But it’s not just news and music that are experiencing the effects of the “information wants to be free” movement. Consider these industries:

  • Books. Many publishers worry that the Kindle has been setting a consumer expectation that a book should cost only $10. Indeed, a recent price war between Amazon and Wal-Mart drove some of those prices down to $8.99. Is this a boon for consumers, or a body blow to the publishing industry? It’s easy to invoke the $0.99-per-song expectation set by iTunes–but that change was more about disaggregating albums than about changing the per-unit cost. Besides, books have not yet had to confront the scale of unauthorized distribution that we see in the music industry. Legal or not, free is a potent source of price pressure.
  • Software. Wolfram Alpha just made headlines by releasing a $50 iPhone app. Many have responded that such a high price is outrageous and will doom the application to failure. They may be right on that latter point–the market will vote with its clicks soon enough. But I’m old enough to remember $50 as being in the ballpark of what it cost to purchase a new consumer software application. Even then, unauthorized distribution was an issue–remember the “don’t copy that floppy” campaign? Today, my impression is that few people consciously purchase consumer software–a trend that I date at least as far back as Microsoft’s strategy of bundling its software into PC purchases. The most notable exceptions are console games (which are impressive holdouts in the consumer software space) and iPhone apps–with the caveat that only a tiny minority of apps make enough money for their creators to live on. (Update: just saw this note about how EA Sports President Peter Moore sees the current console game business model of cartridges and discs as a “burning platform”.)
  • Television. Between Boxee and Netflix, there is a real chance that digital content’s cash cow, cable television, will see its regional monopolies disrupted. I can’t imagine that anyone will shed a tear for the cable companies. And yet I can’t help but wonder what happens as the notion of premium content is subsumed by an expectation that video content should be free. Are we heading towards a proliferation of cheaply produced reality TV, contests, and game shows–all sponsored by rampant product placement?

If we are to believe Mike Masnick, then the price of content is driven to its marginal cost. It’s pretty clear that the marginal cost of distributing most digital content is, while not zero, close enough to zero to be a rounding error. Should we be looking forward to a world where no one can charge consumers for content? Folks like Jeff Jarvis and Chris Anderson are cheerleading such a world as not only inevitable but a good thing–though both of them have had the sense to make some money on non-free books while the going is good.

Yes, there are and will always be business models to support content creators. In particular, one-time content (live events, consulting services) has some degree of insulation from the inexorable trend toward free. But what an inefficient turn of events, if people are rewarded for creating one-time content but not for creating far more valuable content that is useful to a broad audience of consumers!

I know that there are non-financial incentives that drive scholars, open-source developers, and activists to create free content. Indeed, I personally write this blog without any direct financial incentive. Perhaps these incentives will be the driving forces for content creation in the 21st century. One way or another, I hope we find a way to fund the things we value, rather than devolving into a locally optimal rut where value creation isn’t economic for the creators.

p.s. You can find the lyrics to Oliver! for free online, and you can easily view a free (unauthorized) copy of a performance of “Who Will Buy?” on YouTube. Or you can buy the song for $0.99.


Third Annual Workshop on Search in Social Media (SSM 2010)

I’m proud to announce that Eugene Agichtein, Marti Hearst, and Ian Soboroff have invited me to help organize the upcoming Workshop on Search in Social Media (SSM 2010). The workshop will take place in conjunction with the ACM Conference on Web Search and Data Mining (WSDM 2010), a young conference that has quickly become a top-tier forum for work in these areas. The conference and workshop will be held in my home town of New York–Brooklyn, to be precise!

Here’s the key information from the workshop web site:

Overview

Social applications are the fastest growing segment of the web. They establish new forums for content creation, allow people to connect to each other and share information, and permit novel applications at the intersection of people and information. However, to date, social media has been primarily popular for connecting people, not for finding information. While there has been progress on searching particular kinds of social media, such as blogs, search in others (e.g., Facebook, MySpace, or Flickr) is not as well understood.

The purpose of the 3rd Annual Workshop on Search in Social Media (SSM 2010) is to bring together information retrieval and social media researchers to consider the following questions: How should we search in social media? What are the needs of users, and models of those needs, specific to social media search? What models make the most sense? How does search interact with existing uses of social media? How can social media search complement traditional web search? What new search paradigms for information finding can be facilitated by social media?

SSM 2010 follows up on the highly successful SSM 2009 and SSM 2008 workshops held at SIGIR 2009 and CIKM 2008 respectively. We are looking forward to an equally exciting workshop at WSDM 2010 in New York!

Format and Topics

We are planning for a full-day workshop consisting of invited speakers, organized in both plenary and panel sessions, and a contributed poster/demo session.

We solicit short (under 2 pages) position papers, posters or demo proposals to be presented as part of a poster session, describing late-breaking and novel research results or demonstrations of prototypes or working systems. All topics at the intersection of information finding and social media are of interest, including, but not limited to:

  • Searching blogs, tweets, and other textual social media.
  • Searching within social networks, including expert finding.
  • Searching Wikipedia discussions and revision histories.
  • Searching online discussions, mailing lists, forums, and community question answering sites.
  • The role of human-powered and community question answering.
  • Novel models of information finding and new search applications for social media.
  • The role of timeliness, authority, and accuracy in social media search.
  • Interaction between traditional web search and social media search.
  • User needs assessments and task analysis for social media search.
  • Interactions between searching and browsing in social media.
  • Searching and exploiting folksonomies, tags, and tagged data.
  • Spam and adversarial interactions in social media.

Ideal papers may include late-breaking and novel research results, position and vision papers discussing the role of search in social media, and demonstrations of prototypes or working systems. Note that the workshop proceedings will not be archived or considered a formal publication, in order to encourage an informal atmosphere and to allow authors to publish expanded versions of the work elsewhere.

Poster/demo proposals should be in standard ACM SIG format; more details will be posted soon.

Submissions are due on December 15th. I hope to see some of you there! Meanwhile, feel free to suggest ideas for invited speakers who have done interesting work at the intersection of social media and search, and I’ll share your suggestions with my co-organizers.


Innovation at Huffington Post: Data-Driven Headlines

The other day, I was suggesting to one of my colleagues that Endeca‘s software could help authors write better (translate: more SEO-friendly) headlines. The details of that discussion are proprietary, but I’m sure you can imagine the gist. Still, we all wondered whether authors would be willing to stomach such a left-brain infringement on their right-brain creativity.

But apparently the Huffington Post is blazing new trails in this area. The Nieman Journalism Lab reports that:

The Huffington Post applies A/B testing to some of its headlines. Readers are randomly shown one of two headlines for the same story. After five minutes, which is enough time for such a high-traffic site, the version with the most clicks becomes the wood that everyone sees.
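
For the technically curious, here’s a minimal sketch of how such a headline test might work. The class and method names are my own invention–nothing HuffPo has published–and a real system would also account for impressions, not just raw clicks:

    import hashlib
    from collections import defaultdict

    class HeadlineTest:
        """Randomly assign each reader one of two headlines and tally clicks."""

        def __init__(self, story_id, variant_a, variant_b):
            self.story_id = story_id
            self.variants = {"A": variant_a, "B": variant_b}
            self.clicks = defaultdict(int)

        def headline_for(self, reader_id):
            # Hash the reader into a stable 50/50 bucket so that repeat
            # visits during the trial window see the same variant.
            digest = hashlib.md5(f"{self.story_id}:{reader_id}".encode()).hexdigest()
            return "A" if int(digest, 16) % 2 == 0 else "B"

        def record_click(self, reader_id):
            self.clicks[self.headline_for(reader_id)] += 1

        def winner(self):
            # After the five-minute trial, the variant with more clicks
            # becomes the headline that everyone sees.
            return self.variants[max(self.variants, key=lambda v: self.clicks[v])]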

NJL also reports that Huffington Post social media editor–and long-time Noisy Channel reader–Josh Young uses Twitter to help crowd-source better headlines.

I’m sure this approach must rattle some old-school journalists. And there is a real danger of optimizing for the wrong outcome. For example, including the word “sex” in this message might improve its traffic (the popularity of this post attests to that), but to what end?

Still, I don’t see this use of technology as cramping anyone’s style. Most of us write to be read–especially those in the media industry who are trying to monetize their audiences. Measurable success matters, and there’s no harm in trying to maximize it.


Are Duplicate Tweets Spam?

The Twitterverse is all a-twitter with a new controversy: Twitter has rolled out a new feature that blocks duplicate tweets. As reported on the SocialOomph blog:

Recurring Tweets are a violation no matter how they are done, including whether or not someone pays you to have a special privilege. We don’t want to see any duplicate tweets whatsoever–they pollute Twitter, and tools shouldn’t be given to enable people to break the rules. Spinnable text seems to just be a way to bypass the rules against duplicate updates and essentially provides the same problems.

Hence, from Thursday, October 15th, 2009, 00:00 AM CST we will prevent the entry of recurring tweets on Twitter accounts within the SocialOomph system. Existing recurring tweets on Twitter accounts will all be placed in paused state at that time, so that the content of the tweet text is still accessible to you, but no publishing to Twitter of those tweets will take place.

Not everyone is thrilled with this new feature. My friend (and Noisy Channel reader) Eric Andersen notes: “this doesn’t make a lot of sense to me – many highly regarded Twitter users (e.g. @GuyKawasaki) regularly re-post tweets…primarily because of the “dip” model: re-posting the same tweet means more people will see, especially with an int’l audience.”

On one hand, I loathe inefficient communication, and I see repeated tweets as exposing the inefficiency of the dip model. We won’t get into my differences of opinion with Guy Kawasaki. If Twitter offered better search and control to users, then I think it would make sense for them to consider duplicate tweets as a spam issue.

On the other hand, Twitter search is crude. And the dip model, much as it may raise my personal hackles, is, in fact, what many users embrace. Twitter takes pride in letting users drive innovation, and I think they should be cautious about being too autocratic. Surely many of the people who post duplicate tweets do so with unspammy intentions.

Let’s face it: Twitter is going through growing pains, even if it just inherited the mother of all trust funds. They really do have to address spam. But they might consider doing so in a less heavy-handed way. I suspect that duplicate tweets are mainly a problem because they affect the statistics for Trending Topics–a problem they could easily address without prohibiting the tweets themselves, as the sketch below illustrates. And better search would make it possible for users to take charge of their own experience–a small dose of HCIR would go a long way.
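
To make that concrete, here’s a rough sketch (the names and details are mine, not Twitter’s) of how duplicate tweets could simply be excluded from trend statistics rather than blocked outright:

    from collections import defaultdict

    def trending_counts(tweets):
        """Tally term frequencies for trending topics, skipping tweets that
        duplicate something the same user has already posted.
        `tweets` is an iterable of (user_id, text) pairs."""
        seen = set()
        counts = defaultdict(int)
        for user_id, text in tweets:
            # Normalize whitespace and case so trivial re-postings collapse.
            normalized = " ".join(text.lower().split())
            if (user_id, normalized) in seen:
                continue  # duplicate: keep the tweet, just don't count it
            seen.add((user_id, normalized))
            for term in normalized.split():
                counts[term] += 1
        return counts

The point of the sketch is that filtering at the statistics layer preserves the dip model for users who want it, while protecting Trending Topics from repetition.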

I think Twitter has the best of intentions, and that it is confronting a real problem. I hope they work harder to find the right solution.


Go Shopping, Be Social


If you’re into search startups, then today’s a great day to check out what a couple of them are up to.

TheFind just launched (or relaunched?) a “buying engine” that aspires “to help every shopper find exactly what they want to buy, and to help every merchant, large and small, to reach those shoppers.” It has some nice interface elements, but I can’t say I’m sold on the overall user experience.

Meanwhile, Aardvark just launched a web-based version of its social search application. The site urges users to “ask any question in plain English, and Aardvark will discover the perfect person in your network to answer…in under 5 minutes!” As I’ve commented before, I think they need to embrace the philosophy of “when in doubt, make it public“. But hey, they made Time’s list of the top 50 websites for 2009, so perhaps they are right to ignore my advice.


Structured Search Is On The Table

Freebase. Wolfram Alpha. Google Squared. I hesitate to declare a trend, but there does seem to be a growing interest in more structured approaches to information seeking.

The latest entry is Factual, launched today by Gil Elbaz. Elbaz is no slouch: in 1998, he and Adam Weissman co-founded Applied Semantics (originally known as Oingo) and built a word sense disambiguation engine based on WordNet. In 2003, they sold the company to Google for $102M, where it became the basis of its very lucrative AdSense offering.
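
As an aside for readers unfamiliar with word sense disambiguation: the simplest WordNet-based approach is the Lesk algorithm, which picks the sense whose dictionary gloss best overlaps the surrounding words. NLTK ships a simplified version–this toy example illustrates the general idea, and is emphatically not Applied Semantics’ actual engine:

    # Requires: pip install nltk, then nltk.download("wordnet")
    from nltk.wsd import lesk

    sentence = "I went to the bank to deposit my paycheck".split()
    sense = lesk(sentence, "bank")  # choose a WordNet synset for "bank"
    print(sense.name(), "-", sense.definition())
    # prints whichever sense's gloss overlaps the context words most

Simplified Lesk is crude, but the recipe–score each WordNet sense against the context and keep the best match–is the kernel of the approach.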

According to Factual’s website:

Factual is a platform where anyone can share and mash open data on any subject.  For example, you might find a comprehensive directory of restaurants along with dozens of searchable attributes, a huge database of published books, or a list of every video game and their cheat codes.  We provide smart tools to help the community build and maintain a trusted source of structured data.

Factual’s key product, the Factual Table, provides a unique way to view and work with structured data.  Information in Factual Tables comes from the wisdom of the community and from our powerful data mining tools, and the result is rich, dynamic, and transparent data.

You can read more detailed coverage in Search Engine Land, TechCrunch, ReadWriteWeb, GigaOM, and VentureBeat.

To me, Factual sounds like a hybrid between Freebase and Many Eyes. And, like both, it’s free (as in free beer). Free cuts both ways: the Factual site states clearly that “There is currently no way for us to help you monetize these tables.” As with many companies at this stage, the business model is TBD.

I have mixed feelings. I like the increasing interest by startups in structured search. It’s a step in the right direction, since structure is a key enabler for interaction. But we already have one Freebase (and even Google Base), and it’s not clear that we need yet another company to enable crowd-sourced submission of structured data. Perhaps what we need is a way to incent the sort of behavior that has made Wikipedia so successful. As my colleague Rob Gonzalez (who is rumored to have a blog in the works) is always happy to point out, structured data repositories are a public good that no one is ever willing to pay for. The current best hope seems to be the Linked Data initiative, which sounds great in theory–though I think the jury is still out on whether it will succeed in practice.

My ambivalence aside, I am excited that some of the greatest minds in computer science are focused on bringing more structure to the information seeking process. Even if some of these efforts prove to be false starts, we’re going in the right direction. Structured search is on the table.


Faceted Search Book: Now At Half Price!

Not sure when (or why) this happened, but I just noticed that my Faceted Search book is now almost half off at Amazon, selling for $12.94. Not that it was ever that extravagant a purchase, but at this price you have 48% fewer excuses not to buy your own copy! And, speaking of Amazon, I would appreciate it if folks who have read the book could take a moment to post a review there.


Google Is Sharpening Its Squares

As some of you may remember, I’m excited about Google Squared, a project I see as a great first step toward exploratory search at web scale. Yes, I know that Duck Duck Go, Kosmix, and others are already taking on this challenge, but it makes a difference to see Google throw its weight behind such an ungoogley initiative. Plus Google Squared is ambitious, to say the least–the input is free-form text and the output is highly structured.

Since I’ve beaten up Wolfram Alpha for its overreliance on NLP, I can’t give Google a free pass. It would be nice to be able to give Google Squared more structured guidance (yes, I’m still an HCIR fanatic). But Google Squared seems to achieve far more robust query interpretation than Wolfram Alpha’s–perhaps because supporting exploratory search is less brittle than question answering.

The quality of the tables that Google Squared produces as results is still spotty, but it is a major improvement over the initial release. To those who wrote off Google Squared in June, I suggest you take a second look.