Category: General

General posts, typically analyzing HCIR issues.

Design For Interaction: My SIGMOD Slides

Post author By Daniel Tunkelang
Post date July 12, 2009

http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=designforinteraction-090712215609-phpapp02&stripped_title=design-for-interaction

These are the slides I presented at SIGMOD a couple of weeks ago. The animation on a few of the slides doesn’t come through on SlideShare, but you can always download the PowerPoint if you are so inclined. The other talks in the invited session on Human-Computer Interaction with Information, by Ed Chi and Jeff Heer, were fantastic, as was the joint keynote on visualization by Fernanda Viégas and Martin Wattenberg.

Catching Up On Last Week’s News

Post author By Daniel Tunkelang
Post date July 10, 2009

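I hope everyone had a great week! It looks like I missed some interesting / controversial stories in the tech news / blogosphere, the most notable being the anti-SQL movement, a comparison of open-source search engines, and Google’s announcement of its own operating system.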

Quick reactions:

Regarding the anti-SQL movement, I would have thought the main complaint would be that SQL is too arcane a language for ordinary users to ever use it directly. Instead, the article discusses developers’ complaints about databases, and these are mostly about price, speed, and scale. Evidently even free, open-source databases like MySQL are losing favor relative to tools like Hadoop and Hypertable that don’t offer support for SQL. Of course, this picture comes from a meetup of 150 people that might not be entirely representative of information technology workers.

I know first-hand from my experience at Endeca that, to quote Michael Stonebraker, the “one size fits all” approach to databases is an idea whose time has come and gone. At Endeca, we have built our own special-purpose database to address information needs ill-served by the available OLTP and OLAP technologies. Still, I think it’s premature to declare the death of SQL or of relational databases. But why let that stand in the way of a good story?

On to the open-source search engine comparison. I won’t rehash the critique of the study, which you can find in the 80+ comments from folks like Jeff Dalton, Bob Carpenter, and Otis Gospodnetic. Perhaps the most salient point is that it’s not clear how much sense it makes to perform “out of the box” evaluations. In any case, my impression is that Lucene is by far the dominant player in the open-source search space; the study, if it has any effect, will only be to reinforce that dominance.

And finally, the big news from the big G: a Google Operating System. Even my mom (who couldn’t name an existing operating system) was asking me about it, so clearly this one has made it into the mainstream media. And yet I don’t see why this is such a big deal. We have netbooks, and we even have Linux-based netbooks. As far as I’ve heard, the latter are popular with geeks and cheapskates, but that’s about it–most people are willing to pony up the few extra dollars for Windows XP. Will Google launching a netbook-oriented OS significantly affect this market? I suspect the only route to success is if they meet non-technical users’ needs (browsing, email, media, light document editing) while minimizing their overhead (maintenance, security, compatibility). Will they be a better Ubuntu? Perhaps, much in the way that Chrome is trying to be a better Firefox. Why Google chooses to build its own free, open-source products rather than contribute to mature open-source projects is a mystery to me, but it’s their money and time to spend.

I think that covers the week’s big stories–or at least those that matter most to Noisy Channel readers. Somehow I didn’t manage to come up with an IR / HCIR angle on the Michael Jackson story, or perhaps it’s just that Danny Sullivan beat me to it.

Anyway, I’m back in the saddle, and should soon be back to my normal posting volume. Thank you all for being patient.

The Wild World of SIGMOD

I’m on my way home from SIGMOD 2009, my first experience attending a conference on databases. Actually, it was my first experience attending two conferences on databases, since SIGMOD was held in Providence concurrently with PODS.

Ed Chi, Jeff Heer, and I were invited to SIGMOD for a session in which we shared our perspectives with the database community on Human-Computer Interaction with Information. Yes, database people care about HCIR too! As the SIGMOD organizers correctly pointed out, people interested in HCI don’t often show up at database conferences, and I am both grateful and impressed that they took the initiative to remedy that. In a similar spirit, they invited Martin Wattenberg and Fernanda Viégas to deliver a joint keynote about visualization. Even for those of us who were already familiar with their Many Eyes work, it was a delightful presentation.

Of course, it was a great opportunity for me to learn what database people normally worry about. The conference opened with a keynote by Hasso Plattner, co-founder of software giant SAP. The main take-away of his presentation was that column stores and multi-core computation have improved the efficiency of databases by at least two orders of magnitude, opening a new world of possibilities in information access.

Column stores are pretty hot in this community. I didn’t make it to the research session devoted to them (and which included the paper that received the best-paper award), but I did get to attend the presentation that has attracted the most attention outside SIGMOD, “A Comparison of Approaches to Large-Scale Data Analysis”, a paper by seven authors that compares Hadoop (the open-source implementation of Google’s MapReduce approach), an unspecified commercial row-storage (i.e., conventional) relational database, and the Vertica column-store database. MIT Professor Sam Madden did the presentation, but the author most identified with this work is probably Michael Stonebraker. Indeed, Madden had a number of slides where he asked WWMSS? (“What would Mike Stonebraker say?”), with pithy quotes like “Hadoop is ‘go slow’ for OLAP.” Madden delivered an excellent presentation, but his analysis, which was less than favorable to Hadoop, did rile up some of the audience. Specifically, Berkeley professor Joe Hellerstein suggested that the comparisons were “using the wrong y-axis” by comparing the approaches based on processing time. It would indeed be interesting to compare the development time that was required to use each of the tools the authors compared.

Some other talks I attended and enjoyed:

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Similarity Caching
Indexing Uncertain Data
Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers
Incremental Maintenance of Length Normalized Indexes for Approximate String Matching
Why Not? (work on helping a user understand why an expected record does *not* appear in query results)
Query by Output (a database approach reminiscent of query-by-example in information retrieval)

I also saw two really nice demos:

All in all, I enjoyed three fun and intellectually stimulating days, complete with great food and a harbor cruise in Newport. I’m grateful to the SIGMOD organizers for the invitation to spend a few days in their world, and look forward to integrating what I learned here into my own work.

Even Google Should Beware Of Hubris

Post author By Daniel Tunkelang
Post date June 28, 2009

One of the best words we’ve inherited from the ancient Greeks is hubris (ὕβρις), defined on Wikipedia as “overweening pride, superciliousness, or arrogance, often resulting in fatal retribution or nemesis”. Homer used hubris to drive the plots (and moral lessons) of both of his famous epics, the Iliad and the Odyssey.

Hubris is, of course, a disease that afflicts winners, and it’s hard to pick a stronger winner in today’s online world than Google. Surely Google is the closest thing the web has to an Achilles or Ulysses. But hopefully Google’s legions of computer science PhDs remember a little bit of the Homer they were presumably subjected to in high school or college.

Since that was probably a while ago, even for Google’s youthful employees (LinkedIn reports a median age of 29), here are two modern-day examples of hubris.

At the Enterprise Search Summit last month, Google’s lead product manager for enterprise search had this to say about Microsoft:

“One way of doing enterprise search would be to start something in 2001 that didn’t work. You could then do a complete overhaul in 2003, which also didn’t work. In 2007, you could launch a rip-and-replace system and then … you could acquire a large, random, non-integrated system.”

“I’m not going to name any specific company,” he quipped.

And, just a few days ago, Google’s senior manager of engineering and architecture punctuated a panel discussion at the Structure 09 conference–where he was sharing a stage with a counterpart from Microsoft–with the punchline “If you Bing for it, you can find it.” [CORRECTION: PLEASE READ “An Apology to Vijay Gill”]

There’s no question that Google is trouncing Microsoft in the online world. But that’s no reason to be catty. Indeed, Microsoft has paid dearly for its past hubris, so it’s not like Google needs to look back to Homer for history lessons. As Santayana warned, “Those who cannot remember the past are condemned to repeat it.” Perhaps, instead of worrying so much about how to keep up with the Twjoneses on real-time search, Googlers ought to take a moment to reflect on the information they’ve already indexed.

Are Spammers Taking Over Twitter?

Post author By Daniel Tunkelang
Post date June 27, 2009

Until recently, I’ve noticed the occasional incident where a Twitter “trending topic” was socially engineered by a spammer, usually by an application which auto-tweets on sign-up. But the problem seems to be getting noticeably worse.

Just a few days ago Habitat, a furniture store, used the trending topics as hashtags–including one associated with the disputed Iran election–to pimp their “totally desirable Spring collection”. It made for a great case study in how not to use Twitter.

And today I see that the top two trending topics are What McFLY Song Are and TweetBoard Alpha, both edging out #iranelection. The first spams through a quiz; the second through a request for invitations. It’s enough to make you want to scream, Stop Twitter Spam!

Of course, the solution may be to ignore the trending topics, which we can now see are easily gamed. Even when they’re legitimate, the topics aren’t necessarily all that useful. In the Twitterquake of Michael Jackson’s death, nine of the top ten trending topics related to the Gloved One–one of them even misspelled as Micheal. The tenth related to a hoax that Jeff Goldblum had died.

As I’ve said before, I actually look forward to a spamageddon that forces us to confront the attention scarcity problem head-on. At this rate, perhaps I won’t have to wait much longer.

Aardvark Burrows Out Of Beta

Post author By Daniel Tunkelang
Post date June 27, 2009

I just received an email from Max Ventilla, CEO of social search startup Aardvark, to let me know that Aardvark is now open to anyone who wants to sign up. Well, anyone with a Facebook account–but I can’t imagine that there are many people who are curious to sign up for a service like Aardvark but don’t already have Facebook accounts!

Apparently Michael Arrington got the email too. But Aardvark’s larger PR coup is a feature in the Business section of the Sunday New York Times, entitled “Now All Your Friends Are in the Answer Business”.

I’ve blogged about Aardvark a bit–see my previous pair of posts about the blog + Twitter vs. Aardvark challenge. I like the idea of expert-mediated information seeking, though I have at least two concerns with Aardvark.

The first is in how Aardvark routes questions to experts–I’ve had mixed results that I attribute to the inherent challenge of inferring a topic through natural language processing. I think Aardvark would do well to offer guidance to users both in volunteering their own areas of expertise and in specifying their query topics.

The second is that questions and answers are private. I’m a big fan of “when in doubt, make it public”–and this is a clear-cut case where public is at least the right default. I’m curious how often people ask questions that someone else has already answered. Yes, there’s something to be said about getting an answer from someone in your own social network. But I don’t see any reason that the correspondence has to be private–especially for a question that you’re willing to have routed to a total stranger.

I hope Aardvark addresses both of these concerns, improving its routing and publishing question-answer pairs. As I’ve mentioned in recent posts, I think social search deserves a lot more attention than real-time search, and it’s great to see startups like Aardvark and Hunch working on it.

Search Innovation: Why Can’t We All Just Get Along?

Post author By Daniel Tunkelang
Post date June 26, 2009

It’s unusual for HCIR to make it into the mainstream business press, so I was delighted when Pete Barlas reached out to me in connection with an article he published Wednesday in Investor’s Business Daily, entitled “Bing Feature Has Many Fathers; Rivals Lining Up To Take Credit”.

The genesis for the article was a dispute between Microsoft and Hakia. Hakia’s chief operating officer, Melek Pulatkonak, claims that Bing copied Hakia’s “galleries” features:

“We were approached by Microsoft to show them how the Hakia galleries worked, and we did, and now they have a similar feature — we showed them how to do it,” she said. “We were surprised that it is a featured part of and the most differentiated part of Bing.”

I like the folks at Hakia (I blogged about them a while ago), but here I think they’re over-reacting, at least. The idea of using query refinement to help users focus queries certainly predates both companies, and Hakia, by its own admission, is a relative newcomer to the scene, having launched in 2006.

But the story doesn’t end there. Barlas received a statement from Microsoft claiming that Bing implements faceted search. That’s true for some parts of the site, but it feels like a half-truth. Bing’s general web search offers search suggestions, but does not implement faceted search.

The plot thickens. Vivisimo’s chief scientist, Jerome Pesenti, claims that his company was “really the first one to provide a broad categorized search”. But, as Barlas points out, what Vivisimo offers is clustering, which is neither categorization (at least some of us make a sharp distinction between supervised categorization into predetermined categories and unsupervised clustering) nor faceted search. Marti Hearst offers a good analysis (including a critique of Vivisimo’s Clusty.com) in “Clustering versus faceted categories for information exploration”.
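
To make the distinction concrete, here is a minimal Python sketch of what faceted search means, as opposed to clustering. The records, facet names, and helper functions are all hypothetical illustrations of mine, not how Bing, Vivisimo, or any other vendor implements anything. The point is that facets are predetermined metadata fields whose values can be counted and combined as filters, whereas clustering groups results by similarity with no predefined fields at all.

```python
from collections import Counter

# Hypothetical toy result set: every record carries predetermined facet fields.
RESULTS = [
    {"title": "Kayak review", "type": "article", "topic": "travel"},
    {"title": "Bing launch", "type": "news", "topic": "search"},
    {"title": "Hakia galleries", "type": "news", "topic": "search"},
    {"title": "Flight deals", "type": "article", "topic": "travel"},
]


def facet_counts(results, facet):
    """Count how many results carry each value of one facet field."""
    return Counter(r[facet] for r in results)


def refine(results, **selected):
    """Keep only the results that match every selected facet value."""
    return [r for r in results if all(r[f] == v for f, v in selected.items())]


# Combine two independent facets to narrow the result set.
news_about_search = refine(RESULTS, type="news", topic="search")
```

A clustering system, by contrast, would have to invent the groups (and their labels) from the result text itself, which is why the two approaches are not interchangeable.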

I take some of the credit for explaining these distinctions to Barlas, and he got it–though I’m sure some of the credit is due to others he talked with, including IDC analyst Sue Feldman and Danny Sullivan, editor-in-chief of Search Engine Land.

Squabbling among vendors makes for good press, and there’s a legitimate business interest when companies start threatening each other with lawsuits, as Hakia has said it’s considering. And there’s certainly room for arguments over who has a better approach or implementation.

But let’s–and here I speak as someone who often represents Endeca in these discussions–at least agree to standardize on basic terms that have now been around for a while, like categorization, clustering, and faceted search. There’s enough of a vocabulary problem for our users; let’s not cultivate one in our press relations and legal posturing.

And Bing’s Strongest Vertical Is…Kayak?

Post author By Daniel Tunkelang
Post date June 25, 2009

Many people (myself included) have said that Bing’s strongest vertical is travel. And a number have noted the striking similarity between Bing’s travel search and Kayak.

David Radin:

This feels so much like Kayak that without asking, I assumed Microsoft licensed the technology from Kayak. Can you say “eerily similar”?

David Weinberger:

Bing’s ripping off of Kayak.com has me pretty cheesed.

Charlene Li:

Bing’s flight fare search reminded me very much of Kayak, my favorite travel search engine. In fact, it feels like an exact copy except for one major improvement — the integration of Farecast

That was a few weeks ago. Today, it looks like Kayak’s lawyers decided to do more than notice. As reported in Wired:

“We have contacted them through official channels about concerns about the similarities between Bing and Kayak,” Kayak’s chief marketing officer Robert Birge told Wired.com. “From the look and feel of their travel product, they seem to agree with our approach to the market.”

Indeed. I am not a lawyer, and I have no idea whether Kayak has a legal case. Nonetheless, I can certainly empathize with Kayak’s designers, who must be less than amused to see their distinctive look and feel copied wholesale.

But perhaps the more damning point this makes is that Bing, for all of its claims to be different or innovative, is simply copying the leaders. They certainly have good taste to use Kayak as a model for a travel “decision engine”. But, legal or not, imitation isn’t innovation.

Can Real-Time Search Help Hedge Funds?

Post author By Daniel Tunkelang
Post date June 25, 2009

I haven’t exactly been generous in my opinions about the widespread obsession with “real-time” search. But in today’s Telegraph there’s at least a story that makes sense in theory: “Hedge fund managers betting Twitter will give them an edge in rapid trading”.

In practice, I’m pretty skeptical, as is Gwen Robinson at the Financial Times Alphaville blog. She writes:

That’s very interesting, because several hedge fund managers we spoke to dismissed the idea variously as “all twatter” and “rubbish” – not least because Twitter has carved a reputation more for unfounded speculation and even sensational disinformation than for ground-breaking, market-moving alerts for alpha-hungry fund managers.

I’ll concede that time really is money for hedge funds and other traders who need to make decisions before the rest of the market catches up. But I’m dubious that Twitter–let alone an automated processing of tweets–will enable traders to make better decisions. Moreover, any success would immediately be gamed, along the lines of pump and dump scams. I suppose that hasn’t put a damper on the popularity of StockTwits, but popularity does not necessarily translate to profitability for the traders. I hear that Swoopo (a great example of exploiting behavioral economics) is popular too.

If real-time search is to be useful–and I think it really should be called alerting–then the information it provides has to have some sort of quality assurance, and not just freshness. There’s almost certainly a trade-off, since it usually takes time to vet information for quality, even if the vetting is through crowdsourcing. But that reality doesn’t seem to have sunk in yet for the real-time advocates. I say, give it time.

Free as in Copied from Wikipedia

Post author By Daniel Tunkelang
Post date June 23, 2009

You have to love the irony: Waldo Jaquith of the Virginia Quarterly Review discovered that Free: The Future of a Radical Price, the latest book by Wired Editor-in-chief Chris Anderson, contains “almost a dozen passages that are reproduced nearly verbatim from uncredited sources”–and that was without access to an electronic copy of the book, so he suspects there may be more. Most (though not all) of the plagiarism is from Wikipedia.

Having recently written a book, I can attest to the temptation to copy and paste from Wikipedia, much as I was tempted to copy from the encyclopedia for essays in grade school. In fact, Wikipedia explicitly permits reuse of its content–but only with proper attribution and in conformance with Wikipedia’s Creative Commons Attribution/Share-Alike License or the GNU Free Documentation License. Evidently the book will be available for free online, so perhaps Anderson is in time to clean up the text through appropriate citation. Still, as commenter MrInBetween pointed out on Gawker, “Can’t decide which is more embarrassing — failing to cite wikipedia as a source or using wikipedia as a source.” Perhaps I’m old-school, but I feel that books should cite original sources, not encyclopedias.

In any case, the deeper irony is that the “radical” model Anderson advocates is at least partly responsible for encouraging an economy where it’s easier to profit from other people’s content than from your own. Splogs scrape legitimate blogs, copying their content in order to attract search traffic and generate ad revenue. Sites like the Huffington Post have pushed (or simply shredded) the envelope of fair use by excerpting others’ stories and employing SEO in order to leech their traffic. Radical or not, free can get pretty ugly.

It may not have been his intention, but Anderson has helped uncover a subtext of his advocacy: in a world where the only acceptable price for content is free, there’s a risk that respect for the value of content will correlate to its price.