Categories
General

The Wild World of SIGMOD

I’m on my way home from SIGMOD 2009, my first experience attending a conference on databases. Actually, it was my first experience attending two conferences on databases, since SIGMOD was held in Providence concurrently with PODS.

Ed Chi, Jeff Heer, and I were invited to SIGMOD for a session in which we shared our perspectives with the database community on Human-Computer Interaction with Information. Yes, database people care about HCIR too! As the SIGMOD organizers correctly pointed out, people interested in HCI, like us, don’t often show up at database conferences, and I am both grateful and impressed that they took the initiative to remedy that. In a similar spirit, they invited Martin Wattenberg and Fernanda Viégas to deliver a joint keynote about visualization. Even for those of us who were already familiar with their Many Eyes work, it was a delightful presentation.

Of course, it was a great opportunity for me to learn what database people normally worry about. The conference opened with a keynote by Hasso Plattner, co-founder of software giant SAP. The main take-away of his presentation was that column stores and multi-core computation have improved the efficiency of databases by at least two orders of magnitude, opening a new world of possibilities in information access.
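
For readers outside the database world, the row-versus-column distinction can be sketched in a few lines of Python. This is only a toy illustration, not how any real column store is implemented; the table, field names, and helper functions are all hypothetical.

```python
# Hypothetical toy table; names and values are illustrative only.
rows = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 80},
    {"id": 3, "region": "east", "sales": 50},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_row_store(rows, field):
    # Row layout: every whole record is visited just to read one field.
    return sum(row[field] for row in rows)


def total_column_store(columns, field):
    # Column layout: only the single requested column is scanned.
    return sum(columns[field])


columns = to_columns(rows)
```

Both functions return the same total; the difference is how much data each has to walk through, which is the intuition behind using column layouts for analytic queries that touch only a few fields.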

Column stores are pretty hot in this community. I didn’t make it to the research session devoted to them (which included the paper that received the best-paper award), but I did get to attend the presentation that has attracted the most attention outside SIGMOD, “A Comparison of Approaches to Large-Scale Data Analysis”, a paper by seven authors that compares Hadoop (the open-source implementation of Google’s MapReduce approach), an unspecified commercial row-storage (i.e., conventional) relational database, and the Vertica column-store database. MIT Professor Sam Madden did the presentation, but the author most identified with this work is probably Michael Stonebraker. Indeed, Madden had a number of slides where he asked WWMSS? (“What would Mike Stonebraker say?”), with pithy quotes like “Hadoop is ‘go slow’ for OLAP.” Madden delivered an excellent presentation, but his analysis, which was less than favorable to Hadoop, did rile up some of the audience. Specifically, Berkeley professor Joe Hellerstein suggested that the comparisons were “using the wrong y-axis” by comparing the approaches based on processing time. It would indeed be interesting to compare the development time that was required to use each of the tools the authors compared.

Some other talks I attended and enjoyed:

I also saw two really nice demos:

All in all, I enjoyed three fun and intellectually stimulating days, complete with great food and a harbor cruise in Newport. I’m grateful to the SIGMOD organizers for the invitation to spend a few days in their world, and look forward to integrating what I learned here into my own work.

Categories
Uncategorized

Looking for an IR / Data Mining Job?

No, I’m not recruiting for my team–though I’m always open to research collaborations. But I wanted to call readers’ attention to at least two places that are hiring folks with expertise in information retrieval.

The first is Panjiva, a startup that I’m advising. You can read more about them here. They are looking for a hands-on developer (yes, someone who can code) with background in information retrieval or data mining. The job is in Cambridge, MA, and they want someone local. Check out their jobs page.

The second is Twitter. Yes, you’ve heard of them. What you might not know is that they’re aggressively hiring in their search group. Apparently the company is growing–I’d thought they were at ~30 people, but I just did a reference call for someone and learned that they’ve doubled in the past few months. I have no stake in Twitter except as a user, but I’d love to see them improve their search capabilities. So, if you’re in or near San Francisco and looking for a search job on the bleeding edge, check it out.

Other folks who are trying to hire people with search / information retrieval background: I encourage you to post opportunities in the comments!

Categories
Uncategorized

Off to SIGMOD

I hope you’ve been enjoying my recent posting frenzy, because the noise will be a bit thin in the next couple of weeks. I’m about to head to Providence, where I’ll be attending SIGMOD 2009 and presenting an invited talk on “Design for Interaction”. Yes, database people care about HCIR too! I don’t know what sort of spare time or connectivity I’ll have, but I’ll try to sneak in a blog post or two. Then next week I’ll be on vacation!

Categories
Uncategorized

Faceted Search Book Is In Print!

To paraphrase Navin R. Johnson from The Jerk, the new faceted search books are here! I received my author’s copies this weekend, and I’ve heard from a few people that they’ve received theirs. It even features a blurb from Peter Morville that made me feel warm and fuzzy.

So how do you get a copy? The simplest way may be to order directly from the publisher. Evidently no one has received their pre-orders from Amazon and Barnes & Noble yet. 😦 Alternatively, if you’re attending SIGIR, I understand that Morgan & Claypool, one of the SIGIR sponsors, will be bringing a bunch of copies, so you can pick one up there.

Categories
Uncategorized

Reports from HCOMP 2009

Check out Panos’s extensive live blogging from what, as far as I know, is the first Human Computation Workshop (HCOMP 2009). You can also see the associated #hcomp Twitter activity.

Evidently Luis von Ahn used his keynote to unveil MonoLingo, a human-powered system for translation, but only using people that know one language (no idea if he used the old joke).

According to Panos:

Monolingo relies on the fact that machine translation is pretty good at this point, but not perfect. So MonoLingo starts by translating each word using a dictionary, giving multiple interpretations for each word. The human then (who is a native speaker of the target language) selects the translation for each word and forms the sentence that makes most sense.

I’m curious to hear more: as of this writing, the site is password-protected with no further information.
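
Going only by Panos’s description, the flow might look roughly like the Python sketch below. Nothing here comes from the actual system: the dictionary entries, the function names, and the stand-in for the human chooser are all assumptions made for illustration.

```python
# Hypothetical Spanish-to-English word dictionary; entries are illustrative only.
DICTIONARY = {
    "el": ["the"],
    "banco": ["bank", "bench"],
    "cerró": ["closed", "shut"],
}


def candidate_translations(sentence, dictionary):
    """Return, for each source word, the list of possible target-language words."""
    # Words missing from the dictionary are passed through unchanged.
    return [dictionary.get(word, [word]) for word in sentence.lower().split()]


def assemble(candidates, choose):
    """Build a sentence from the option a target-language speaker picks per word."""
    return " ".join(choose(options) for options in candidates)


candidates = candidate_translations("El banco cerró", DICTIONARY)

# Stand-in for the human step: always pick the first option offered.
draft = assemble(candidates, lambda options: options[0])
```

The interesting part is the `choose` callback: in the system as described, that choice would come from a person who knows only the target language, not from code.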

Categories
Uncategorized

Malcolm Gladwell to Chris Anderson: No “Free” Lunch

Malcolm Gladwell, a staff writer for The New Yorker and author of The Tipping Point, Blink, and Outliers, offers a scathing review of Free: The Future of a Radical Price, the recently written book by Chris Anderson. He doesn’t even mention the plagiarism scandal. Instead, he attacks the book’s thesis, which he characterizes as “an extended elaboration of Stewart Brand’s famous declaration that ‘information wants to be free.’”

Some choice excerpts concern YouTube as a case study:

Anderson is forced to admit that one of his main case studies, YouTube, “has so far failed to make any money for Google.”

“close enough to free” multiplied by seventy-five billion is still a very large number.

If [YouTube] were a bank, it would be eligible for TARP funds.

Ultimately, Gladwell dismisses Anderson as a “technological utopian”. That’s harsh, but I think it’s on target. There’s nothing new in proclaiming that we all wish everything were free. But there’s a lot of hand-waving in Anderson’s argument that cheap is “close enough to free to round down”. I encourage you to read Gladwell’s eloquent and entertaining review here.

Categories
General

Even Google Should Beware Of Hubris

One of the best words we’ve inherited from the ancient Greeks is hubris (ὕβρις), defined on Wikipedia as “overweening pride, superciliousness, or arrogance, often resulting in fatal retribution or nemesis”. Homer used hubris to drive the plots (and moral lessons) of both of his famous epics, the Iliad and the Odyssey.

Hubris is, of course, a disease that afflicts winners, and it’s hard to pick a stronger winner in today’s online world than Google. Surely Google is the closest thing the web has to an Achilles or Ulysses. But I hope Google’s legions of computer science PhDs remember a little bit of the Homer they were presumably subjected to in high school or college.

Since that was probably a while ago, even for Google’s youthful employees (LinkedIn reports a median age of 29), here are two modern-day examples of hubris.

At the Enterprise Search Summit last month, Google’s lead product manager for enterprise search had this to say about Microsoft:

“One way of doing enterprise search would be to start something in 2001 that didn’t work. You could then do a complete overhaul in 2003, which also didn’t work. In 2007, you could launch a rip-and-replace system and then … you could acquire a large, random, non-integrated system.”

“I’m not going to name any specific company,” he quipped.

And, just a few days ago, Google’s senior manager of engineering and architecture punctuated a panel discussion at the Structure 09 conference–where he was sharing a stage with a counterpart from Microsoft–with the punchline “If you Bing for it, you can find it.” [CORRECTION: PLEASE READ “An Apology to Vijay Gill”]

There’s no question that Google is trouncing Microsoft in the online world. But that’s no reason to be catty. Indeed, Microsoft has paid dearly for its past hubris, so it’s not like Google needs to look back to Homer for history lessons. As Santayana warned, “Those who cannot remember the past are condemned to repeat it.” Perhaps, instead of worrying so much about how to keep up with the Twjoneses on real-time search, Googlers ought to take a moment to reflect on the information they’ve already indexed.

Categories
General

Are Spammers Taking Over Twitter?

Until recently, I’d noticed only the occasional incident where a Twitter “trending topic” was socially engineered by a spammer, usually by an application which auto-tweets on sign-up. But the problem seems to be getting noticeably worse.

Just a few days ago Habitat, a furniture store, used the trending topics as hashtags–including one associated with the disputed Iran election–to pimp their “totally desirable Spring collection”. It made for a great case study in how not to use Twitter.

And today I see that the top two trending topics are What McFLY Song Are and TweetBoard Alpha, both edging out #iranelection. The first spams through a quiz; the second through a request for invitations. It’s enough to make you want to scream, Stop Twitter Spam!

Of course, the solution may be to ignore the trending topics, which we can now see are easily gamed. Even when they’re legitimate, the topics aren’t necessarily all that useful. In the Twitterquake of Michael Jackson’s death, nine of the top ten trending topics related to the Gloved One–one of them even misspelled as Micheal. The tenth related to a hoax that Jeff Goldblum had died.

As I’ve said before, I actually look forward to a spamageddon that forces us to confront the attention scarcity problem head-on. At this rate, perhaps I won’t have to wait much longer.

Categories
General

Aardvark Burrows Out Of Beta

I just received an email from Max Ventilla, CEO of social search startup Aardvark, to let me know that Aardvark is now open to anyone who wants to sign up. Well, anyone with a Facebook account–but I can’t imagine that there are many people who are curious to sign up for a service like Aardvark but don’t already have Facebook accounts!

Apparently Michael Arrington got the email too. But Aardvark’s larger PR coup is a feature in the Business section of the Sunday New York Times, entitled “Now All Your Friends Are in the Answer Business”.

I’ve blogged about Aardvark a bit–see my previous pair of posts about the blog + Twitter vs. Aardvark challenge. I like the idea of expert-mediated information seeking, though I have at least two concerns with Aardvark.

The first is in how Aardvark routes questions to experts–I’ve had mixed results that I attribute to the inherent challenge of inferring a topic through natural language processing. I think Aardvark would do well to offer guidance to users both in volunteering their own areas of expertise and in specifying their query topics.

The second is that questions and answers are private. I’m a big fan of “when in doubt, make it public”–and this is a clear-cut case where public is at least the right default. I’m curious how often people ask questions that someone else has already answered. Yes, there’s something to be said about getting an answer from someone in your own social network. But I don’t see any reason that the correspondence has to be private–especially for a question that you’re willing to have routed to a total stranger.

I hope Aardvark addresses both of these concerns, improving its routing and publishing question-answer pairs. As I’ve mentioned in recent posts, I think social search deserves a lot more attention than real-time search, and it’s great to see startups like Aardvark and Hunch working on it.

Categories
General

Search Innovation: Why Can’t We All Just Get Along?

It’s unusual for HCIR to make it into the mainstream business press, so I was delighted when Pete Barlas reached out to me in connection with an article he published Wednesday in Investor’s Business Daily, entitled “Bing Feature Has Many Fathers; Rivals Lining Up To Take Credit”.

The genesis for the article was a dispute between Microsoft and Hakia. Hakia’s chief operating officer, Melek Pulatkonak, claims that Bing copied Hakia’s “galleries” feature:

“We were approached by Microsoft to show them how the Hakia galleries worked, and we did, and now they have a similar feature — we showed them how to do it,” she said. “We were surprised that it is a featured part of and the most differentiated part of Bing.”

I like the folks at Hakia (I blogged about them a while ago), but here I think they’re over-reacting, at least. The idea of using query refinement to help users focus queries certainly predates both companies, and Hakia, by its own admission, is a relative newcomer to the scene, having launched in 2006.

But the story doesn’t end there. Barlas received a statement from Microsoft claiming that Bing implements faceted search. That’s true for some parts of the site, but it feels like a half-truth. Bing’s general web search offers search suggestions, but does not implement faceted search.

The plot thickens. Vivisimo’s chief scientist, Jerome Pesenti, claims that his company was “really the first one to provide a broad categorized search”. But, as Barlas points out, what Vivisimo offers is clustering, which is neither categorization (at least some of us make a sharp distinction between supervised categorization into predetermined categories and unsupervised clustering) nor faceted search. Marti Hearst offers a good analysis (including a critique of Vivisimo’s Clusty.com) in “Clustering versus faceted categories for information exploration”.
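
To make the three terms concrete, here is a deliberately crude Python sketch. It does not reflect how Vivisimo, Bing, Hakia, or anyone else implements these features: the toy corpus, the keyword rule standing in for a trained classifier, and the shared-term grouping standing in for a real clustering algorithm are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; titles and facet values are made up.
docs = [
    {"title": "jaguar car review", "type": "review", "year": 2008},
    {"title": "jaguar habitat facts", "type": "article", "year": 2009},
    {"title": "jaguar car prices", "type": "listing", "year": 2009},
]

# Categorization: documents go into a fixed set of categories chosen in advance.
# A keyword rule stands in for a supervised classifier here.
CATEGORIES = {"car": "Autos", "habitat": "Animals"}


def categorize(doc):
    words = doc["title"].split()
    for keyword, category in CATEGORIES.items():
        if keyword in words:
            return category
    return "Other"


# Clustering: groups emerge from the data, with no labels decided in advance.
# Grouping by shared non-query terms is a crude stand-in for a real algorithm.
def cluster_by_shared_term(docs, query_term):
    clusters = defaultdict(list)
    for doc in docs:
        for term in doc["title"].split():
            if term != query_term:
                clusters[term].append(doc["title"])
    # Keep only terms that actually group more than one document.
    return {term: titles for term, titles in clusters.items() if len(titles) > 1}


# Faceted search: several independent attributes, each summarized with counts,
# and selections on different facets can be combined as filters.
def facet_counts(docs, facet):
    return Counter(doc[facet] for doc in docs)


def refine(docs, **selected):
    return [d for d in docs if all(d[k] == v for k, v in selected.items())]
```

In this sketch, `categorize` can only ever answer with a label from `CATEGORIES`, `cluster_by_shared_term` invents its groups from whatever terms the results happen to share, and `facet_counts` plus `refine` let a user narrow along `type` and `year` independently.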

I take some of the credit for explaining these distinctions to Barlas, and he got it–though I’m sure some of the credit is due to others he talked with, including IDC analyst Sue Feldman and Danny Sullivan, editor-in-chief of Search Engine Land.

Squabbling among vendors makes for good press, and there’s a legitimate business interest when companies start threatening each other with lawsuits, as Hakia has said it’s considering. And there’s certainly room for arguments over who has a better approach or implementation.

But let’s–and here I speak as someone who often represents Endeca in these discussions–at least agree to standardize on basic terms that have now been around for a while, like categorization, clustering, and faceted search. There’s enough of a vocabulary problem for our users; let’s not cultivate one in our press relations and legal posturing.