Categories
Uncategorized

The Craft of Exploratory Search

Posts like this one from Gene Golovchinsky make me feel sad that I didn’t actually attend CHI, though I’m glad I did get to hang out with the exploratory search clique that Wednesday evening. As I’m helping get the HCIR ’09 workshop together, it’s nice to see so much interest in the subject on the HCI side of the house.

Categories
General

CACM Article on DB/IR

In my rush to finish writing my book this month, I haven’t had much time for reading. But I did notice an article in the April ’09 issue of Communications of the ACM that caught my attention: “Database and Information-Retrieval Methods for Knowledge Discovery” (no subscription necessary for online access). I’m not sure what sort of IR system my brain uses, but that title certainly excited a lot of neurons!

It’s a worthwhile read, especially for people unfamiliar with the artificial dichotomy between database and information retrieval research. It’s a bit too academic for my taste–I would have liked to see at least some mention of the commercial efforts to bridge this gap between unstructured and structured information access (hint, hint). And of course there’s too much emphasis on ranking and nary a mention of interactive or exploratory interfaces.

But enough quibbling. Here are a few excerpts to whet your appetite:

DB and IR are separate fields in computer science due to historical accident. Both investigate concepts, models, and computational methods for managing large amounts of complex information, though each began almost 40 years ago with very different application areas as motivations and technology drivers; for DB it was accounting systems (such as online reservations and banking), and for IR it was library systems (such as bibliographic catalogs and patent collections). Moreover, these two directions and their related research communities emphasized very different aspects of information management; for DB it was data consistency, precise query processing, and efficiency, and for IR it was text understanding, statistical ranking models, and user satisfaction.

Structured and unstructured search conditions are combined in a single query, and the query results must be ranked. The queries must be evaluated over very large data sets that exhibit high update rates…A programmer can build such an application through two separate platforms—a DB system for the structured data and an IR search engine for the textual and fuzzy-matching issues. But this widely adopted approach is a challenge to application developers, as many tasks are not covered by the underlying platforms and must be addressed in the application code. An integrated DB/IR platform would greatly simplify development of the application and largely reduce the cost of maintaining and adapting it to future needs.

With a knowledge base that sublimates valuable content from the Web, we could address difficult questions beyond the capabilities of today’s keyword-based search engines. For example, a user might ask for a list of drugs that inhibit proteases and obtain a fairly comprehensive list of drugs for this HIV-relevant family of enzymes. Such advanced information requests are posed by knowledge workers, including scientists, students, journalists, historians, and market researchers. Although it is possible to find relevant answers, the process is laborious and time-consuming, as it often requires rephrasing queries and browsing through many potentially promising but ultimately useless result pages.
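To make the second excerpt concrete, here is a minimal sketch of the "two separate platforms" pattern it describes: a database handles the structured conditions, and application code handles text matching and ranking. Everything in it is my own illustration rather than anything from the article: the wine catalog, the `text_score` and `search` helpers, and the naive term-count ranking (a stand-in for a real IR engine) are all hypothetical.

```python
import sqlite3

# Hypothetical catalog: structured columns (price, in_stock) plus free text.
ROWS = [
    (1, "Rioja Reserva", 18.0, 1, "spanish red wine with oak and cherry"),
    (2, "Malbec", 12.0, 1, "argentinian red wine, dark fruit, red berries"),
    (3, "Albarino", 15.0, 1, "spanish white wine, crisp citrus"),
    (4, "Gran Reserva", 45.0, 1, "spanish red wine, aged red blend"),
    (5, "Carmenere", 11.0, 0, "chilean red wine with pepper"),
]


def build_db():
    """Create an in-memory SQLite database holding the sample catalog."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wines (id INTEGER PRIMARY KEY, name TEXT, "
        "price REAL, in_stock INTEGER, description TEXT)"
    )
    conn.executemany("INSERT INTO wines VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def text_score(description, terms):
    """Naive relevance: count occurrences of the query terms in the text."""
    words = description.lower().replace(",", " ").split()
    return sum(words.count(term.lower()) for term in terms)


def search(conn, terms, max_price):
    """Combine a structured filter with an unstructured, ranked text match."""
    # The "DB" half: exact, structured conditions evaluated by the database.
    rows = conn.execute(
        "SELECT name, description FROM wines "
        "WHERE price <= ? AND in_stock = 1",
        (max_price,),
    ).fetchall()
    # The "IR" half: scoring and ranking done in application code, which is
    # exactly the glue work the excerpt says developers end up writing.
    scored = [(text_score(desc, terms), name) for name, desc in rows]
    matches = [(score, name) for score, name in scored if score > 0]
    matches.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [name for _score, name in matches]
```

Calling `search(build_db(), ["red"], 20.0)` filters by price and stock in SQL, then ranks the survivors by how often "red" appears in the description. Even in a toy like this, the ranking logic lives outside both platforms, which is the maintenance burden an integrated DB/IR platform is meant to remove.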

Enjoy!

Categories
General

DeWitt and Stonebraker vs. MapReduce, Round 2

A few months ago, database titans David J. DeWitt and Michael Stonebraker wrote a polemic entitled “MapReduce: A major step backwards” that received a lot of attention, including responses like “Relational Database Experts Jump The MapReduce Shark”. Those unfamiliar with MapReduce might want to take a look at the Wikipedia entry.

Well, they’re at it again. As Eric Lai reports in Computerworld, DeWitt and Stonebraker have written a paper with Daniel J. Abadi, Samuel Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin entitled “A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks”. The authors are a who’s who of database researchers, and the paper will appear in SIGMOD Record.

Zenobia Godschalk of Aster Data, a database that integrates SQL with MapReduce, has already chimed in: “why wouldn’t you use both SQL AND MapReduce?”

But this is no time for post-partisanship. Not when the database guys are clearly looking for a fight. As Lai suggests, this paper may be a response to Google’s announcement last November that it used MapReduce to sort a terabyte of data in just 68 seconds. Unfortunately, it seems likely that people will eventually reach the obvious conclusion that different approaches are better suited to different tasks. But hopefully we’ll see some nice sparks fly in the meantime.

Categories
Uncategorized

The SEO War On Terror

I try to keep this blog apolitical, but this is just too funny and surreally on-topic to pass up. According to The Register, the British Office of Security and Counter-Terrorism (OSCT) “plans to train government-approved groups in search engine optimisation techniques” in order to “flood the internet” with propaganda.

Perhaps this is just a stimulus package for SEO consultancies. Or perhaps I vastly underestimate the power of information warfare. In any case, I was amused, and I thought it was the right thing to share.

Categories
Uncategorized

Enterprise Search eBook

Ron Miller, Contributing Editor at EContent Magazine and Editor at FierceContentManagement, just released a free eBook on enterprise search:

I’m pleased to announce my new eBook: Unlock the Power of Enterprise Search. I created this eBook in conjunction with Michelle Manafy, who is Editorial Director, Enterprise Group, for Information Today, Inc. The content in the eBook comes from articles I’ve written for EContent Magazine and the Enterprise Search Source Book.

I wrote the foreword, and I’m delighted with how the book came out.

Categories
Uncategorized

Which Blowhard Are You?

I’m a few weeks late to find this flow chart in Wired that tells you which blowhard you are (I’d only seen a small subset of it before). Find out if you are Chris Anderson, Dave Winer, Jason Calacanis, Jeff Jarvis, Mark Cuban, Mike Arrington, Nick Carr, or Seth Godin. Enjoy!

Categories
General

Transparency or FAIL

I’ve long been a proponent of transparency in search engines and recommendation systems, on the grounds that transparency cultivates trust even in the face of the inevitable fallibility of algorithmic models. Perhaps my stance has an ideological tinge. But, as we’ve seen from recent events, transparency isn’t just an academic concern. I’d like to touch on three sets of recent incidents that highlight the need to take transparency more seriously.

Review Sites

Many of us may have had a good laugh to discover that a Belkin employee was using Mechanical Turk to pay reviewers $0.65 per positive review. In contrast, people were less amused by the allegations that Yelp was blackmailing the merchants reviewed on its site. And some question whether the business model of Get Satisfaction is inherently deceptive.

If there is a moral, it is that user-generated content assumes a social contract that the users and their opinions are sincere. We may all claim to be skeptical and cynical, but the repeated outrage at violations of trust suggests otherwise. Let’s not feign shock to discover that human beings are corruptible. But our systems should be less easily manipulated. A movement against excessive online anonymity would be a good start. There’s a trade-off between privacy and information accountability, and we should expect publishers to err toward the latter.

Social Media

Some might say I’m a bit overzealous on the subject of transparency in social media. But, as it turns out, I’m not the only one. Netezza, a data warehousing appliance company, operates a delightfully funny blog called Data Liberators. I commend them for a very perky social media marketing campaign. Except…they don’t make it especially clear that the site is operated by Netezza, and someone slammed them for it. I don’t think Netezza was trying to be sneaky. But transparency is a burden that falls squarely on content producers, not consumers–especially on marketers who aim to persuade. To his credit, Netezza’s VP of Marketing says that he intends to (but has not yet) put an About page on the blog, clearly indicating who runs it.

As it turns out, even the United States Federal Trade Commission (FTC) is taking an interest in transparency in social media. They’d like paid endorsers to disclose their sponsors, and for both to be accountable for adhering to truth in advertising. Legally enforced or not, that strikes me as a good principle for companies that have a long-term interest in their brand equity.

Site Search

I almost feel sorry for Amazon, given the PR fires they’ve had to fight in the past weeks. First, there was the simulated rape game that they briefly carried in their catalog. Then there was the girl-scout-cookie-gate, which still isn’t quite resolved. But this past weekend was a true PR inferno: even now, #AmazonFail remains the top trending topic on Twitter because Amazon apparently started excluding LGBT books from “some searches and best seller lists” on the grounds of their being “adult” materials. (Just learned: Owen Thomas at Gawker claims that a “well-known hacker has come forward and claimed the whole thing was his prank” and that he exploited the feature that lets Amazon users flag books as “inappropriate.”)

I actually don’t have strong feelings about what Amazon chooses to sell on its site, or how it chooses to present it. If Amazon offends me enough, I’ll shop elsewhere. If they break the law, I trust the authorities to step in.

But surely their marketing department cares about the damage that recent incidents have been inflicting on their brand. Moreover, it seems that they are a victim of their lack of transparency–possibly even to themselves! If they could clearly and convincingly explain what appears on their site and why, consumers would surely cut them a lot more slack.

In short, be transparent…or FAIL.

Categories
General

Why Publishers Don’t See Google As A Friend

The raging battle between publishers–particularly the newspapers–and Google has been so overplayed lately that I’m tempted to stop blogging about it until something actually happens beyond the war of words. Still, I recently read two paragraphs that, in my view, neatly summarize the terms of conflict, and I felt compelled to share them.

The first is from Nick Carr, someone I rarely agree with but who in this case strikes me as spot-on:

What Google doesn’t mention is that the billions of clicks and the millions of ad dollars are so fragmented among so many thousands of sites that no one site earns enough to have a decent online business. Where the real money ends up is at the one point in the system where traffic is concentrated: the Google search engine. Google’s overriding interest is to (a) maximize the amount and velocity of the traffic flowing through the web and (b) ensure that as large a percentage of that traffic as possible goes through its search engine and is exposed to its ads.

The second is from Scott Karp, who actually cited the above paragraph in his own post:

Those who argue that Google is a friend to content owners because it sends them traffic overlook the basic law of supply and demand. The value of “traffic” is entirely relative. The more content there is on the web, the less value that content has — because of the surfeit of ad inventory and abundance of free alternatives to paid content — and thus the less value “traffic” has.

There is no doubt that Google is sending lots of traffic to publishers. The problem is that Google has also helped devalue that content, while at the same time taking a plum spot in the value chain. As I’ve said repeatedly, publishers are complicit in their own malaise–in particular, they collectively made the choice to give Google so much leverage over them. Now, of course, they’re trying to renegotiate their relationship, the derision of the blogosphere notwithstanding.

But, as Carr points out (I’m agreeing with him again!):

When it comes to Google and other aggregators, newspapers face a sort of prisoners’ dilemma. If one of them escapes, their competitors will pick up the traffic they lose. But if all of them stay, none of them will ever get enough traffic to make sufficient money. So they all stay in the prison, occasionally yelling insults at their jailer through the bars on the door.

Critics like Jeff Jarvis contend that newspapers are trying to restore an economy of scarcity when they should be embracing an economy of abundance. But attention is a scarce resource, and nothing Google does can change that. Rather, Google has played brilliantly into this scarcity economy and seized consumers’ attention from the publishers, while pitting them against one another by facilitating their commoditization. Google may not be evil, but surely Machiavelli would be proud of this strategy.

Categories
General

What Goes With Wine? Facets, Of Course!

Yesterday’s Wall Street Journal features an article about “What’s Wrong With Wine on the Web”. While it mostly complains about problems with online wine shops, it does feature a few who do it right, and I’m proud that two of the four featured sites use faceted search and are powered by Endeca. Hopefully this helps make up for all the times that my colleagues (especially us vintage Endecans) have had to give the “wine demo”.

Here are screenshots of the two featured sites (yes, I’m into Spanish and South American reds):

Wine.com:

Good Values on South American Reds at Wine.com

K&L Wines:

Spanish Reds on a Budget at K&L Wine Merchants

Categories
Uncategorized

Shooting Down Magpies

As regular readers know, I’m ambivalent about advertising in general, but very clear when it comes to shill marketing campaigns. So it’s with pleasure that I see Marshall Kirkpatrick at ReadWriteWeb outing Magpie clients in his post, “How to Sell Your Soul on Twitter and Who’s Buying“–including Apple, Skype, Cisco, StubHub, and Box.net.

But be sure to read through to the comments: some commenters note that Magpie’s clients may not be the big brands themselves, but rather their affiliates. Still, I see some legitimate guilt by association even if that is the case. After all, a brand should provide and enforce ground rules about how its affiliates behave, especially when their behavior directly affects the brand’s reputation.