Author: Daniel Tunkelang
High-Class Consultant.
Going on Auto-Pilot
I’m spending a week in Akumal without network connectivity. Yes, a real family vacation. No working, no blogging, no reading Techmeme.
But have no fear. I’ve scheduled daily posts in my absence. The Noisy Channel will not go silent! Obviously I won’t be able to participate in the comment threads, and I can only hope that the evil comment spammers won’t use this opportune moment to attack. Meanwhile, I urge you all to take this opportunity to have the last word–at least until I get back!
If you do need to contact the authorities in my absence, I suggest you send a message to Claude Shannon.
Harvesting Knowledge for Wikipedia
In the United States, Thanksgiving is a harvest festival in which we express gratitude for our bounty and turn our thoughts towards altruism–at least while we’re not stuffing ourselves with turkey and pumpkin pie.
And that got me thinking to the mother of online altruistic endeavors, Wikipedia. Specifically, I thought about the bounty of information represented in Wikipedia search queries–especially search queries for which no entry exists. Could we somehow harvest these to improve Wikipedia by suggesting new entries?
Of course, such a proposal immediately raises privacy concerns. It’s clear that any search logging mechanism has to be opt-in and in accordance with the Wikimedia Foundation’s privacy policy that “access to, and retention of, personally identifiable data in all projects should be minimal and should be used only internally to serve the well-being of the projects”. But there are at least two possibilities that I believe would be acceptable to privacy advocates:
- Make opt-in explicit on a per-query basis. In fact, Wikipedia already has a request mechanism that shows up precisely when a search query fails.
- Allow users to opt in to logging for all failed queries, making it clear that the benefit of avoiding extra clicks would come at the cost that they might forget they had agreed to contribute all such queries to the log.
I actually suspect that Wikipedia could log all queries by default, as long as there is no personally identifying information associated with the queries. By that, I mean that, at most, the log indicates whether the query was issued by a registered user. But no id (not even an anonymized one), no time stamp, etc. After the AOL scandal, people are understandably paranoid.
Of course, privacy isn’t the only concern. The other major concern is spammers. The mechanism I’m proposing would attract spammers like moths to a flame, so the apparent popularity of a search term must be taken with a grain of salt. Associating requests with personal identifiers would probably solve the spam problem, but it is out of the question because of the privacy concerns discussed above. CAPTCHAs might help, but they would pose a high entry barrier that, in practice, would probably discourage most users from making requests.
I propose the following alternative:
- Trust registered users not to be spammers (but don’t log their names). I don’t know how easy it is for a spammer to register–or to be detected (e.g., because of an implausibly high activity level).
- Reality-check candidate terms against content, using the Wikipedia corpus, the broader web, or any other available resources. That way, a spammer would have to wage a two-front war on both the query log and the data.
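As a toy sketch of that two-front check (my illustration, not an actual Wikipedia or Endeca pipeline), a failed query would only be suggested as a new entry if enough people asked for it *and* the term shows up in enough real content:

```python
from collections import Counter

def suggest_entries(failed_queries, corpus_docs, min_requests=5, min_corpus_hits=3):
    """Suggest new entry titles from a log of failed search queries.

    A query becomes a candidate only if (a) it was requested often enough,
    and (b) it actually appears in enough corpus documents -- so a spammer
    would have to attack both the query log and the content.
    """
    request_counts = Counter(q.strip().lower() for q in failed_queries)
    suggestions = []
    for term, count in request_counts.most_common():
        if count < min_requests:
            break  # most_common() is sorted, so all remaining terms are too rare
        corpus_hits = sum(1 for doc in corpus_docs if term in doc.lower())
        if corpus_hits >= min_corpus_hits:
            suggestions.append(term)
    return suggestions
```

The thresholds and the substring match are placeholders; a real system would use proper tokenization and a much larger evidence corpus.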
My colleagues and I at Endeca successfully used this last approach at a leading sports programming network (I demoed it at the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU).
I know that there are bigger problems in the world than improving Wikipedia. My heart reaches out to those in Mumbai who are reeling from terrorist attacks. But we must all act within our circle of influence. Wikipedia matters. Scientia potentia est.
Mechanical Turkey
Omar Alonso recently pointed me to work he and his colleagues at A9 did on relevance evaluation using Mechanical Turk. Perhaps anticipating my predilection for wordplay, the authors showed off some of their own:
Relevance evaluation is an essential part of the development and maintenance of information retrieval systems. Yet traditional evaluation approaches have several limitations; in particular, conducting new editorial evaluations of a search system can be very expensive. We describe a new approach to evaluation called TERC, based on the crowdsourcing paradigm, in which many online users, drawn from a large community, each performs a small evaluation task.
Yes, TERC for TREC. In any case, their results show lots to be thankful for:
- Fast Turnaround. We have uploaded an experiment requiring thousands of judgments and found all the HITs completed in a couple of days. This is generally much faster than an experiment requiring student assessors; even creating and running an online survey can take longer.
- Low Cost. Many typical tasks, such as judging the relevance of a single query-result pair based on a short summary, are completed for payment of one cent. (Obviously, tasks that require more detailed work require higher payment.) In our example, we could have all our 2500 judgments completed by 5 separate workers for a total cost of $125.
- High Quality. Although individual performance of workers varies, low cost makes it possible to get several opinions and eliminate the noise. As described in Section 5, there are many ways to improve the quality of the work.
- Flexibility. The low cost makes it possible to obtain many judgments, and this in turn makes it possible to try many different methods for combining their assessments. (In addition, the general crowdsourcing framework can be used for a variety of other kinds of experiments — surveys, etc.)
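The redundancy idea in the list above can be made concrete (this is my sketch of the general principle, not the TERC implementation): with five one-cent judgments per query-result pair, a simple majority vote averages out individual worker noise:

```python
from collections import Counter

def majority_label(judgments):
    """Collapse several workers' judgments of one query-result pair into one label.

    Redundant cheap judgments let us cancel out individual worker noise;
    ties go to whichever label was seen first.
    """
    return Counter(judgments).most_common(1)[0][0]

# Five workers judge one query-result pair; one disagrees with the rest.
votes = ["relevant", "relevant", "not relevant", "relevant", "relevant"]
```

More sophisticated aggregation (weighting workers by agreement with known answers, for instance) is one of the quality-improvement methods the paper alludes to.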
Other folks, particularly Panos Ipeirotis, have worked extensively with Mechanical Turk in their research. At the risk of political incorrectness, today I’d like to thank these folks for the successful exploitation of digital natives to explore new worlds of research.
When In Doubt, Make It Public
One of my recurring themes has been that we need to get over our loss of privacy. But today, as I was reading Jeff Atwood’s “Is Email = Efail?” post about the inevitability of email bankruptcy, I clicked through to a post of his from April 2007 entitled “When In Doubt, Make It Public” and in turn to a post by Jason Kottke entitled “Public and permanent.”
There I struck gold. Kottke suggests that one way to choose web projects is “to take something that everyone does with their friends and make it public and permanent.” Here are his examples:
- Blogger, 1999. Blog posts = public email messages. Instead of “Dear Bob, Check out this movie.” it’s “Dear People I May or May Not Know Who Are Interested in Film Noir, Check out this movie and if you like it, maybe we can be friends.”
- Twitter, 2006. Twitter = public IM. I don’t think it’s any coincidence that one of the people responsible for Blogger is also responsible for Twitter.
- Flickr, 2004. Flickr = public photo sharing. Flickr co-founder Caterina Fake said in a recent interview: “When we started the company, there were dozens of other photosharing companies such as Shutterfly, but on those sites there was no such thing as a public photograph — it didn’t even exist as a concept — so the idea of something ‘public’ changed the whole idea of Flickr.”
- YouTube, 2005. YouTube = public home videos. Bob Saget was onto something.
It’s a pretty compelling argument. Rather than wasting effort in a losing battle to protect the remnants of our privacy, let’s embrace the efficiency of public conversation.
Lauren Weinstein wrote an interesting post today, suggesting that Google’s new SearchWiki feature “provides an interesting platform for the global distribution of secret messages.”
This practice, known as steganography, has been a concern for centuries, but most recently has come up in the context of alleged use by terrorists.
No, I don’t think Google is trying to be evil. Moreover, there are lots of other ways to broadcast steganographically encrypted messages on the web, such as posting comments on unmoderated blogs. But it’s interesting that this is the first “useful” application I’ve seen proposed for SearchWiki.
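To see how low the bar is for this kind of broadcasting (a generic illustration of steganography, nothing specific to SearchWiki), consider hiding a payload in the first letters of an innocuous-looking comment:

```python
def extract_hidden(comment):
    """Recover a message hidden in the first letter of each word."""
    return "".join(word[0] for word in comment.split())

# An innocuous-looking comment whose word initials spell "meetatnoon".
cover = "maybe everyone eats turkey and then nobody objects, okay nancy"
```

Real steganography would of course hide the payload far less obviously (and typically encrypt it first), but the point stands: any public, user-writable surface can carry hidden messages.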
I haven’t written a community post in a while, but I thought that, with everyone getting into the Thanksgiving spirit, perhaps someone might be inspired to give to a Wikipedia entry in need. I’m talking about the semantic search entry, which–as the talk page notes–needs work.
As I told Ron Miller in my recent one-on-one with him:
Semantic search means different things to different people, but broadly falls into two categories: using linguistic and statistical approaches to derive meaning from unstructured text, and using semantic web approaches to represent meaning in content and query structure.
Perhaps someone here could reorganize the Wikipedia entry along these lines?
Or, if you don’t feel sufficiently expert on semantic search to rework the content, perhaps you could help despam the entry, following the example of what I did for the enterprise search entry. I moved the vendors to a separate entry and culled vendors that didn’t have their own Wikipedia entries (which is the accepted “notability” standard).
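The notability cull I describe amounts to a simple filter (a hypothetical helper; the set of existing entries stands in for a real lookup against Wikipedia):

```python
def cull_non_notable(vendors, existing_entries):
    """Keep only vendors that have their own Wikipedia entry.

    `existing_entries` stands in for an actual Wikipedia lookup; a vendor
    without its own entry fails the accepted notability standard.
    """
    have_entry = {name.lower() for name in existing_entries}
    return [v for v in vendors if v.lower() in have_entry]
```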
I know that editing Wikipedia entries is a thankless job. But someone has to do it. And if folks like us don’t then these pages often are overrun by spammers. Think of this as a small contribution to global knowledge management. At the very least, you’ll have my thanks.
Endeca vs. Google, Round 2
OK, it’s not quite Muhammad Ali vs. Joe Frazier or even David vs. Goliath. But, hey, it’s personal, and this is my blog!
A few months ago, I was quoted in a Forbes JargonSpy column, helping to explain why Google isn’t enough for the enterprise. Apparently that hit a nerve, since, shortly afterward, Google Enterprise Search product manager Nitin Mangtani published a sponsored “commentary” on Forbes that some viewed as an advertorial (though Google objected to that characterization).
But the story doesn’t end there. Ron Miller of FierceContentManagement wrote to Mangtani to follow up, and published a Q&A about his reason for publishing the Forbes piece, as well as his rebuttal to Google Search Appliance critics.
While Ron was preparing his questions, he reached out on Twitter to solicit input. I responded, and Ron graciously offered me the same one-on-one treatment, which was published today.
Melodrama aside, I feel that these discussions are useful. Enterprise search has been misunderstood for a long time, and conversations like these at least advance understanding. And hopefully they make for fun reading.
User-Generated Public Relations
While outsourcing the halting problem might not get you very far, outsourcing PR may be the next new thing for retailers. Saul Hansell at the New York Times reports that Amazon is now promoting “user-generated public relations”:
The company has announced what it calls its “Holiday Customer Review Team.” These are six Amazon customers who are particularly active in writing product reviews that it has offered to reporters to discuss gift picks. (They also contribute their recommendations on a page on Amazon’s site.)
Amazon says that members of the team are “real people giving unbiased advice to fellow consumers. They are not employed by Amazon.com, Inc. or its affiliates.”
As Hansell continues, though:
That’s not quite the whole story.
Some team members have been flown to Seattle to conduct broadcast interviews on behalf of the company. Moreover, they have been given free products to review and keep.
Cynical as I am about advertising, I’m actually in favor of people learning about products through sincere reviews, and I can even see a way that companies might favor their biggest fans, a la Steve Jobs. But, to coin a phrase, information wants to be transparent. I expect Amazon to see a backlash if these “unbiased” reviewers turn out to be shills.
Browser Wars: The 2008 Edition
Fresh from reporting why he switched from Firefox to Chrome, CNET’s Stephen Shankland reports that Chrome has a larger market share than he expected:
For comparison, here are the stats for The Noisy Channel, based on the last 30 days (note that these stats don’t reflect users reading the blog through RSS readers):
- Firefox: 58.4%
- Internet Explorer: 19.2%
- Safari: 9.9%
- Chrome: 6.8%
- Mobile (assorted): 3.0%
- Opera: 1.6%
- Mozilla 1.x: 1.1%
- Konqueror: 0.1%
Not quite the same mix as Shankland is seeing at CNET, but Chrome’s share is respectable.
Note: Chrome’s market share may be slightly skewed by my using Chrome to post, since I’ve found it handles my WordPress web client better than Firefox. I’m still faithful to Firefox for everything else. As someone posted recently, no Adblock = no Chrome.
Even so, I’m sure that doesn’t account for more than 1% of traffic. A noticeable minority of Noisy Channel readers are giving Chrome a chance.

