The Noisy Channel

 

Harvesting Knowledge for Wikipedia

November 27th, 2008 · 2 Comments · General

In the United States, Thanksgiving is a harvest festival in which we express gratitude for our bounty and turn our thoughts towards altruism–at least while we’re not stuffing ourselves with turkey and pumpkin pie.

And that got me thinking about the mother of online altruistic endeavors, Wikipedia. Specifically, I thought about the bounty of information represented in Wikipedia search queries–especially search queries for which no entry exists. Could we somehow harvest these to improve Wikipedia by suggesting new entries?

Of course, such a proposal immediately raises privacy concerns. It’s clear that any search logging mechanism has to be opt-in and in accordance with the Wikimedia Foundation’s privacy policy that “access to, and retention of, personally identifiable data in all projects should be minimal and should be used only internally to serve the well-being of the projects”. But there are at least two possibilities that I believe would be acceptable to privacy advocates:

  • Make opt-in explicit on a per-query basis. In fact, Wikipedia already has a request mechanism that shows up precisely when a search query fails.
     
  • Allow users to opt in to logging for all failed queries, making it clear that the convenience of avoiding extra clicks comes at a cost: they might forget that they agreed to contribute all such queries to the log.

I actually suspect that Wikipedia could log all queries by default, as long as there is no personally identifying information associated with the queries. By that, I mean that, at most, the log indicates whether the query was issued by a registered user. But no id (not even an anonymized one), no time stamp, etc. After the AOL scandal, people are understandably paranoid.
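To make the idea concrete, here is a minimal sketch of what such a stripped-down log might look like. The names (FailedQueryRecord, normalize, log_failed_query, tally) are hypothetical and invented for illustration; this is not how Wikipedia actually logs anything, only an assumption about what "no personally identifying information" could mean in practice.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedQueryRecord:
    """Hypothetical log record: no user id, no IP address, no time stamp."""
    query: str
    from_registered_user: bool


def normalize(query):
    # Collapse case and whitespace so trivially different spellings count together.
    return " ".join(query.lower().split())


def log_failed_query(log, raw_query, is_registered):
    # Only the normalized query text and a single boolean are retained.
    log.append(FailedQueryRecord(normalize(raw_query), is_registered))


def tally(log):
    # Aggregate counts per query; nothing in the log ties two queries together.
    return Counter(record.query for record in log)
```

Because each record carries only the query text and one flag, two queries from the same person cannot be linked through the log itself. The query text can of course still be revealing on its own, which is why the opt-in options above remain relevant.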

Of course, privacy isn’t the only concern. The other major concern is spammers. The mechanism I’m proposing would attract spammers like moths to a flame, so the apparent popularity of a search term must be taken with a grain of salt. Associating requests with personal identifiers would probably solve the spam problem, but it is out of the question because of the privacy concerns discussed above. CAPTCHAs might help, but they would pose a high entry barrier that, in practice, would probably discourage most users from making requests.

I propose the following alternative:

  • Trust registered users not to be spammers (but don’t log their names). I don’t know how easy it is for a spammer to register–or to be detected (e.g., because of an implausibly high activity level).
  • Reality-check candidate terms against content, using the Wikipedia corpus, the broader web, or any other available resources. That way, a spammer would have to wage a two-front war on both the query log and the data.
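The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the approach we used at Endeca: the function names, the thresholds min_requests and min_docs, and the naive substring match against an in-memory corpus are all hypothetical simplifications.

```python
from collections import Counter


def document_frequency(term, corpus):
    # Naive reality check: count documents that mention the term verbatim.
    needle = term.lower()
    return sum(1 for doc in corpus if needle in doc.lower())


def suggest_entries(failed_queries, corpus, min_requests=2, min_docs=2):
    """Rank failed queries as candidate entries.

    failed_queries: iterable of (query, from_registered_user) pairs.
    corpus: list of document texts (assumed small enough to scan in memory).
    """
    # Step 1: only count requests that came from registered users.
    counts = Counter(q for q, registered in failed_queries if registered)

    suggestions = []
    for term, requests in counts.most_common():
        if requests < min_requests:
            continue
        # Step 2: keep the term only if independent content also mentions it.
        docs = document_frequency(term, corpus)
        if docs >= min_docs:
            suggestions.append((term, requests, docs))
    return suggestions
```

A term that is requested often but appears nowhere in the corpus is dropped, so a spammer who floods the query log would also have to plant matching content in whatever corpus is used for the check.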

My colleagues and I at Endeca successfully used this last approach at a leading sports programming network (I demoed it at the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU).

I know that there are bigger problems in the world than improving Wikipedia. My heart reaches out to those in Mumbai who are reeling from terrorist attacks. But we must all act within our circle of influence. Wikipedia matters. Scientia potentia est.

2 responses so far ↓

  • 1 Brian Davson // Nov 29, 2008 at 9:08 pm

    I’ve had similar discussions with colleagues. But we are interested in all queries, and similarly I’d love to see the web server logs for Wikipedia (which I can’t seem to find anywhere). Both provide some broad measures of the interests of (some fraction of) the websurfing public, which can be enlightening and empowering. Yes, I understand the potential for abuse (both for privacy and for spam), but I’d rather see them released in some form.

  • 2 Daniel Tunkelang // Dec 6, 2008 at 6:18 pm

    Well, the spamming value is predicated on how the logs are used. It’s the privacy concerns that surely hold back the Wikimedia Foundation from sharing this data. I am certain that the logs could be scrubbed in such a way that they’d provide useful information without disclosing any personally identifiable data. But I can see how the Wikimedia Foundation would be concerned about starting down a slippery slope.
