In the United States, Thanksgiving is a harvest festival in which we express gratitude for our bounty and turn our thoughts towards altruism–at least while we’re not stuffing ourselves with turkey and pumpkin pie.
And that got me thinking about the mother of online altruistic endeavors, Wikipedia. Specifically, I thought about the bounty of information represented in Wikipedia search queries–especially search queries for which no entry exists. Could we somehow harvest these to improve Wikipedia by suggesting new entries?
- Make opt-in explicit on a per-query basis. In fact, Wikipedia already has a request mechanism that shows up precisely when a search query fails.
- Allow users to opt in to logging for all failed queries, while making the trade-off clear: they would avoid extra clicks, but they might forget that they had agreed to contribute all such queries to the log.
I actually suspect that Wikipedia could log all queries by default, as long as there is no personally identifying information associated with the queries. By that, I mean that, at most, the log indicates whether the query was issued by a registered user. But no id (not even an anonymized one), no time stamp, etc. After the AOL scandal, people are understandably paranoid.
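To make that concrete, here is a minimal sketch of what such a log entry might hold. The names `FailedQueryRecord` and `make_record` are hypothetical, invented for illustration; nothing here reflects how Wikipedia actually logs searches.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FailedQueryRecord:
    """One log entry: the query text and a registered/unregistered flag.

    Deliberately absent: user id (even an anonymized one), timestamp, IP.
    """
    query: str
    is_registered: bool


def make_record(query, user_id=None):
    """Build a log entry for a failed search.

    user_id is consulted only to derive the boolean flag; it is never stored.
    """
    return FailedQueryRecord(query=query.strip(), is_registered=user_id is not None)


record = make_record("  faceted search ", user_id="alice")
print(asdict(record))
```

Because the record has no field for an identifier, two entries for the same query cannot be traced back to the same person, which is the point.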
Of course, privacy isn’t the only concern. The other major concern is spammers. The mechanism I’m proposing would attract spammers like moths to a flame, so the apparent popularity of a search term must be taken with a grain of salt. Associating requests with personal identifiers would probably solve the spam problem, but it is out of the question because of the privacy concerns discussed above. CAPTCHAs might help, but they would pose a high entry barrier that, in practice, would probably discourage most users from making requests.
I propose the following alternative:
- Trust registered users not to be spammers (but don’t log their names). I don’t know how easy it is for a spammer to register–or to be detected (e.g., because of an implausibly high activity level).
- Reality-check candidate terms against content, using the Wikipedia corpus, the broader web, or any other available resources. That way, a spammer would have to wage a two-front war on both the query log and the data.
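The two steps above could be combined roughly as follows. This is only a sketch under stated assumptions: the function `suggest_entries`, its thresholds, and the plain substring match against a list of document strings are all hypothetical stand-ins, not a description of any deployed system.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def suggest_entries(failed_queries, corpus, min_requests=2, min_corpus_hits=2):
    """Return candidate terms that are both requested and attested in content.

    failed_queries: iterable of (query_text, is_registered) pairs, with no
                    user id or timestamp attached.
    corpus: iterable of document strings (a hypothetical stand-in for
            Wikipedia text or web pages).
    """
    # Step 1: count only failed queries issued by registered users.
    requests = Counter(
        normalize(query)
        for query, is_registered in failed_queries
        if is_registered
    )
    docs = [normalize(doc) for doc in corpus]

    suggestions = []
    for term, n_requests in requests.items():
        if n_requests < min_requests:
            continue
        # Step 2: reality check, the term must also occur in enough documents.
        hits = sum(1 for doc in docs if term in doc)
        if hits >= min_corpus_hits:
            suggestions.append((term, n_requests, hits))

    # Most requested first; ties broken by corpus support.
    suggestions.sort(key=lambda s: (-s[1], -s[2]))
    return suggestions


failed = [
    ("Faceted search", True),
    ("faceted  search", True),
    ("buy cheap pills", True),
    ("buy cheap pills", True),
    ("buy cheap pills", True),
    ("faceted search", False),
]
corpus = [
    "Faceted search lets users refine results.",
    "Endeca built faceted search products.",
    "Unrelated text.",
]
print(suggest_entries(failed, corpus))
```

In this toy data the spammy query is the most frequent one, yet it is dropped because nothing in the corpus supports it. A spammer would have to pollute the content as well as the query log to get through.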
My colleagues and I at Endeca successfully used this last approach at a leading sports programming network (I demoed it at the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU).
I know that there are bigger problems in the world than improving Wikipedia. My heart reaches out to those in Mumbai who are reeling from terrorist attacks. But we must all act within our circle of influence. Wikipedia matters. Scientia potentia est.