Categories
General

Software Agents and Rationality

Back when I was an undergraduate (yes, a long, long time ago), there was a lot of excitement about software agents, also called intelligent agents. The general idea was that a software agent would be able to pursue goal-directed behavior on a person’s behalf. Of course, what that meant ran the gamut from the mundane (e.g., autodialers) to science fiction (e.g., Brainiac in the Superman comics).

With the increasing role that the web plays in our interactions, I wonder about the role of software agents on the web. We already see comment spammers and prankster instant messaging bots, as well as more benign shopbots.

But a question that plagues me is how to reconcile the inherent rationality of software agents with the systematic irrationality of the human beings they represent. Herb Simon argued that humans exercise bounded rationality, but the research from prospect theory suggests that the situation is even worse: not only are we bounded by our limited mental resources, but we don’t even make the most rational use of the resources we have.

So, if software agents start making decisions on our behalf, I wonder how happy we’ll be with those decisions. Will software agents have to simulate our deviations from rationality? Or will we have to learn to be more rational?

Finally, I should note that machine agents are not restricted to the web or even to software. Just pick up the New York Times, and you can read about attempts to make Terminators a reality. Those efforts raise concerns not only about rationality, but also about ethics and accountability.

I’ll be back.

Categories
General

Beware of Google

According to the generally accepted history of Google, the company’s name originated from a common misspelling of the word “googol”, which refers to 10^100.

But, for folks who spend their nights worrying whether Google is evil, you might want to explore the possibility that its name comes from the horrid monster depicted in V. C. Vickers’s 1913 children’s tale, “The Google Book”.

Don’t worry, I’m not scaring my daughter with tales of Googles and Yahoos.

Categories
Uncategorized

If You Like The Noisy Channel, …

Already missing your Noisy Channel fix? Why don’t you check out some of the blogs I read:

Categories
Uncategorized

Going on Auto-Pilot

I’m spending a week in Akumal without network connectivity. Yes, a real family vacation. No working, no blogging, no reading Techmeme.

But have no fear. I’ve scheduled daily posts in my absence. The Noisy Channel will not go silent! Obviously I won’t be able to participate in the comment threads, and I can only hope that the evil comment spammers won’t use this opportune moment to attack. Meanwhile, I urge you all to take this opportunity to have the last word–at least until I get back!

If you do need to contact the authorities in my absence, I suggest you send a message to Claude Shannon.

Categories
General

Harvesting Knowledge for Wikipedia

In the United States, Thanksgiving is a harvest festival in which we express gratitude for our bounty and turn our thoughts towards altruism–at least while we’re not stuffing ourselves with turkey and pumpkin pie.

And that got me thinking about the mother of online altruistic endeavors, Wikipedia. Specifically, I thought about the bounty of information represented in Wikipedia search queries–especially search queries for which no entry exists. Could we somehow harvest these to improve Wikipedia by suggesting new entries?

Of course, such a proposal immediately raises privacy concerns. It’s clear that any search logging mechanism has to be opt-in and in accordance with the Wikimedia Foundation’s privacy policy that “access to, and retention of, personally identifiable data in all projects should be minimal and should be used only internally to serve the well-being of the projects”. But there are at least two possibilities that I believe would be acceptable to privacy advocates:

  • Make opt-in explicit on a per-query basis. In fact, Wikipedia already has a request mechanism that shows up precisely when a search query fails.
     
  • Allow users to opt in to logging for all failed queries, making it clear that the benefit of avoiding extra clicks would come at the cost that they might forget they had agreed to contribute all such queries to the log.

I actually suspect that Wikipedia could log all queries by default, as long as there is no personally identifying information associated with the queries. By that, I mean that, at most, the log indicates whether the query was issued by a registered user. But no ID (not even an anonymized one), no timestamp, etc. After the AOL scandal, people are understandably paranoid.
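To make that concrete, here is a minimal sketch of what such a stripped-down log record might look like. The field names are purely hypothetical and nothing here is an actual MediaWiki interface; it is only meant to show how little needs to be retained.

```python
# Hypothetical sketch of a privacy-preserving failed-query log entry:
# just the query text and a registered/anonymous flag; no user ID
# (not even an anonymized one), no timestamp, no IP address.
from dataclasses import dataclass

@dataclass
class FailedQueryRecord:
    query: str             # the search text that matched no entry
    from_registered: bool  # issued by a registered account?

def log_failed_query(log: list, query: str, from_registered: bool) -> None:
    """Append a record only when a search produces no matching entry."""
    log.append(FailedQueryRecord(query=query, from_registered=from_registered))
```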

Of course, privacy isn’t the only concern. The other major concern is spammers. The mechanism I’m proposing would attract spammers like moths to a flame, so the apparent popularity of a search term must be taken with a grain of salt. Associating requests with personal identifiers would probably solve the spam problem, but it is out of the question because of the privacy concerns discussed above. CAPTCHAs might help, but they would pose a high entry barrier that, in practice, would probably discourage most users from making requests.

I propose the following alternative:

  • Trust registered users not to be spammers (but don’t log their names). I don’t know how easy it is for a spammer to register–or to be detected (e.g., because of an implausibly high activity level).
  • Reality-check candidate terms against content, using the Wikipedia corpus, the broader web, or any other available resources (see the sketch after this list). That way, a spammer would have to wage a two-front war on both the query log and the data.
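Here is a minimal sketch of that reality check. The function names and thresholds are made up for illustration; any reference corpus, from Wikipedia’s own text to a broader web index, could stand in for the documents.

```python
# Illustrative sketch: a candidate term from the failed-query log is only
# suggested as a new entry if it is both frequently requested and
# independently attested in a reference corpus.
from collections import Counter

def build_term_counts(corpus_documents):
    """Count how many documents mention each lower-cased term."""
    counts = Counter()
    for doc in corpus_documents:
        counts.update(set(doc.lower().split()))
    return counts

def plausible_candidates(failed_query_counts, corpus_counts,
                         min_query_freq=5, min_corpus_docs=3):
    """Keep candidates that pass both the query-log and corpus thresholds."""
    return [term for term, freq in failed_query_counts.items()
            if freq >= min_query_freq
            and corpus_counts.get(term, 0) >= min_corpus_docs]
```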

My colleagues and I at Endeca successfully used this last approach at a leading sports programming network (I demoed it at the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU).

I know that there are bigger problems in the world than improving Wikipedia. My heart goes out to those in Mumbai who are reeling from terrorist attacks. But we must all act within our circle of influence. Wikipedia matters. Scientia potentia est.

Categories
General

Mechanical Turkey

Omar Alonso recently pointed me to work he and his colleagues at A9 did on relevance evaluation using Mechanical Turk. Perhaps anticipating my predilection for wordplay, the authors showed off some of their own:

Relevance evaluation is an essential part of the development and maintenance of information retrieval systems. Yet traditional evaluation approaches have several limitations; in particular, conducting new editorial evaluations of a search system can be very expensive. We describe a new approach to evaluation called TERC, based on the crowdsourcing paradigm, in which many online users, drawn from a large community, each performs a small evaluation task.

Yes, TERC for TREC. In any case, their results show lots to be thankful for:

  • Fast Turnaround. We have uploaded an experiment requiring thousands of judgments and found all the HITs completed in a couple of days. This is generally much faster than an experiment requiring student assessors; even creating and running an online survey can take longer.
  • Low Cost. Many typical tasks, such as judging the relevance of a single query-result pair based on a short summary, are completed for payment of one cent. (Obviously, tasks that require more detailed work require higher payment.) In our example, we could have all our 2500 judgments completed by 5 separate workers for a total cost of $125.
  • High Quality. Although individual performance of workers varies, low cost makes it possible to get several opinions and eliminate the noise. As described in Section 5, there are many ways to improve the quality of the work.
  • Flexibility. The low cost makes it possible to obtain many judgments, and this in turn makes it possible to try many different methods for combining their assessments. (In addition, the general crowdsourcing framework can be used for a variety of other kinds of experiments — surveys, etc.)
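For the curious, the arithmetic behind the “Low Cost” bullet above is a quick back-of-the-envelope calculation, assuming the one-cent-per-judgment rate the authors quote:

```python
# Back-of-the-envelope check of the cost figure quoted above.
judgments = 2500           # query-result pairs to judge
workers_per_judgment = 5   # independent opinions per pair
cost_per_hit = 0.01        # dollars per single judgment (HIT)

total_hits = judgments * workers_per_judgment   # 12,500 HITs
total_cost = total_hits * cost_per_hit          # $125.00
print(f"{total_hits} HITs at ${cost_per_hit:.2f} each = ${total_cost:.2f}")
```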

Other folks, particularly Panos Ipeirotis, have worked extensively with Mechanical Turk in their research. At the risk of political incorrectness, today I’d like to thank these folks for the successful exploitation of digital natives to explore new worlds of research.

Categories
General

When In Doubt, Make It Public

One of my recurring themes has been that we need to get over our loss of privacy. But today, as I was reading Jeff Atwood’s “Is Email = Efail?” post about the inevitability of email bankruptcy, I clicked through to a post of his from April 2007 entitled “When In Doubt, Make It Public” and in turn to a post by Jason Kottke entitled “Public and permanent”.

There I struck gold. Kottke suggests that a way to come up with a new business model is “to take something that everyone does with their friends and make it public and permanent.” Here are his examples:

  • Blogger, 1999. Blog posts = public email messages. Instead of “Dear Bob, Check out this movie.” it’s “Dear People I May or May Not Know Who Are Interested in Film Noir, Check out this movie and if you like it, maybe we can be friends.”
  • Twitter, 2006. Twitter = public IM. I don’t think it’s any coincidence that one of the people responsible for Blogger is also responsible for Twitter.
  • Flickr, 2004. Flickr = public photo sharing. Flickr co-founder Caterina Fake said in a recent interview: “When we started the company, there were dozens of other photosharing companies such as Shutterfly, but on those sites there was no such thing as a public photograph — it didn’t even exist as a concept — so the idea of something ‘public’ changed the whole idea of Flickr.”
  • YouTube, 2005. YouTube = public home videos. Bob Saget was onto something.

It’s a pretty compelling argument. Rather than wasting effort in a losing battle to protect the remnants of our privacy, let’s embrace the efficiency of public conversation.

Categories
Uncategorized

SearchWiki: A Platform for Steganography?

Lauren Weinstein wrote an interesting post today, suggesting that Google’s new SearchWiki feature “provides an interesting platform for the global distribution of secret messages”.

This practice, known as steganography, has been a concern for centuries, but most recently has come up in the context of alleged use by terrorists.

No, I don’t think Google is trying to be evil. Moreover, there are lots of other ways to broadcast steganographically encrypted messages on the web, such as posting comments on unmoderated blogs. But it’s interesting that this is the first “useful” application I’ve seen proposed for SearchWiki.
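For readers unfamiliar with the idea, here is a toy sketch of how a sequence of innocuous-looking comments could carry a hidden message in its first letters. It is purely illustrative (the cover sentences and function names are my own invention, and real steganography is far subtler), not anything Lauren or Google actually described.

```python
# Toy acrostic steganography: hide a short message in the first letters
# of a sequence of otherwise innocuous-looking comments.
COVER_SENTENCES = {
    "h": "Have you seen this result before?",
    "i": "Interesting ranking for this query.",
    # ... a full alphabet of cover sentences in a less lazy toy
}

def encode(message):
    """Emit one cover comment per character of the message."""
    return [COVER_SENTENCES[ch] for ch in message.lower() if ch in COVER_SENTENCES]

def decode(comments):
    """Recover the hidden message from the comments' first letters."""
    return "".join(comment[0].lower() for comment in comments)

print(decode(encode("hi")))  # prints "hi"
```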

Categories
Uncategorized

Semantic Search Wikipedia Entry: Needs Help

I haven’t written a community post in a while, but I thought that, with everyone getting into the Thanksgiving spirit, perhaps someone might be inspired to give to a Wikipedia entry in need. I’m talking about the semantic search entry, which–as the talk page notes–needs work.

As I told Ron Miller in my recent one-on-one with him:

Semantic search means different things to different people, but broadly falls into two categories: using linguistic and statistical approaches to derive meaning from unstructured text, and using semantic web approaches to represent meaning in content and query structure.

Perhaps someone here could reorganize the Wikipedia entry along these lines?

Or, if you don’t feel sufficiently expert on semantic search to rework the content, perhaps you could help despam the entry, following the example of what I did for the enterprise search entry. I moved the vendors to a separate entry and culled vendors that didn’t have their own Wikipedia entries (which is the accepted “notability” standard).

I know that editing Wikipedia entries is a thankless job. But someone has to do it. And if folks like us don’t, then these pages often get overrun by spammers. Think of this as a small contribution to global knowledge management. At the very least, you’ll have my thanks.

Categories
General

Endeca vs. Google, Round 2

OK, it’s not quite Muhammad Ali vs. Joe Frazier or even David vs. Goliath. But, hey, it’s personal, and this is my blog!

A few months ago, I was quoted in a Forbes JargonSpy column, helping to explain why Google isn’t enough for the enterprise. Apparently that hit a nerve, since, shortly afterward, Google Enterprise Search product manager Nitin Mangtani published a sponsored “commentary” on Forbes that some viewed as an advertorial (though Google objected to that characterization).

But the story doesn’t end there. Ron Miller of FierceContentManagement wrote to Mangtani to follow up, and published a Q&A about his reason for publishing the Forbes piece, as well as his rebuttal to Google Search Appliance critics.

While Ron was preparing his questions, he reached out on Twitter to solicit input. I responded, and Ron graciously offered me the same one-on-one treatment, which was published today.

Melodrama aside, I feel that these discussions are useful. Enterprise search has been misunderstood for a long time, and conversations like these at least advance understanding. And hopefully they make for fun reading.