The Noisy Channel

 

Reflecting on AltaVista

December 1st, 2008 · 1 Comment · General

Today is Dec 1, and it seems like an appropriate day to reflect on DEC’s one-time foray into web search: AltaVista. In fact, AltaVista was publicly launched as an internet search engine on December 15, 1995 as altavista.digital.com.

I was an avid AltaVista user, and I was shocked by the rapidity of its demise. Why did AltaVista fail?

According to Don Dodge, former Director of Engineering at AltaVista:

The AltaVista experience is sad to remember. We should have been the “Google” of today. We were pure search, no frills, no consumer portal crap.

DEC is guilty of neglect in its handling of AltaVista. Compaq put a bunch of PC guys in charge who relied on McKinsey consultants and copied AOL, Excite, Yahoo and Lycos into the consumer portal game. It should have been clear that being the 5th or 6th player in the consumer portal business wouldn’t work. AltaVista spent hundreds of millions on acquisitions that never worked, and spent $100M on a brand advertising campaign. They spent NOTHING to improve core search. That was the undoing of AltaVista. (via Greg).

Perhaps. I think that doesn’t give Google enough credit for its key innovation: using link analysis to compute a then unspammable measure of a site’s authority, and then using that authority as a prior for its relevance. Of course, spammers caught up and have engaged Google in an arms race ever since, but the head start was enough for Google to establish its supremacy.
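
To make the idea of link-based authority concrete, here is a minimal sketch in the spirit of PageRank, computed by power iteration over a toy link graph. The graph, the damping factor of 0.85, the function names, and the way authority is multiplied into a relevance score are all illustrative assumptions, not a description of Google's actual system.

```python
def link_authority(links, damping=0.85, iterations=50):
    """Approximate PageRank-style authority by power iteration.

    `links` maps each page to the pages it links to (a hypothetical toy graph).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline, then shares flow along links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its authority evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def score(text_relevance, authority):
    """Combine a query-dependent relevance score with authority as a prior."""
    return text_relevance * authority


toy_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "spam": ["c"],  # nobody links to "spam", so it earns only the baseline
}
ranks = link_authority(toy_links)
```

In this toy graph the page that links out but attracts no inbound links ends up with the lowest authority, which is the property that made the measure hard to spam at first: a site could not raise its own score just by editing its own pages.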

Is there a moral? Surely Dodge is right in condemning DEC’s business strategy. But I am sad to see how web search technology has settled in its current local optimum. So, at the risk of being cliché, I’ll draw the lesson that no technologist can afford to be complacent.


Software Agents and Rationality

November 30th, 2008 · No Comments · General

Back when I was an undergraduate (yes, a long long time ago), there was a lot of excitement about software agents, also called intelligent agents. The general idea was that a software agent would be able to pursue goal-directed behavior on a person’s behalf. Of course, what that meant ran the gamut from the mundane (e.g., autodialers) to science fiction (e.g., Brainiac in the Superman comics).

With the increasing role that the web plays in our interactions, I wonder about the role of software agents on the web. We already see comment spammers and prankster instant messaging bots, as well as more benign shopbots.

But a question that plagues me is how to reconcile the inherent rationality of software agents with the systematic irrationality of the human beings they represent. Herb Simon argued that humans exercise bounded rationality, but the research from prospect theory suggests that the situation is even worse: not only are we bounded by our limited mental resources, but we don’t even make the most rational use of the resources we have.

So, if software agents start making decisions on our behalf, I wonder how happy we’ll be with those decisions. Will software agents have to simulate our deviations from rationality? Or will we have to learn to be more rational?

Finally, I should note that machine agents are not restricted to the web or even to software. Just pick up the New York Times, and you can read about attempts to make Terminators a reality. Those efforts raise concerns not only about rationality, but about ethics and accountability.

I’ll be back.


Beware of Google

November 29th, 2008 · 2 Comments · General

 

According to the generally accepted history of Google, the company’s name originated from a common misspelling of the word “googol”, which refers to 10^100.

But if you spend your nights worrying whether Google is evil, you might want to explore the possibility that its name comes from the horrid monster depicted in V. C. Vickers’s 1913 children’s tale, “The Google Book”.

Don’t worry, I’m not scaring my daughter with tales of Googles and Yahoos.


If You Like The Noisy Channel, …

November 28th, 2008 · No Comments · Quick Bites

Already missing your Noisy Channel fix? Why don’t you check out some of the blogs I read:


Going on Auto-Pilot

November 28th, 2008 · No Comments · Noise

I’m spending a week in Akumal without network connectivity. Yes, a real family vacation. No working, no blogging, no reading Techmeme.

But have no fear. I’ve scheduled daily posts in my absence. The Noisy Channel will not go silent! Obviously I won’t be able to participate in the comment threads, and I can only hope that the evil comment spammers won’t use this opportune moment to attack. Meanwhile, I urge you all to take this opportunity to have the last word–at least until I get back!

If you do need to contact the authorities in my absence, I suggest you send a message to Claude Shannon.


Harvesting Knowledge for Wikipedia

November 27th, 2008 · 1 Comment · Community, General

In the United States, Thanksgiving is a harvest festival in which we express gratitude for our bounty and turn our thoughts towards altruism–at least while we’re not stuffing ourselves with turkey and pumpkin pie.

And that got me thinking about the mother of online altruistic endeavors, Wikipedia. Specifically, I thought about the bounty of information represented in Wikipedia search queries–especially search queries for which no entry exists. Could we somehow harvest these to improve Wikipedia by suggesting new entries?

Of course, such a proposal immediately raises privacy concerns. It’s clear that any search logging mechanism has to be opt-in and in accordance with the Wikimedia Foundation’s privacy policy that “access to, and retention of, personally identifiable data in all projects should be minimal and should be used only internally to serve the well-being of the projects”. But there are at least two possibilities that I believe would be acceptable to privacy advocates:

  • Make opt-in explicit on a per-query basis. In fact, Wikipedia already has a request mechanism that shows up precisely when a search query fails.
     
  • Allow users to opt in to logging for all failed queries, making it clear that the benefit of avoiding extra clicks would come at the cost that they might forget they had agreed to contribute all such queries to the log.

I actually suspect that Wikipedia could log all queries by default, as long as there is no personally identifying information associated with the queries. By that, I mean that, at most, the log indicates whether the query was issued by a registered user. But no id (not even an anonymized one), no time stamp, etc. After the AOL scandal, people are understandably paranoid.

Of course, privacy isn’t the only concern. The other major concern is spammers. The mechanism I’m proposing would attract spammers like moths to a flame, so the apparent popularity of a search term must be taken with a grain of salt. Associating requests with personal identifiers would probably solve the spam problem, but it is out of the question because of the privacy concerns discussed above. CAPTCHAs might help, but they would pose a high entry barrier that, in practice, would probably discourage most users from making requests.

I propose the following alternative:

  • Trust registered users not to be spammers (but don’t log their names). I don’t know how easy it is for a spammer to register–or to be detected (e.g., because of an implausibly high activity level).
  • Reality-check candidate terms against content, using the Wikipedia corpus, the broader web, or any other available resources. That way, a spammer would have to wage a two-front war on both the query log and the data.
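
The reality check in the second bullet can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the thresholds, and the sample data are hypothetical, and plain substring matching stands in for whatever content analysis a real system would use. Note that the log holds bare query strings only, with no user ids or time stamps.

```python
from collections import Counter


def suggest_entries(failed_queries, existing_titles, corpus_docs,
                    min_requests=2, min_docs=2):
    """Suggest new entries from failed queries, reality-checked against content.

    All parameter names and thresholds here are hypothetical.
    """
    normalized = [q.strip().lower() for q in failed_queries if q.strip()]
    counts = Counter(normalized)
    existing = {title.lower() for title in existing_titles}
    docs = [doc.lower() for doc in corpus_docs]

    suggestions = []
    for term, requests in counts.most_common():
        if term in existing or requests < min_requests:
            continue
        # Reality check: the term must also appear in independent content,
        # so a spammer has to attack both the query log and the corpus.
        doc_support = sum(1 for doc in docs if term in doc)
        if doc_support >= min_docs:
            suggestions.append((term, requests, doc_support))
    return suggestions


failed = [
    "faceted search", "Faceted Search", "faceted search ",
    "buy cheap pills", "buy cheap pills", "buy cheap pills",
    "information retrieval",
]
titles = ["Information retrieval"]
docs = [
    "Faceted search lets users refine results.",
    "Many sites use faceted search for navigation.",
    "Information retrieval is an old field.",
]
result = suggest_entries(failed, titles, docs)
```

Here the spam query is requested as often as the legitimate one, but it has no support in the content, so only the legitimate term survives as a suggestion.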

My colleagues and I at Endeca successfully used this last approach at a leading sports programming network (I demoed it at the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU).

I know that there are bigger problems in the world than improving Wikipedia. My heart reaches out to those in Mumbai who are reeling from terrorist attacks. But we must all act within our circle of influence. Wikipedia matters. Scientia potentia est.


Mechanical Turkey

November 27th, 2008 · 2 Comments · General

Omar Alonso recently pointed me to work he and his colleagues at A9 did on relevance evaluation using Mechanical Turk. Perhaps anticipating my predilection for wordplay, the authors showed off some of their own:

Relevance evaluation is an essential part of the development and maintenance of information retrieval systems. Yet traditional evaluation approaches have several limitations; in particular, conducting new editorial evaluations of a search system can be very expensive. We describe a new approach to evaluation called TERC, based on the crowdsourcing paradigm, in which many online users, drawn from a large community, each performs a small evaluation task.

Yes, TERC for TREC. In any case, their results show lots to be thankful for:

  • Fast Turnaround. We have uploaded an experiment requiring thousands of judgments and found all the HITs completed in a couple of days. This is generally much faster than an experiment requiring student assessors; even creating and running an online survey can take longer.
  • Low Cost. Many typical tasks, such as judging the relevance of a single query-result pair based on a short summary, are completed for payment of one cent. (Obviously, tasks that require more detailed work require higher payment.) In our example, we could have all our 2500 judgments completed by 5 separate workers for a total cost of $125.
  • High Quality. Although individual performance of workers varies, low cost makes it possible to get several opinions and eliminate the noise. As described in Section 5, there are many ways to improve the quality of the work.
  • Flexibility. The low cost makes it possible to obtain many judgments, and this in turn makes it possible to try many different methods for combining their assessments. (In addition, the general crowdsourcing framework can be used for a variety of other kinds of experiments — surveys, etc.)
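
The cost figure and the "several opinions" idea quoted above can be checked with a short sketch. The function names are hypothetical, and majority voting is just one simple way to combine redundant judgments; it is not necessarily the method the authors describe in their Section 5.

```python
from collections import Counter


def total_cost_cents(num_items, workers_per_item, cents_per_judgment):
    """Total payment in cents; integers avoid floating-point rounding."""
    return num_items * workers_per_item * cents_per_judgment


def majority_label(labels):
    """Return the most common label among one item's redundant judgments."""
    return Counter(labels).most_common(1)[0][0]


# 2500 judgments, 5 workers each, one cent per judgment.
cost = total_cost_cents(2500, 5, 1)

# Five hypothetical workers judging one query-result pair.
votes = ["relevant", "relevant", "not relevant", "relevant", "not relevant"]
consensus = majority_label(votes)
```

With these numbers the total is 12,500 cents, which matches the $125 quoted above. Using an odd number of workers on a two-way judgment also means a vote can never tie.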

Other folks, particularly Panos Ipeirotis, have worked extensively with Mechanical Turk in their research. At the risk of political incorrectness, today I’d like to thank these folks for the successful exploitation of digital natives to explore new worlds of research.


When In Doubt, Make It Public

November 27th, 2008 · 8 Comments · General

One of my recurring themes has been that we need to get over our loss of privacy. But today, as I was reading Jeff Atwood’s “Is Email = Efail?” post about the inevitability of email bankruptcy, I clicked through to a post of his from April 2007 entitled “When In Doubt, Make It Public” and in turn to a post by Jason Kottke entitled “Public and permanent”.

There I struck gold. Kottke suggests that a way to come up with a new business model, or to choose web projects, “is to take something that everyone does with their friends and make it public and permanent.” Here are his examples:

  • Blogger, 1999. Blog posts = public email messages. Instead of “Dear Bob, Check out this movie.” it’s “Dear People I May or May Not Know Who Are Interested in Film Noir, Check out this movie and if you like it, maybe we can be friends.”
  • Twitter, 2006. Twitter = public IM. I don’t think it’s any coincidence that one of the people responsible for Blogger is also responsible for Twitter.
  • Flickr, 2004. Flickr = public photo sharing. Flickr co-founder Caterina Fake said in a recent interview: “When we started the company, there were dozens of other photosharing companies such as Shutterfly, but on those sites there was no such thing as a public photograph — it didn’t even exist as a concept — so the idea of something ‘public’ changed the whole idea of Flickr.”
  • YouTube, 2005. YouTube = public home videos. Bob Saget was onto something.

It’s a pretty compelling argument. Rather than wasting effort in a losing battle to protect the remnants of our privacy, let’s embrace the efficiency of public conversation.


SearchWiki: A Platform for Steganography?

November 26th, 2008 · 4 Comments · Quick Bites

Lauren Weinstein wrote an interesting post today, suggesting that Google’s new SearchWiki feature “provides an interesting platform for the global distribution of secret messages”.

This practice, known as steganography, has been a concern for centuries, but most recently has come up in the context of alleged use by terrorists.

No, I don’t think Google is trying to be evil. Moreover, there are lots of other ways to broadcast steganographically hidden messages on the web, such as posting comments on unmoderated blogs. But it’s interesting that this is the first “useful” application I’ve seen proposed for SearchWiki.


Semantic Search Wikipedia Entry: Needs Help

November 26th, 2008 · No Comments · Community

I haven’t written a community post in a while, but I thought that, with everyone getting into the Thanksgiving spirit, perhaps someone might be inspired to give to a Wikipedia entry in need. I’m talking about the semantic search entry, which–as the talk page notes–needs work.

As I told Ron Miller in my recent one-on-one with him:

Semantic search means different things to different people, but broadly falls into two categories: using linguistic and statistical approaches to derive meaning from unstructured text, and using semantic web approaches to represent meaning in content and query structure.

Perhaps someone here could reorganize the Wikipedia entry along these lines?

Or, if you don’t feel sufficiently expert on semantic search to rework the content, perhaps you could help despam the entry, following the example of what I did for the enterprise search entry. I moved the vendors to a separate entry and culled vendors that didn’t have their own Wikipedia entries (which is the accepted “notability” standard).

I know that editing Wikipedia entries is a thankless job. But someone has to do it. And if folks like us don’t, then these pages are often overrun by spammers. Think of this as a small contribution to global knowledge management. At the very least, you’ll have my thanks.

