
What Is (Not) Search?

I had a conversation the other day that raised a conundrum: what is *not* search? What do I mean by that? Well, as Stephen Arnold points out in a recent post, “search” can be anything from a “find-a-phone number problem” to a “glittering generality” that encompasses end-to-end information processing.

Language is imperfect, so is it really that important to define what is and isn’t “search”? It certainly matters when you’re trying to sell search technology! But, more importantly, we need some shared understanding in order to make progress.

At the very least, I propose that we distinguish “search” as a problem from “search” as a solution. By the former, I mean the problem of information seeking, which is traditionally the domain of library and information scientists. By the latter, I mean the approach most commonly associated with information retrieval, in which a user enters a query into the system (typically as free text) and the system returns a set of objects that match the query, perhaps with different degrees of relevance.
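
To make that second sense concrete, here is a minimal sketch of the query-in, ranked-objects-out model. The scoring is deliberately crude and purely illustrative; real systems put tokenization, ranking models, and much more behind this simple contract.

```python
from collections import Counter

def search(query, documents):
    """Toy retrieval: free-text query in, matching objects out,
    ranked by a crude relevance score (query-term occurrences)."""
    query_terms = set(query.lower().split())
    results = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

docs = {
    "a": "information retrieval systems rank documents by relevance",
    "b": "library science studies information seeking as a human task",
}
print(search("information retrieval", docs))  # [('a', 2), ('b', 1)]
```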

Beyond that, we need to recognize that search exists within the context of tasks. It is easy to lump together every task that involves information seeking as “search”, but doing so oversimplifies a complex landscape of activities and needs. I believe we are headed for a world where end users think about tasks rather than about the search activities that form part of those tasks. In that world, search technologists provide infrastructure, not the end-user destination.


Reflecting on AltaVista

Today is December 1, and it seems like an appropriate day to reflect on DEC’s one-time foray into web search: AltaVista. In fact, AltaVista was publicly launched as an internet search engine on December 15, 1995, as altavista.digital.com.

I was an avid AltaVista user, and I was shocked by the rapidity of its demise. Why did AltaVista fail?

According to Don Dodge, former Director of Engineering at AltaVista:

The AltaVista experience is sad to remember. We should have been the “Google” of today. We were pure search, no frills, no consumer portal crap.

DEC is guilty of neglect in its handling of AltaVista. Compaq put a bunch of PC guys in charge who relied on McKinsey consultants and copied AOL, Excite, Yahoo and Lycos into the consumer portal game. It should have been clear that being the 5th or 6th player in the consumer portal business wouldn’t work. AltaVista spent hundreds of millions on acquisitions that never worked, and spent $100M on a brand advertising campaign. They spent NOTHING to improve core search. That was the undoing of AltaVista. (via Greg).

Perhaps. I think that doesn’t give Google enough credit for its key innovation: using link analysis to compute a then-unspammable measure of a site’s authority, and then using that authority as a prior for its relevance. Of course, spammers caught up and have engaged Google in an arms race ever since, but the head start was enough for Google to establish its supremacy.
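
For the curious, here is a minimal sketch of that kind of link analysis: a toy power-iteration PageRank, with the damping factor and iteration count chosen purely for illustration. The resulting authority score is query-independent, which is exactly what lets it serve as a prior to combine with query-dependent relevance.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank: a page's authority is a damped sum of the
    authority passed along by the pages that link to it."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in links}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Hypothetical three-page web in which authority concentrates on "a".
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))
```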

Is there a moral? Surely Dodge is right in condemning DEC’s business strategy. But I am sad to see how web search technology has settled into its current local optimum. So, at the risk of sounding clichéd, I’ll draw the lesson that no technologist can afford to be complacent.


Software Agents and Rationality

Back when I was an undergraduate (yes, a long long time ago), there was a lot of excitement about software agents, also called intelligent agents. The general idea was that a software agent would be able to pursue goal-directed behavior on a person’s behalf. Of course, what that meant ran the gamut from the mundane (e.g., autodialers) to science fiction (e.g., Brainiac in the Superman comics).

With the increasing role that the web plays in our interactions, I wonder about the role of software agents on the web. We already see comment spammers and prankster instant messaging bots, as well as more benign shopbots.

But a question that plagues me is how to reconcile the inherent rationality of software agents with the systematic irrationality of the human beings they represent. Herb Simon argued that humans exercise bounded rationality, but the research from prospect theory suggests that the situation is even worse: not only are we bounded by our limited mental resources, but we don’t even make the most rational use of the resources we have.

So, if software agents start making decisions on our behalf, I wonder how happy we’ll be with those decisions. Will software agents have to simulate our deviations from rationality? Or will we have to learn to be more rational?

Finally, I should note that machine agents are not restricted to the web or even to software. Just pick up the New York Times, and you can read about attempts to make Terminators a reality. Those efforts raise concerns not only about rationality, but also about ethics and accountability.

I’ll be back.


Beware of Google

According to the generally accepted history of Google, the company’s name originated from a common misspelling of the word “googol”, which refers to 10^100.

But, for folks who spend their nights worrying whether Google is evil, you might want to explore the possibility that its name comes from the horrid monster depicted in V. C. Vickers’s 1913 children’s tale, “The Google Book”.

Don’t worry, I’m not scaring my daughter with tales of Googles and Yahoos.


Harvesting Knowledge for Wikipedia

In the United States, Thanksgiving is a harvest festival in which we express gratitude for our bounty and turn our thoughts towards altruism–at least while we’re not stuffing ourselves with turkey and pumpkin pie.

And that got me thinking about the mother of online altruistic endeavors, Wikipedia. Specifically, I thought about the bounty of information represented in Wikipedia search queries–especially search queries for which no entry exists. Could we somehow harvest these to improve Wikipedia by suggesting new entries?

Of course, such a proposal immediately raises privacy concerns. It’s clear that any search logging mechanism has to be opt-in and in accordance with the Wikimedia Foundation’s privacy policy that “access to, and retention of, personally identifiable data in all projects should be minimal and should be used only internally to serve the well-being of the projects”. But there are at least two possibilities that I believe would be acceptable to privacy advocates:

  • Make opt-in explicit on a per-query basis. In fact, Wikipedia already has a request mechanism that shows up precisely when a search query fails.
     
  • Allow users to opt in to logging for all failed queries, making it clear that the convenience of avoiding extra clicks comes at a cost: they may forget they have agreed to contribute every failed query to the log.

I actually suspect that Wikipedia could log all queries by default, as long as there is no personally identifying information associated with the queries. By that, I mean that, at most, the log indicates whether the query was issued by a registered user: no ID (not even an anonymized one), no time stamp, and so on. After the AOL scandal, people are understandably paranoid.

Of course, privacy isn’t the only concern. The other major concern is spammers. The mechanism I’m proposing would attract spammers like moths to a flame, so the apparent popularity of a search term must be taken with a grain of salt. Associating requests with personal identifiers would probably solve the spam problem, but it is out of the question because of the privacy concerns discussed above. CAPTCHAs might help, but they would pose a high entry barrier that, in practice, would probably discourage most users from making requests.

I propose the following alternative:

  • Trust registered users not to be spammers (but don’t log their names). I don’t know how easy it is for a spammer to register–or to be detected (e.g., because of an implausibly high activity level).
  • Reality-check candidate terms against content, using the Wikipedia corpus, the broader web, or any other available resources. That way, a spammer would have to wage a two-front war on both the query log and the data. (A sketch of this combination follows the list.)
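
Here is a minimal sketch of that combination, assuming nothing more than a list of failed query strings (already stripped of any identifying information) and some corpus text to check against. The request threshold and substring matching are stand-ins for real normalization and evidence gathering.

```python
from collections import Counter

def suggest_entries(failed_queries, corpus_text, min_requests=5):
    """Rank failed queries as candidate new entries, requiring
    corpus evidence so spammers must game both log and content."""
    demand = Counter(q.strip().lower() for q in failed_queries)
    corpus = corpus_text.lower()
    candidates = [
        (query, count)
        for query, count in demand.items()
        # Popularity alone is easy to spam; demand corroboration too.
        if count >= min_requests and query in corpus
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```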

My colleagues and I at Endeca successfully used this last approach at a leading sports programming network (I demoed it at the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU).

I know that there are bigger problems in the world than improving Wikipedia. My heart goes out to those in Mumbai who are reeling from terrorist attacks. But we must all act within our circle of influence. Wikipedia matters. Scientia potentia est.


Mechanical Turkey

Omar Alonso recently pointed me to work he and his colleagues at A9 did on relevance evaluation using Mechanical Turk. Perhaps anticipating my predilection for wordplay, the authors showed off some of their own:

Relevance evaluation is an essential part of the development and maintenance of information retrieval systems. Yet traditional evaluation approaches have several limitations; in particular, conducting new editorial evaluations of a search system can be very expensive. We describe a new approach to evaluation called TERC, based on the crowdsourcing paradigm, in which many online users, drawn from a large community, each performs a small evaluation task.

Yes, TERC for TREC. In any case, their results show lots to be thankful for:

  • Fast Turnaround. We have uploaded an experiment requiring thousands of judgments and found all the HITs completed in a couple of days. This is generally much faster than an experiment requiring student assessors; even creating and running an online survey can take longer.
  • Low Cost. Many typical tasks, such as judging the relevance of a single query-result pair based on a short summary, are completed for payment of one cent. (Obviously, tasks that require more detailed work require higher payment.) In our example, we could have all our 2500 judgments completed by 5 separate workers for a total cost of $125.
  • High Quality. Although individual performance of workers varies, low cost makes it possible to get several opinions and eliminate the noise. As described in Section 5, there are many ways to improve the quality of the work.
  • Flexibility. The low cost makes it possible to obtain many judgments, and this in turn makes it possible to try many different methods for combining their assessments. (In addition, the general crowdsourcing framework can be used for a variety of other kinds of experiments — surveys, etc.)
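
The “High Quality” point rests on redundancy: buy several cheap judgments for each query-result pair and vote. Here is a minimal sketch of that aggregation, using a simple majority (studies like this often go further and weight workers by their measured accuracy):

```python
from collections import Counter

def aggregate_judgments(judgments):
    """Majority vote over redundant relevance labels.
    judgments maps (query, result) -> list of worker labels."""
    consensus = {}
    for pair, labels in judgments.items():
        label, votes = Counter(labels).most_common(1)[0]
        consensus[pair] = (label, votes / len(labels))  # label, agreement
    return consensus

votes = {("jaguar", "doc1"):
         ["relevant", "relevant", "not relevant", "relevant", "relevant"]}
print(aggregate_judgments(votes))  # {('jaguar', 'doc1'): ('relevant', 0.8)}
```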

Other folks, particularly Panos Ipeirotis, have worked extensively with Mechanical Turk in their research. At the risk of political incorrectness, today, I’d like to thank these folks for the successful exploitation of digital natives to explore new worlds of research.


When In Doubt, Make It Public

One of my recurring themes has been that we need to get over our loss of privacy. But today, as I was reading Jeff Atwood’s “Is Email = Efail?” post about the inevitability of email bankruptcy, I clicked through to a post of his from April 2007 entitled “When In Doubt, Make It Public” and in turn to a post by Jason Kottke entitled “Public and permanent”.

There I struck gold. Kottke suggests that a way to come up with a new business model is to choose web projects that “take something that everyone does with their friends and make it public and permanent.” Here are his examples:

  • Blogger, 1999. Blog posts = public email messages. Instead of “Dear Bob, Check out this movie.” it’s “Dear People I May or May Not Know Who Are Interested in Film Noir, Check out this movie and if you like it, maybe we can be friends.”
  • Twitter, 2006. Twitter = public IM. I don’t think it’s any coincidence that one of the people responsible for Blogger is also responsible for Twitter.
  • Flickr, 2004. Flickr = public photo sharing. Flickr co-founder Caterina Fake said in a recent interview: “When we started the company, there were dozens of other photosharing companies such as Shutterfly, but on those sites there was no such thing as a public photograph — it didn’t even exist as a concept — so the idea of something ‘public’ changed the whole idea of Flickr.”
  • YouTube, 2005. YouTube = public home videos. Bob Saget was onto something.

It’s a pretty compelling argument. Rather than wasting effort in a losing battle to protect the remnants of our privacy, let’s embrace the efficiency of public conversation.


Endeca vs. Google, Round 2

OK, it’s not quite Muhammad Ali vs. Joe Frazier or even David vs. Goliath. But, hey, it’s personal, and this is my blog!

A few months ago, I was quoted in a Forbes JargonSpy column, helping to explain why Google isn’t enough for the enterprise. Apparently that hit a nerve, since, shortly afterward, Google Enterprise Search product manager Nitin Mangtani published a sponsored “commentary” on Forbes that some viewed as an advertorial (though Google objected to that characterization).

But the story doesn’t end there. Ron Miller of FierceContentManagement wrote to Mangtani to follow up, and published a Q&A about his reason for publishing the Forbes piece, as well as his rebuttal to Google Search Appliance critics.

While Ron was preparing his questions, he reached out on Twitter to solicit input. I responded, and Ron graciously offered me the same one-on-one treatment, which was published today.

Melodrama aside, I feel that these discussions are useful. Enterprise search has been misunderstood for a long time, and conversations like these at least advance understanding. And hopefully they make for fun reading.


Browser Wars: The 2008 Edition

Fresh from reporting why he switched from Firefox to Chrome, CNET’s Stephen Shankland reports that Chrome has a larger market share than he expected.

For comparison, here are the stats for The Noisy Channel, based on the last 30 days (note that the stats don’t reflect users reading the blog through RSS readers):

  • Firefox: 58.4%
  • Internet Explorer: 19.2%
  • Safari: 9.9%
  • Chrome: 6.8%
  • Mobile (assorted): 3.0%
  • Opera: 1.6%
  • Mozilla 1.x: 1.1%
  • Konqueror: 0.1%

Not quite the same mix as Shankland is seeing at CNET, but Chrome’s share is respectable.

Note: the Chrome market share may be slightly skewed by my using Chrome to post, since I’ve found it handles my WordPress web client better than Firefox does. I’m still faithful to Firefox for everything else. As someone posted recently, no Adblock = no Chrome.

Even so, I’m sure that doesn’t account for more than 1% of traffic. A noticeable minority of Noisy Channel readers are giving Chrome a chance.


Ephemeral Conversation Is Dying

Bruce Schneier had a column in the Wall Street Journal a few days ago entitled “Why Obama Should Keep His BlackBerry – But Won’t”. He uses Obama’s BlackBerry dilemma to make the broader point that we’ve moved from an assumption of privacy to a world where everything is recorded. His argument is that, rather than trying to turn back the clock, we need to adjust our legal and cultural norms to the reality of our digital trails. (via Vincent Gable’s comment at whydoeseverythingsuck.com)

I’m reminded of these thoughts from Danah Boyd’s Master’s Thesis on managing identity in a digital world:

Although it may seem advantageous to have historical archives of social interactions, these archives take the interactions out of the situational context in which they were located. For example, by using a search engine to access Usenet, people are able to glimpse at messages removed from the conversational thread. Even with the complete archive, one is reading a historical document of a conversation without being aware of the temporal aspect of the situation. As such, archived data presents a different image to a viewer who is accessing it out of the context in which it was created.

Digital archives allow for situational context to collapse with ease. Just as people can access the information without the full context, they can search for information which, when presented, suggests that two different bits of information are related. For example, by searching for an individual’s name, a user can acquire a glimpse at the individual’s digital presentation across many different situations without seeing any of this in context. In effect, digital tools place massive details at one’s fingertips, thereby enabling anyone to have immediate access to all libraries, public records and other such data. While advantageous for those seeking information, this provides new challenges for those producing sociable data. Although the web is inherently public, people have a notion that they are only performing to a given context at a given time. Additionally, they are accustomed to having control over the data that they provide to strangers. Thus, people must learn to adjust their presentation with the understanding that search engines can collapse any data at any period of time.

And Danah wrote those words seven years ago, before Facebook and Twitter invented micro-blogging and inspired people to voluntarily live in virtual fishbowls. I’ve blogged about the end of privacy through difficulty, but it seems we’re heading in a direction of no privacy at all. It will be interesting to see how the next generation frames this discussion.