Categories
General

Disincenting Spam

Greg called my attention today to news that Digg is shifting from popularity-based aggregation to personalized news. I can’t say I’m thrilled at the prospect of a system that “would make guesses about what [users] like based on information mined from the giant demographic veins of social networks”. I don’t suppose the results are necessarily worse than showing users stories based solely on their popularity, but at least the latter offers me some transparency.

But it was an older post Greg pointed to that caught my attention: “Combating web spam with personalization”. Here is his argument in a nutshell:

Personalized search shows different search results to different people based on their history and their interests. Not only does this increase the relevance of the search results, but also it makes the search results harder to spam.

In this 2006 post, Greg is specifically referring to the personalized search that Google was beta testing back in 2004. Google has since implemented personalized search, but without sharing much detail about how it works.
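Since Google has not published how its personalization works, the following is only a toy sketch of the argument itself, not of any real system. It assumes a hypothetical scoring scheme that blends a global score (the thing a spammer can game) with a per-user topic affinity. All names, weights, and data are invented for illustration.

```python
from typing import Dict, List


def personalized_rank(global_scores: Dict[str, float],
                      doc_topics: Dict[str, str],
                      user_interests: Dict[str, float],
                      weight: float = 0.5) -> List[str]:
    """Rank documents by a blend of global score and per-user topic affinity.

    `weight` is the (assumed) share of the final score that comes from the
    user's own interests; weight=0.0 reduces to a purely global ranking.
    """
    def score(doc: str) -> float:
        affinity = user_interests.get(doc_topics[doc], 0.0)
        return (1 - weight) * global_scores[doc] + weight * affinity

    return sorted(global_scores, key=score, reverse=True)


# Hypothetical data: "spam-page" has gamed its way to the best global score.
global_scores = {"spam-page": 0.9, "ir-paper": 0.6, "recipe": 0.6}
doc_topics = {"spam-page": "pills", "ir-paper": "search", "recipe": "cooking"}

researcher = {"search": 1.0}
cook = {"cooking": 1.0}

unpersonalized = personalized_rank(global_scores, doc_topics, {}, weight=0.0)
for_researcher = personalized_rank(global_scores, doc_topics, researcher)
for_cook = personalized_rank(global_scores, doc_topics, cook)
```

In this sketch the spam page tops the unpersonalized ranking but tops neither personalized one. To win everywhere, a spammer would have to match many different users' profiles rather than a single global signal.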

Nonetheless, Greg’s argument reminds me of one of the first posts I wrote on this blog. I was criticizing Google’s approach of keeping its relevance approach secret and particularly the argument that Amit Singhal has advanced to justify it–that the subjectivity of relevance makes it harder to develop an open approach to relevance. My response: “the subjectivity of relevance should make the adversarial problem easier rather than harder, as has been observed in the security industry”. 

I suppose personalization can help fight spam even if it is not coupled with transparency to the user. But what a great opportunity to do both by providing more user control over the information seeking process.

Categories
General

Google Exec Udi Manber: In-House Search is “Not That Good”

On Friday, David Needle of InternetNews published an article with the provocative title, “Google Exec Disses Google’s In-House Search”. The essence of the article: Udi Manber, the Google VP of Engineering who is responsible for core search, evaluated Google’s internal search tools less than enthusiastically, saying “It’s not that good — I’m complaining about it”.

The article is a bit short on details. It quotes Nitin Mangtani describing recent updates to the Google Search Appliance to enable clustering of search results. But the most telling snippet is towards the end of the article, when Manber expresses his views on user interfaces:

While the search giant is constantly tinkering with new user interfaces, Manber said the simplicity of its standard, bare bones design remains tough to beat.

“Google has been very successful by being very minimal,” Manber said. “We’re doing hundreds of experiments with user interfaces; I see two to three new ones everyday.”

He added that Google might offer users the option of different views on its main search page, similar to the way it does so already on its personalized iGoogle page.

“Otherwise, I expect very incremental changes.” He said advanced users appreciate things like 3D and interfaces that offer more detailed views, but for the vast majority, “what happens now works. You type in two words, click and you’re done. You can’t beat that.”

To borrow a popular political slogan, yes we can. In fact, as IDC analyst Sue Feldman (also quoted in the article) said the other day, “One of the problems we have with search is that people ask such lousy questions…anytime tools hand people clues, it helps.”

Google’s success on the consumer web affords them the luxury of hiding their heads in the sand when it comes to enterprise information access. And I understand the appeal that Google’s computer scientists (and others) feel in approaching the information seeking problem as one of optimizing relevance ranking.

It’s not just Manber. Here are some quotes from Google Enterprise Product Manager Cyrus Mistry at a recent presentation:

  • “[the ideal search engine] knows exactly what you meant, gives you exactly what you want.”
  • “If you think tagging is the way to go, good luck. See me in 10 years.”
  • “We’ll decide where to show it” (an explanation of the value proposition of Google’s universal search, which blends results from multiple sources into a single ranking)

It’s easy for Google to be cocky when they’re making $1.35B in quarterly profits. But it doesn’t make them right, especially when it comes to an area that accounts for about 1% of their business. Mind reading may not be impossible; some of my colleagues at CMU are working on it as we speak.

In the meantime, the only practical means our systems have for determining user intent is their input. And, as has been widely reported, the average search query contains 1.7 words. Perhaps the entropy of web search makes it possible to reliably infer intent from such a small signal. But enterprise search–which is to say information seeking in the enterprise–is harder.

At Endeca, we use our own technology in house. Our solution isn’t perfect, and we’re constantly working to improve it. But, most importantly, we’re going after the right problem. To respond to Mistry’s comments:

  • The information access tool does not presume to know exactly what you meant or what you want, but instead works with you to establish this understanding through dialogue.
  • Tagging can be a very effective way to bring in human expertise, especially when it is distributed across a broad population of users. But the tagging mechanism has to be easy for users, and the system needs to be smart about extrapolating from those tags to fill in the gaps.
  • Often the best way to present diverse results is not by blending them into a single ranking but rather exposing that diversity to users in the form of a progressive refinement dialogue.
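To make the last point concrete, here is a minimal sketch of a progressive refinement step. It is not Endeca's implementation. The record fields (`type`, `dept`) and the helper names `facet_counts` and `refine` are hypothetical, chosen only to show the idea of surfacing the diversity of a result set and letting the user narrow it.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]


def facet_counts(records: List[Record], facet: str) -> Counter:
    """Count how many records fall under each value of a facet."""
    return Counter(r[facet] for r in records if facet in r)


def refine(records: List[Record], facet: str, value: str) -> List[Record]:
    """Narrow the result set to records with the chosen facet value."""
    return [r for r in records if r.get(facet) == value]


# Hypothetical result set for a short, ambiguous query such as "sales".
results = [
    {"title": "Q3 sales deck", "type": "presentation", "dept": "sales"},
    {"title": "Sales wiki", "type": "wiki", "dept": "sales"},
    {"title": "Search roadmap", "type": "presentation", "dept": "engineering"},
]

# Step 1: show the user what kinds of results exist instead of one ranking.
by_type = facet_counts(results, "type")

# Step 2: the user picks a value; the system narrows and offers the next choice.
narrowed = refine(results, "type", "presentation")
by_dept = facet_counts(narrowed, "dept")
```

Each round hands the user clues about what is in the result set, which is the point of the dialogue: the system does not have to guess intent from a two-word query.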

Google aspires to “organize the world’s information” but admits that its approach falls short when it comes to organizing the information inside its own firewall. I commend Manber for his candor. But I hope he and his fellow Googlers take the next step and recognize that they have to think outside the search box.

    Categories
    Uncategorized

    Advertisers are Irrational

    I’ve learned a lot from what Herb Simon, Danny Kahneman, and others tell us about the fallacy of assuming that human behavior conforms to unbounded–or even bounded–rationality. But it’s always nice to see reminders in real-world scenarios, especially ones where real money is at stake.

    If you enjoy this topic, I recommend Greg Linden’s post: “Are advertisers rational?”. Or, if you’re up for it, read the original paper by Jason Auerbach, Joel Galenson, and Mukund Sundararajan: “An Empirical Analysis of Return on Investment Maximization in Sponsored Search Auctions”.

    Categories
    Uncategorized

    Blogs I Read: Jeff’s Search Engine Caffe

    One of the great things about blogging is that its public nature helps keep me honest. For all that I talk about “give to get,” I could do a bit more of it myself. One way I’d like to try is by adding a new category of posts called Blogs I Read to talk about other blogs that appeal to me and, I hope, to readers here at The Noisy Channel.

    To inaugurate this series, I’m starting with Jeff’s Search Engine Caffe, published by Jeff Dalton. Jeff is a grad student in the PhD program at UMass Amherst’s Center for Intelligent Information Retrieval. He’s a bit more practically minded than your average PhD student in information retrieval, perhaps owing to his previous experience as a software engineer at Globalspec, where he worked on vertical search for engineering and manufacturing.

    It’s thanks to Jeff that I’m blogging in the first place. I first met Jeff at SIGIR 2006 in Seattle, but it was at ECIR 2008 in Glasgow that he persuaded me to start a blog. Moreover, his advertising my blog on his own was a key factor in helping me build up a critical mass of readers.

    But I hardly need gratitude as a pretext to read Jeff’s blog. Jeff does a great job of keeping up with happenings in information retrieval, particularly those that span academia and industry like Yahoo BOSS and developments in blog search.

    I know that graduate students aren’t exactly encouraged to blog, since the currency of the realm is peer-reviewed publication. But I hope that Jeff keeps up blogging as a way to share his ideas with a broader audience.

    Categories
    Uncategorized

    Recent Sightings of Citing

    I’ve been excited to see some of the material from this blog making its way into the wider world. Here are some recent mentions I’ve found:

    Sharing ideas is what blogging is all about, and I’m delighted that others are finding some of these ideas worth citing.

    Categories
    General

    Duck Duck Go!

    Recently I’ve started using Twitter Search to find people talking about topics that interest me. One of my serendipitous finds was Gabriel Weinberg, who is reported to have single-handedly built a search engine called Duck Duck Go. I’ll suspend judgement on the name–after all, Beyond Search blogger Steve Arnold proudly calls himself an addled goose.

    Regular readers know that I’m highly skeptical of quixotic attempts to take on the web search market. And I have no reason to believe that Duck Duck Go will achieve meaningful market share in our lifetime.

    But Weinberg has truly done more with less. For example, when I do a query for SIGIR, I get a disambiguation dialog that bootstraps on Wikipedia. Yes, these are also the top two hits on Google, but Duck Duck Go presents them in a dialog that implements clarification before refinement.

    Unambiguous queries like Endeca or Warren Buffett need no clarification and instead return clean pages of top results from the major content types.

    I suspect that Weinberg is heavily leveraging Wikipedia. But why not? Why work hard when you can work smart?

    And Duck Duck Go can go off the rails, particularly for harder queries. I haven’t tested it enough to scientifically compare its quality to that of the major web search engines.

    Still, it makes a strong first impression and has a great interface. At the very least, Weinberg is raising the user experience bar for the common web search use cases. Check it out!

    Categories
    Uncategorized

    Taxonomies 101

    Thanks to Gwen Harris at Taxonomy Watch for calling my attention to James Kelway’s two-part article on creating user-centered taxonomies. It’s a great introduction to the subject.

    Categories
    Uncategorized

    Is the Cloud a Trap?

    A colleague of mine just pointed me to an article in Freedom to Tinker by Luis Villa entitled “Cloud(s), Hype, and Freedom”. It’s a nice analysis of some of the ideas that motivated two of my recent posts: “The Future is Mostly Cloudy” and “2.0 Means Give-to-Get”. Enjoy!

    Categories
    General

    2.0 Means Give-to-Get

    I’ve been living in a kaleidoscope of “2.0”s recently: web 2.0, enterprise 2.0, government 2.0. I know some of those are already shedding 2s in favor of 3s. But I wanted to reflect on the core tenet of these 2.0 visions: give-to-get.

    First, let me give immediate credit to the folks at the Greater IBM Connection, who put the phrase in my head at a recent summit. Those who have studied pre-2.0 history may recognize give-to-get as the Golden Rule: Do unto others as you would have them do unto you. I don’t know of any other precept that has achieved such universality across cultures.

    What does give-to-get mean in the context of these 2.0 approaches to technology?

    Web 2.0:

    • Blogs where readers’ comments become even more valuable than the original posts.
    • Sites that send users away when they can’t meet users’ needs themselves.

    Enterprise 2.0:

    • Application architectures evaluated based on their ability to play well with others.
    • Professional conversations increasingly taking place openly, outside company firewalls.

    Government 2.0:

    Some people who are far more expert than I have written about this stuff:

    What I hope is clear is that, despite the overplaying of “2.0” as a buzzword, the real value of this trend is in promoting one of the cornerstones of our success as a species: enlightened altruism.

    And, at the risk of beating a dead horse, I’d like to call attention to my own efforts to give credit where it’s due on this blog. I’ve consciously reduced internal linking, only using it to refer to earlier posts. But, as you’ll see, this “altruism” is quite self-serving. I’ve been delighted to see folks link to this blog in order to cite its ideas.

    Because what 2.0 is ultimately about is better information sharing for all of us. And for that, we all have to give to get.

    Categories
    General

    Alerting: Push or Pull?

    The other day, I was ranting about how Google is conflating the goals of search and advertising. One of the questions that we discussed over at Greg Linden’s blog was whether the difference between search and advertising is push vs. pull. But, as we concluded, that isn’t quite it. The difference is not the means, but rather the end: meeting the user’s needs rather than those of advertisers.

    And, indeed, the perfect example of a user-driven push interface is alerting. In a typical alerting system, users specify a running query that triggers whenever matching content is published. Certainly this is more akin to web search than to advertising.
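As a minimal sketch of that mechanism, the following assumes the simplest possible matching rule: a case-insensitive substring test. The `AlertService` class and its method names are hypothetical, not the API of any real alerting product.

```python
from typing import Dict, List, Set


class AlertService:
    """Toy alerting system: users register phrases, publishing triggers them."""

    def __init__(self) -> None:
        # Maps each user to the set of lowercased phrases they are tracking.
        self._alerts: Dict[str, Set[str]] = {}

    def subscribe(self, user: str, phrase: str) -> None:
        """Register a running query (here, a single phrase) for a user."""
        self._alerts.setdefault(user, set()).add(phrase.lower())

    def publish(self, text: str) -> List[str]:
        """Return, in sorted order, the users whose alerts match the new text."""
        lowered = text.lower()
        return sorted(
            user
            for user, phrases in self._alerts.items()
            if any(phrase in lowered for phrase in phrases)
        )


service = AlertService()
service.subscribe("alice", "enterprise search")
service.subscribe("bob", "SIGIR")

notified = service.publish("Notes from SIGIR 2008 in Singapore")
```

The user sets the query once and the system pushes matches as they appear, so the push is still driven by the user's stated need.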

    But, like web search, alerting runs into the challenge of adversarial information retrieval. If SEO is about maximizing exposure to users through high rankings in search results, there must be an analogous concept of maximizing exposure to users by triggering their alerts.

    For example, I happen to know that Gartner analyst Whit Andrews, like many of us, has an alert on his own name. By placing his name in this post, I am quite confident that he will read it.

    But why go after individuals when you can spam wholesale? Including the name of a company in a blog post is certain to attract the attention of a fair number of employees. Including the ticker symbol of a publicly traded company in a document will trigger stock tracking alerts. Et cetera.

    Others have noticed the ability to spam through alerting systems. I imagine that alerting systems will eventually engage in similar strategies to search engines to inhibit spam and decide what is relevant. And perhaps those same systems will include ads.
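What such a strategy might look like is an open question. As one naive illustration, and purely an assumption on my part, a system could flag documents that trigger an unusually large number of distinct alert phrases at once. The function names, the threshold, and the company names below are all made up.

```python
from typing import Dict, List, Set


def matched_phrases(text: str, alerts: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each user to those of their alert phrases that appear in the text."""
    lowered = text.lower()
    hits: Dict[str, List[str]] = {}
    for user, phrases in alerts.items():
        found = sorted(p for p in phrases if p.lower() in lowered)
        if found:
            hits[user] = found
    return hits


def looks_stuffed(text: str, alerts: Dict[str, Set[str]], max_phrases: int = 2) -> bool:
    """Naive heuristic: flag text that matches more than max_phrases distinct phrases."""
    distinct = {p for found in matched_phrases(text, alerts).values() for p in found}
    return len(distinct) > max_phrases


# Hypothetical alerts on fictional company names and a ticker-like symbol.
alerts = {
    "ann": {"Acme Corp"},
    "raj": {"ACME"},
    "li": {"Globex"},
    "sam": {"Initech"},
}

stuffed = "Cheap pills! Acme Corp ACME Globex Initech"
normal = "Globex announces a new search product"

stuffed_flag = looks_stuffed(stuffed, alerts)
normal_flag = looks_stuffed(normal, alerts)
```

A heuristic this crude would also flag legitimate roundups that mention many companies, which is the same precision problem search engines face when fighting SEO spam.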