The Noisy Channel

 

Search User Interfaces and Data Quality

December 3rd, 2009 · 29 Comments · General

One of the many things I’ve enjoyed in my first few weeks of working at Google is the opportunity to talk with many people who care about user interfaces and think about HCIR. Indeed, some of the folks working on “more and better search refinements” are just steps away from my desk. Very cool!

But working on the inside has also helped me appreciate what Bob Wyman tried to tell me months ago–that Google has no philosophical predilection towards black box approaches, but rather is only limited by what technology makes possible and what its engineers can implement. I’d qualify that slightly by saying that I perceive an additional constraint: Google does have a strong predilection towards data-driven decisions. Some folks have found that approach objectionable in the context of interface design.

Anyway, if you’re a regular here, then you’re probably predisposed towards HCIR and exploratory search. In that case, I’d like to take a moment to help you appreciate the challenge I face on a day-to-day basis.

Which one of these two statements do you most agree with?

  1. We need better data quality in order to support richer search user interfaces.
  2. Richer search user interfaces allow us to overcome data quality limitations.

On one hand, consider two search engines whose interfaces are designed to support exploratory search: Cuil and Kosmix. Sometimes they’re great, e.g., [michael jackson] on Cuil and [iraq] on Kosmix. But look what can happen for queries that are further out in the tail, e.g., [faceted search] on Cuil and [real time search] on Kosmix. Yes, the kinds of queries I make. :-) I don’t mean to knock these guys–they’re trying, and their efforts are admirable. Moreover, both generally return respectable search results on the first pages (in Kosmix’s case, through federation). But the search refinements can be way off, and that undermines the overall experience. I strongly suspect that the problem is one of data quality, along the lines of what others have argued.

On the other hand, some of the work that I did with colleagues at Endeca (e.g., work presented at HCIR 2008 on “Supporting Exploratory Search for the ACM Digital Library”) at least dangles the possibility that the second statement holds–namely, a richer user interface could help overcome data quality limitations. Interaction draws more of the information need out of the user, and the process may be able to mask imperfection in the data. For example, it’s clear to users–and clear from the search refinements–that [michael jackson beer] and [michael jackson -beer] are about different people. If we can just get that incremental information from the user, we don’t have to achieve perfection in named entity recognition and disambiguation.
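
To make the refinement example concrete, here is a minimal Python sketch of how one extra term from the user can split an ambiguous query without any entity disambiguation on the back end. The toy documents and the `matches` and `search` helpers are assumptions made up for illustration; they are not how Google or Endeca implement refinements.

```python
# Toy corpus: snippets about two different people named Michael Jackson
# (hypothetical text, invented for this sketch).
DOCS = {
    "d1": "michael jackson thriller album pop singer",
    "d2": "michael jackson beer hunter whisky writer",
    "d3": "michael jackson moonwalk pop tour",
}

def matches(query: str, text: str) -> bool:
    """Return True if text has every plain term and no term prefixed with '-'."""
    words = set(text.split())
    for term in query.split():
        if term.startswith("-"):
            # Excluded term: reject the document if the term is present.
            if term[1:] in words:
                return False
        elif term not in words:
            # Required term is missing.
            return False
    return True

def search(query: str) -> list[str]:
    """Return the sorted ids of all documents matching the query."""
    return sorted(doc_id for doc_id, text in DOCS.items() if matches(query, text))

print(search("michael jackson"))        # ambiguous: both people
print(search("michael jackson beer"))   # the beer writer only
print(search("michael jackson -beer"))  # everything except the beer writer
```

The point of the sketch is that the user's one added or negated word does the disambiguation; the engine only needs ordinary term matching.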

I think there’s some truth in both arguments. Data quality is a major bottleneck for effectively delivering an exploratory search experience, and data quantity, much as it helps, is not a guarantee of quality. Richer interfaces offer the enticing possibility of leveraging human computation, but they also introduce the risk of disappointing and alienating users. Even for an HCIR zealot like me, the constraints of reality are sobering.

And yes, speed and computational cost matter too. But hey, it wouldn’t be a grand challenge if it were easy!

29 responses so far ↓

  • 1 Seth Grimes // Dec 3, 2009 at 6:59 am

    Daniel, another approach to richer search user interfaces is to turn a certain class of data-quality problems — search-user errors — to your advantage. Google clearly does this with “Did you mean:” suggestions, presumably derived by mining query logs. This business is described by Marti Hearst in her book Search User Interfaces, although rather than pointing you directly there, I’ll point you to a recent article of mine, Text Data Quality, because I look at a selection of larger issues.

    But regarding “better data quality in order to support richer search user interfaces,” it seems to me that the limitations shown by your Cuil and Kosmix examples are *analytical* limitations related to semantic integration of information from disparate, distributed sources. That is, their data seems not to be of low quality, rather the assembly of search findings can be questionable.

    I give a similar example, the ability of Nielsen Buzzmetrics’ Blogpulse application to handle search-term variants, in another article, Text Data Quality: Mistakes and More.

    The other search data quality issue is in selecting source materials and processing them to create an index in order to avoid, taking a recent example, the situation where Jews are listed as a cause for AIDS, in that case because the engine didn’t distinguish “aids” and “AIDS”. I have an article coming out next week on this subtopic.
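
The “aids” versus “AIDS” confusion can be illustrated with a small Python sketch. It assumes, purely for illustration, that the index was built by lowercasing every token; the source does not say how the engine in question actually tokenized its text, and both function names here are hypothetical.

```python
def naive_tokens(text: str) -> list[str]:
    """Lowercase everything, so 'AIDS' and 'aids' collapse into one index term."""
    return text.lower().split()

def case_aware_tokens(text: str) -> list[str]:
    """Keep all-caps words (likely acronyms) as written; lowercase the rest."""
    return [w if w.isupper() and len(w) > 1 else w.lower() for w in text.split()]

sentence = "The charity aids people living with AIDS"

print(naive_tokens(sentence))       # the verb and the acronym become the same term
print(case_aware_tokens(sentence))  # the acronym stays distinct
```

A heuristic this crude would misfire on headlines written in all caps, so it is only a starting point for thinking about the trade-off.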

  • 2 Daniel Tunkelang // Dec 3, 2009 at 9:30 am

    Seth, those are good points. Indeed, one of the first lessons in improving site usability is to see what’s going wrong, e.g., common searches that return no results or have surprisingly low click-through rates.

    On your distinction between data quality and analytics limitations, you’re right that I’m eliding it. I guess that’s because I put extracted data in the same bucket as raw data and think of data quality as encompassing both. I realize that’s a bit sloppy, but I think the distinction between the two isn’t as clear as we might want it to be. Ground truth is hard to find–your “raw” data may in fact be the output of someone else’s extraction pipeline!

    Looking forward to your article about the Netbase incident.
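
The log analysis mentioned in this reply (finding common searches that return no results or draw few clicks) might look roughly like the Python sketch below. The log format, the sample rows, and the thresholds are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical log rows: (query, number_of_results, clicked_a_result)
LOG = [
    ("faceted search", 120, True),
    ("faceted search", 120, False),
    ("hcir workshop", 0, False),
    ("hcir workshop", 0, False),
    ("real time search", 300, False),
    ("real time search", 300, False),
    ("real time search", 300, False),
    ("real time search", 300, True),
]

def problem_queries(log, min_count=2, ctr_threshold=0.3):
    """Return (zero_result_queries, low_ctr_queries) among queries seen >= min_count times."""
    counts = defaultdict(int)
    clicks = defaultdict(int)
    zero = set()
    for query, n_results, clicked in log:
        counts[query] += 1
        clicks[query] += int(clicked)
        if n_results == 0:
            zero.add(query)
    frequent = {q for q, c in counts.items() if c >= min_count}
    zero_result = sorted(q for q in frequent if q in zero)
    # Click-through rate = clicks / impressions, computed per query.
    low_ctr = sorted(
        q for q in frequent
        if q not in zero and clicks[q] / counts[q] < ctr_threshold
    )
    return zero_result, low_ctr

print(problem_queries(LOG))
```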

  • 3 jeremy // Dec 3, 2009 at 4:21 pm

    Naturally there has to be some sort of reasonable minimum quality to the underlying data, whether raw or extracted/processed. And I’m certainly not against further data collection and processing to make that side of things even better.

    But it’s no secret that I think the biggest jumps in relevance are going to come from intelligent user interaction. I cite others here as my sources:

    http://irgupf.com/2009/03/25/good-interaction-design-trumps-smart-algorithms/

    http://irgupf.com/2009/11/04/good-interaction-ii-just-ask/

  • 4 jeremy // Dec 3, 2009 at 4:59 pm

    However, I will take issue with one of your other statements:

    But working on the inside has also helped me appreciate what Bob Wyman tried to tell me months ago–that Google has no philosophical predilection towards black box approaches, but rather is only limited by what technology makes possible and what its engineers can implement.

    Maybe this is true today, because folks at Google are finally coming to the correct understanding. But my experience from talking with folks at Google through most of this decade, and from talking to other research colleagues who have talked to folks at Google through most of this decade, is that there has been a strong bias toward black box approaches. The history of everything I’ve been told by Googlers is too long to recount here, but let me refer you to a statement by Eric Schmidt:

    http://www.boingboing.net/2004/10/29/transcript_of_google.html

    The other thing to remember is that the average person does not want to debug their computer. We prefer instead the idea of a person typing something in and Google — or someone else — figuring things out for you. [emphasis mine] But very few things are organized around that principle of simplicity; we love and appreciate the complexity in technology but people using the internet really don’t want that. When you see an ease of use breakthrough, it’s such a wonderful thing.

    Here, Schmidt is saying that Google is biased against user-driven interaction. Google would rather figure everything out for you.

    I think that may be changing now, witnessed in large part by your hiring :-) But it is revisionist history to say that Google has always been this way. Maybe Google currently does not have a philosophical predilection toward black box approaches. But for 90% of its corporate history, it has.

  • 5 jeremy // Dec 3, 2009 at 5:04 pm

    I’d qualify that slightly by saying that I perceive an additional constraint: Google does have a strong predilection towards data-driven decisions. Some folks have found that approach objectionable in the context of interface design.

    Generally, data driven methods are fine, as long as you’re clear about what you think it is you are measuring, and why, and how. Too often I am not convinced that the measurements that get used, and more importantly the metrics used to infer meaning from those measurements, are the correct ones.

    I’m of the philosophical camp that how you choose to measure something affects the outcome much more than what it is that you’re measuring.

    What I think may have been happening for many years is that the metrics they are using are themselves subtly biased against user interaction in search. So of course, if your metric is wrong, you’re never going to implement the right solution, because you’ll never be able to interpret the data in the right way. (Does that make sense?)

  • 6 jeremy // Dec 3, 2009 at 5:12 pm

    One last followup to the “philosophical predilection” issue. From Greg Linden, I quote Marissa Mayer:

    http://glinden.blogspot.com/2005/03/personalized-search-at-pc-forum.html

    “We need to get better not at doing searches, but at providing answers people are looking for,” [said Marissa.]…Mayer [said] that Google’s goal isn’t to force users to have to think about search …

    It sounds pretty clear to me that this is a philosophical predilection toward black box approaches.

  • 7 jeremy // Dec 3, 2009 at 5:14 pm

    BTW, sorry for the flurry of comments. :-)

  • 8 Sailesh // Dec 3, 2009 at 6:23 pm

    Thanks for your post, Daniel, and congrats on the new gig at Google. I’m part of the Kosmix team, and appreciate the feedback—as always.

    I agree that refinements for tailish queries can be a challenge, and are key to returning meaningful content. At Kosmix we address this through a combination of large amounts of data and cutting edge categorization algorithms. For example: deep vein thrombosis, kevin grevioux. Through our categorization algorithms, we are able to understand the context surrounding the topic and hence provide relevant search refinements.

    As for the “real time search” topic page you mentioned, we had a small bug that was affecting the quality of the page; this has been fixed, so you should see an improvement soon.
    Please keep the feedback coming—and if you ever find yourself visiting the Googleplex in Mountain View, let us know—we’re just down the street and would love to meet you someday!

  • 9 Daniel Tunkelang // Dec 3, 2009 at 9:58 pm

    Jeremy, no need to apologize–your comments are great! As you know, I’m a big fan of intelligent user interaction. The challenge is that implementing it does require a minimum level of data quality–whether raw or mined. Rich user interfaces implemented on top of impoverished data lead to very unintelligent user interaction–and different users have different tolerance thresholds for the hiccups.

    As for Google’s “figuring things out” for the user, that’s not quite the same as being a black box–if the figuring out is transparent to the user. But I agree that it isn’t in the spirit of HCIR. In any case, there is a lot of thinking about interaction now–and you can see that from the various efforts Google is making public.

    Sailesh, thanks for swinging by. I hope you know from my past posts that I’m a fan of Kosmix, and I really appreciate the efforts you are making to promote exploratory search on the web. Indeed, if you or one of your colleagues would like to write a guest post about it here, please let me know! Meanwhile, I’ll look forward to the improvements, and I do look forward to meeting face to face one of these days.

  • 10 Bob Wyman // Dec 4, 2009 at 2:58 am

    Daniel, Welcome to Google! Now that you’re “on the inside” we should continue the conversation over lunch. (I’m up on 10…)

    bob wyman

  • 11 Daniel Tunkelang // Dec 4, 2009 at 10:39 am

    Bob, thanks! Looking forward to meeting on 8. :-)

    http://www.yelp.com/biz/hemispheres-new-york

  • 12 Bob Carpenter // Dec 7, 2009 at 5:52 pm

    Did you actually like Google’s suggested refinements for [real time search] any better? It’s currently giving me: real time quotes, real time systems, real time processing, near real time, real time now, real time traders, opposite real time, dart real time.

    Arguably, these might be better for most users than a link to a Twitter search research project.

    The organic hits are about real time search, so I’m guessing the top hits aren’t what’s being searched for suggestions as in traditional query refinement. That’s not surprising given the compute cost for hits-based refinement. After all, Google doesn’t have limitless computing power nor a limitless search latency allowance.

    Google’s suggestions for [faceted search] are more relevant, mentioning some of Endeca’s competition: faceted search sharepoint, federated search, faceted navigation, solr faceted search, flamenco faceted search, faceted metadata search, drupal faceted search, faceted search wikipedia.

    I have to agree that I was shocked you were going to Google because I also perceived them to have a strong distaste for anything other than the simple search box. Clearly rolling out refinement suggestions (albeit way below the fold) means it’s not impossible for them to add features. If Google’s more open minded than they appear in public, maybe the executive quotes are just marketing spin. After all, we’d all love the search engines to figure out what we want without our having to think.

  • 13 Daniel Tunkelang // Dec 7, 2009 at 6:39 pm

    For better or worse, I’m already on record on this one. :-) In any case, you’re right that computing power isn’t limitless, and that Google historically has taken a strong stance on minimizing query latency (see #3).

    Anyway, I do understand the shock. But I will say that, now that I’m on the inside, I don’t see distaste for features like search refinements. Indeed, they’re already there–even though there’s clear room for improvement. The challenge is innovating in the interface *and* delivering high quality–all within a reasonable budget of computation and latency. It’s hard!

  • 14 jeremy // Dec 8, 2009 at 2:10 pm

    The challenge is innovating in the interface *and* delivering high quality

    Isn’t one of the larger operational concepts behind exploratory search that you don’t place all the burdens of high quality on the search engine, but instead interactively utilize the intelligence of the user, with some minimum standard quality offerings from the search engine, to come to a better overall result? The high quality comes not from more data and smarter algorithms, but from intelligent interaction on decent raw data.

    I guess the way I see it is that by utilizing more user intelligence, you can also reduce the demands on computation and latency: you spend less effort trying to work it all out computationally, yourself, on the backend. HCIR is not only fully compatible with better results and experiences, but does not have to increase computation to do so.

    That’s more what I meant by “black box” above. Black box is actually made up of two concepts: “black” and “box”. Sure, you can make the “black” into “transparent”, but then you still have a “box”. And what good is transparency without interaction, without the ability to reach into that box and alter/play with those things that you can transparently see? Transparency without interactive adjustment, while better than we have it now, is still pretty much a non-starter.

    So no matter if the box is black or clear, Google has historically long been in favor of keeping the box sealed shut. Again, there are small pieces of evidence that this attitude is changing. But Google still has a public perception problem; they’ve made statements to the contrary for so many years, that it’s hard now to know what to believe. You may see a different attitude, on the inside. But on the outside, I (we? at least Bob does too…) see contradiction and confusion.

  • 15 jeremy // Dec 8, 2009 at 2:14 pm

    More evidence of Google’s long-standing, on-going(?) bias in favor of sealed-shut-box approaches:

    http://news.zdnet.co.uk/internet/0,1000000097,39254657,00.htm

    Saleel Sathé, a lead product manager in Microsoft’s MSN Search division predicted that there will be big changes in the user interface of Internet search engines — so that users no longer type a few words in a single search box.

    “Search engines have shot themselves in the foot by providing a search box, where users provide relatively little information,” Sathé said, during a panel on search technology at the conference.

    “Over the next five years we will see significant improvements in how [user interfaces] operate. The average search query is 2.3 words… but if you asked a librarian for information you would not just give them 2.3 words — you would give them the opportunity to give you the rich detailed answer you want.”

    But Matthew Glotzbach, the director of product management for Google’s enterprise products, disagreed, claiming that advances in technology will mean that users will not need to provide more information.

    “In the distant future we will not be able to get you to take more action, because we will get close enough with what you give us. A lot of emphasis will continue on doing that in the background — getting the technology to figure out [what you want],” he said. “Larry Page [the co-founder] of Google often says, ‘the perfect search engine would understand exactly what you mean and give back exactly what you want’.”

    Doing it all in the background, for you, without any interaction, is essentially keeping it in the “box”. No matter what color that box is.

    It appears that Google is changing, that they’re more willing to take cues from users, adopting the stance taken by Microsoft 3-4 years ago. Again, this is good. But it goes against the image that Google has been cultivating for itself for a long time now.

    I think what might help, Daniel, is if you could take a larger role in explaining to the rest of us just what those attitudes are, on the inside. More public, semi-official Google statements on the value of exploratory search, HCIR, interactivity, dialogue, etc. would be invaluable.

    Does your current position allow you to take such a role?

  • 16 Daniel Tunkelang // Dec 8, 2009 at 10:45 pm

    There are lots of ways to improve search experience through HCIR. I’d be delighted to see less reliance on ranking and more emphasis on a conversation with the user. Such an approach reduces the dependence on successfully inferring the user’s intent from a search query, but may well increase the dependence on good data quality to support the conversation. It would be even better to leverage the user’s effort to make up for gaps in data quality. But it may be unreasonable to expect users to make the requisite effort. There’s a minimum level of data quality needed to get most users to adopt HCIR interfaces.

    On your other point: I disagree that transparency without explicit support for interaction is a non-starter. Users will serve themselves by rewriting queries and learning from results. But I readily concede that explicit support for interaction would be much better–assuming the quality of the experience meets the expectations set by the interface.

    As for making “semi-official Google statements”, I’m certainly not ready to do that yet! Indeed, I’d rather preserve my independence by speaking for myself and not for my employer.

    In any case I don’t doubt that some of my colleagues are primarily concerned with returning high-precision results on the first page. I’m in favor of that myself for queries where the intent can be inferred from a couple of words. But the panel you cite is from 2006. I think Google has become more open-minded about HCIR in the past three years. While I’m sure I have colleagues who would stand by the quote you cite, I know I have many who wouldn’t.

  • 17 jeremy // Dec 11, 2009 at 2:02 pm

    On your other point: I disagree that transparency without explicit support for interaction is a non-starter. Users will serve themselves by rewriting queries and learning from results.

    When I said “transparency without interactive adjustment” I didn’t (necessarily) mean that interactivity had to be implemented using an explicitly pulsating button, dynamic slider, or shiny link or something like that. Interactive adjustment is simply the ability to reach into the search engine in some manner, and get that algorithmic box to give you different results than the ones it wants to give you by default. To be able to instruct the box to change its internal functionality in some manner that better suits your needs. The interface for making that tweak could even be a command-line option.

    What I am complaining about is when there aren’t even command-line options for various types of interactive adjustments a user wants to make.

    If the transparency that the engine is showing you is text-based transparency, then Google does indeed allow you to reach in and change the text terms that you use to get the results back out of the box. It will still do funky automatic substitutions because it thinks it knows better than you, and sometimes it automatically ignores some of the words you use, and weights other words higher. But it also allows you to get around that by using the “+” operator to tell Google to stop ignoring that query term, or the “-” to tell it to ignore other terms. So Google does provide limited support for interactive adjustment for that mode (text), already.

    But what about the rest of the 400+ signals/modes that Google uses to rank/select pages? How is a user supposed to rewrite their query, for those 400+ different modes? What I am saying is that any information (transparently provided) that involves those other modes will be useless to the user for the purpose of issuing additional queries, if those modes/methods of interaction are not supported by the box. And most of them are not. I’ll say it again: Google would rather make all the decisions for you, and not give you access into the box: “A lot of emphasis will continue on doing that in the background – getting the technology to figure out [what you want]”.

    For example, let’s talk about time, i.e. when was a web page first crawled (or how long has the web page been around without changing)? Suppose that you have search engine transparency, in that it displays information (maybe even a histogram) on when each piece of information in your results set was first indexed/last changed. And because of this transparency, suppose that you start to notice that the information that is best satisfying your need is getting progressively older/earlier. The transparency gives you that awareness.

    What is the “rewriting of the query” that I am supposed to do, to give me results that are sorted in “least recent” order? As we talked about a few months ago, Google allows you to sort by most recent, or by past 1 day, 1 week, 1 month, etc. But those are all very different than “least recent”. So how am I supposed to learn from the (transparent) results and rewrite my query? The search engine does not provide the ability for me to do that sort of interactive adjustment. It does not provide explicit support for the type of interaction I am trying to engage in. Sure, there are ways around it: I can issue 4,015 queries, successively, in order to find the best results from 1 day ago, 2 days ago, 3 days ago, etc. up to the past 11 years of Google’s existence. But we’ve already agreed that this is, frankly, unreasonable. The search input mechanism, the query reformulation that Google allows, is not open box enough for you to express your desire to see least recent information first. The box is sealed shut, with no way in. Thus, transparency really is dead in the water, if the information that is being shown is ultimately unalterable. While awareness does not hurt you, awareness without an ability to take action does not help you.

    Here is another example where the box is not open, where there is transparency without interactive adjustment: Popularity. Google shows you the PageRank of its SERPS if you install their toolbar. Maybe in the future they might even start showing a histogram of result PageRanks, in the results themselves. That is transparency. Now suppose I learn from this transparent information that a lot of the pages that I am most interested in are not popular, have low PageRank. Tell me what query I am supposed to type in order to reformulate my query so as to instruct Google to start giving me lower PageRank results. I can’t do it. It’s transparency without manipulability, and it doesn’t really help me to know that a lot of the pages that I like have low PageRank, because I can’t do anything about it. There is no possible way for me to reach into Google’s closed ranking box and give some sort of instruction or direction to rank the information in a manner that is more consistent with what I want to see. So what use is knowing a PageRank, if I can’t do anything (cannot rewrite my query) with that information?

    I could go on with even more examples: Sentiment. Suppose Google has a sentiment classifier, and makes that information transparent. Well, how am I supposed to rewrite my query, to tell Google’s ranking algorithm to give me all the positive sentiment items, and none of the negative ones (or vice versa)? The rewriting is impossible. The box does not support that sort of interaction. It’s sealed shut. There is no way for me to make use of the information that I have learned from the results to ask another query which incorporates that new, learned information.

    And that’s what I’m talking about. That’s transparency without interactive adjustment. And I do still think that it’s a non-starter.
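
The “least recent” example in this comment can be made concrete with a short Python sketch. It assumes a hypothetical result list in which each hit carries the date it was first indexed; the URLs and dates are invented. It also reproduces the arithmetic behind the 4,015-query workaround: one date-restricted query per day over roughly 11 years.

```python
from datetime import date

# Hypothetical results: (url, date first indexed). All values are made up.
RESULTS = [
    ("example.com/a", date(2008, 5, 1)),
    ("example.com/b", date(1999, 3, 14)),
    ("example.com/c", date(2004, 11, 30)),
]

def least_recent_first(results):
    """Sort results by first-indexed date, oldest first."""
    return sorted(results, key=lambda r: r[1])

# The workaround described above: one query per day for about 11 years.
queries_needed = 11 * 365  # ignores leap days

for url, first_indexed in least_recent_first(RESULTS):
    print(first_indexed.isoformat(), url)
print(queries_needed)
```

On a small in-memory list the sort is trivial, which is the commenter's point: the obstacle is that the interface offers no way to ask for it. Whether such a sort is cheap at web scale is a separate question that this sketch does not address.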

  • 18 jeremy // Dec 11, 2009 at 2:34 pm

    As for making “semi-official Google statements”, I’m certainly not ready to do that yet! Indeed, I’d rather preserve my independence by speaking for myself and not for my employer.

    I guess what I am after is someone who will take a role like Matt Cutts, but for the domain of exploratory information seeking. Cutts gives semi-official advice to webmasters about Google philosophies and best practices around web authoring/SEO. He does not reveal any internal Google secrets, of course, but he does provide helpful insights into Google thinking processes.

    Right now, I as a user of Google services have no idea how I should approach Google search, when it comes to information seeking. I have no idea what the “best practices” are for users wishing to utilize Google to its full potential. So what I end up doing is hammering out a couple of query terms and hoping for the best. When it works, great. When it doesn’t, I have very little consistent idea how to change/modify/improve/adjust, other than to guess at a couple of different terms, and hammer again.

    What Google really needs, therefore, is someone who can interpret search strategies, the way Cutts interprets SEO strategies.

    In any case I don’t doubt that some of my colleagues are primarily concerned with returning high-precision results on the first page. I’m in favor of that myself for queries where the intent can be inferred from a couple of words. But the panel you cite is from 2006. I think Google has become more open-minded about HCIR in the past three years.

    Three years really isn’t that long ago. Even in “internet time”. Why? Because while computer memory, disk space and processing power double all the time, human nature and the information seeking strategies that spring from that human nature don’t change every three years. Human nature is fundamentally the same today, in 2006, in 2003, in 2000, and in 1998.

    Google seemed fundamentally opposed to HCIR/exploratory search from 2003 to 2006. And from 2000 to 2003. And from 1998 to 2000. So I believe you that their attitude is starting to change. But it didn’t change for so long, that it’s hard to know, really, what the philosophy is now.

    So that’s all I mean about “semi-official” statements. Someone who can interpret or translate internal Google philosophies into user-friendly external seeking strategies. This can of course be your own voice, and you can do the interpretation through your own filters and understandings. I’m not saying that you have to be an “official” Google speaker. I’m just saying it would be nice to have some sense of how I, as a user, should be understanding what it is that Google is trying to do for me.

  • 19 Daniel Tunkelang // Dec 11, 2009 at 10:47 pm

    You know I’m a fan of transparency, and there are certainly ways in which I’d love to make even the existing interface more transparent–such as exposing query rewrite rules and letting users have complete control over them. As you can imagine, my authority doesn’t extend that far. :-)

    Moreover, I begrudgingly accept that, at least for the foreseeable future, some of the signals and query interpretation process need to remain secret. While most users have little interest in overriding the default ranking, spammers and unscrupulous SEOs would use every piece of information at their disposal to ply their trade. The result of disclosing those secrets would likely be–at least in the short term–net negative for most users. For all of my enthusiasm about HCIR, I don’t want to take the approach of making the status quo worse so users are forced to make more effort just to get reasonable results. That might advance the HCIR agenda, but it would be at the expense of the very users HCIR is supposed to help!

    As you might remember, this was the very subject of one of my first blog posts. I haven’t given up on the aspiration to combat spam while still being transparent. But I admit that I don’t have a solution in hand, and that it would be reckless to prematurely throw away an approach that, while not perfect, is good enough for an enormous user base over an enormous set of information needs.

    I also suspect that some of the queries you’d like to be able to make might incur significant computational costs. I know that some of mine would. That is another consideration–and not a trivial one.

    Anyway, while I can’t speak for Google, I will try to evangelize Google’s HCIR initiatives in my own voice. And to explain, to the extent I can, why those initiatives take a lot of time and effort to deliver.

  • 20 jeremy // Dec 14, 2009 at 5:21 pm

    Well, it’s hard for me to see how spammers could take advantage of a “sort by least recent / first indexed” user capability. The only way to come up at the top of that list would be to go back in time and create the page ten years ago ;-)

    I also have a hard time seeing how spammers could take advantage of “query by sentiment” and “query by low popularity”. If anything, the current state of the art in Google search is either to (1) not use sentiment, (2) prefer results that contain the “dominant” sentiment, or (3) automatically create a diversity of sentiment in the top 10 results. I’m not sure exactly what Google does, but arguably it is one of these three options.

    But whatever it is that Google currently does, spammers only have to target the current system. If positive sentiment results are currently favored by Google, then spammers only have to create positive sentiment spam pages in order to “conquer” that particular ranking signal. Even if Google results contain a diversity of sentiments, spammers have only to match (key onto) the sentiment of the top-ranked result.

    If, however, you give the user the option to choose positive or negative, then a spammer has to create two different spam pages, one with a positive sentiment, and one with a negative sentiment. That’s the only way they can rank highly to both user query formulations.

    And here’s the kicker: By creating two pages that are almost exactly the same in every other respect (inlinks, non-sentiment keywords, doclength, etc.) the spammers might actually make it even easier for Google to detect spam. Why? Because this near-duplication in every aspect except for sentiment is unlikely to occur due to chance. The existence of near-duplicate content like this would therefore be a strong indication of spam content. And therefore much easier to filter out.
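A minimal sketch of the check described above, assuming each page has already been reduced to a handful of features. The `Page` fields, the Jaccard measure, and the thresholds are all illustrative assumptions, not anything Google is known to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    # Hypothetical per-page features; a real system would have many more.
    url: str
    inlinks: frozenset
    keywords: frozenset  # non-sentiment terms only
    doc_length: int
    sentiment: str  # "positive" or "negative"


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_sentiment_twin(p, q, threshold=0.9, length_tolerance=0.1):
    """Flag two pages that match closely on everything except sentiment."""
    if p.sentiment == q.sentiment:
        return False
    longer = max(p.doc_length, q.doc_length)
    length_close = longer == 0 or abs(p.doc_length - q.doc_length) / longer <= length_tolerance
    return (
        length_close
        and jaccard(p.inlinks, q.inlinks) >= threshold
        and jaccard(p.keywords, q.keywords) >= threshold
    )
```

A pair flagged this way would be a candidate for spam review rather than proof of spam, since the thresholds here are arbitrary.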

    You have a point when it comes to computational complexity. Maybe some of these ranking functions are too complex to implement at scale.

    But I don’t quite buy the spammer argument. Spam prefers a monoculture, does it not? Search as it currently exists is a monoculture. More user control of the search process dilutes that monoculture, adds heterogeneity. Heterogeneity makes it more difficult for spam to take hold. Just like real viruses / pathogens.

  • 21 Daniel Tunkelang // Dec 15, 2009 at 1:23 am

    I’ll concede your point on “sort by least recent”–but I do wonder about the computational implications of supporting such a sort, particularly if default ranking tends to favor more recent, rather than less recent, results.

    As to your monoculture point, that’s basically the argument I made last year, and I’m not recanting. Though, as I noted even then, personalization does make it possible to diversify without transparency–and Google did just roll out personalized search in a big way.

    As a power user and an HCIR advocate, I’d like to see more transparency. But I also have seen enough to know that there is a cottage industry of reverse-engineering Google’s default ranking–which I think we have to accept is what most users would stick with, even given more options. So, as strongly as I feel about transparency, intellectual honesty compels me to admit that enumerating the signals that Google uses would help with that reverse-engineering process.

    That said, I do see places where there is low-hanging fruit for HCIR–and you can be sure that I’m pursuing those opportunities!

  • 22 jeremy // Dec 15, 2009 at 3:44 am

    Daniel,

    I’ll concede your point on “sort by least recent”

    I wasn’t attempting to get you to concede anything :-) Really, I am more just trying to sort out all the various overlapping issues together with you, challenging your interpretations, and being challenged by you at the same time. It’s all good.

    I guess I’m just not understanding.. how would “least recent” be more computationally complex? In whatever secret sauce weighting function Google currently uses… the “sort by most recent” feature (signal) is probably something like (todaysDate – documentDate). A single arithmetic subtraction, right? So what’s the problem with changing that signal to (1/1/1970 – documentDate)? The computational complexity is the same, is it not? A single subtraction?

    Or are there implications for the index, that the inverted lists are stored in a most-recent order so as to minimize seek time, or something?
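The two subtractions above can be written out as sort keys. This is only a sketch of the arithmetic in the comment, with documents assumed to be `(doc_id, doc_date)` pairs; it says nothing about how a real index stores or retrieves them.

```python
from datetime import date


def newest_first(docs, today):
    # Key is (today - doc_date): the smallest age sorts first.
    return sorted(docs, key=lambda d: (today - d[1]).days)


def oldest_first(docs, epoch=date(1970, 1, 1)):
    # Key is (epoch - doc_date): the largest (least negative) value is the oldest page.
    return sorted(docs, key=lambda d: (epoch - d[1]).days, reverse=True)
```

Either way the per-document cost is one subtraction, which is the point being made; whether the index layout favors one order is a separate question.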

    So, as strongly as I feel about transparency, intellectual honesty compels me to admit that enumerating the signals that Google uses would help with that reverse-engineering process.

    Well, then let’s go back to one of the suggestions that I made in that earlier music IR/HCIR thread:

    http://thenoisychannel.com/2009/02/28/a-sound-approach-to-exploratory-music-search/#comment-2175

    Essentially, what I am saying is that you can still have transparency, without exposing every single little detail of your ranking function. How? By having “macros”, or sub-global ranking functions that capture a small set of features, which together represent some aspect or mode of search. The user can then pick and choose among these macros, and control his/her experience using them, without having to go so far down into the ranking signal as to say that the user wants a 0.073 weight on the kurtosis of the FFT signal.

    Think of it as a “mid-level” query.
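One way to picture such macros is as named bundles of signal weights that the user selects by name. The macro names, signal names, and weights below are invented for illustration; the raw weights stay hidden behind the macro label.

```python
# Hypothetical macros: each bundles a few signals with fixed, undisclosed weights.
MACROS = {
    "fresh": {"recency": 1.0, "popularity": 0.2},
    "obscure": {"popularity": -1.0, "text_match": 0.5},
}


def score(doc_signals, selected_macros, macro_weights=MACROS):
    """Sum the weighted signals of every macro the user switched on."""
    total = 0.0
    for name in selected_macros:
        for signal, weight in macro_weights[name].items():
            total += weight * doc_signals.get(signal, 0.0)
    return total


def rank(docs, selected_macros):
    """Order documents by descending score under the chosen macros."""
    return sorted(docs, key=lambda d: score(d["signals"], selected_macros), reverse=True)
```

The user only ever sees the names `fresh` and `obscure`, so the control is mid-level: more than a single default ranking, less than per-signal weights.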

    Though, as I noted even then, personalization does make it possible to diversify without transparency–and Google did just roll out personalized search in a big way.

    Now it’s my turn to concede the point :-) Indeed, personalization does shake up the monoculture a bit. But only a bit. Each user still has a monolithic, non-decomposable ranking function that governs his/her experience. Personalization is like starting with a thousand acres of Russet potatoes and then subdividing that field into millions of 10′x10′ plots and building walls around each plot. Sure, you’ve subdivided the field, and each person=plot is now “personalized” because it is separated from everyone else. But each plot still only contains Russet potatoes. Because of the soil, some of the plots might grow bigger plants, some might grow smaller plants. But they’re still all Russets.

    I would be more impressed by an approach to personalization that let the user dig up his/her 10′x10′ plot of potatoes and plant swiss chard instead.

    and you can be sure that I’m pursuing those opportunities

    And I continue to wish you the best in this, and hope that you enjoy our discussions. I certainly do.

  • 23 jeremy // Dec 15, 2009 at 3:48 am

    (BTW, I’m sure you know that Russets are the varietal used for making McDonald’s french fries. Just to continue the “Google — I’m Lovin’ It” analogy ;-) )

  • 24 Daniel Tunkelang // Dec 15, 2009 at 9:37 am

    Yes, my point about “least recent” was about indexing. In general, knowing something about how people will sort their results gives you opportunities to optimize an index for that sort. Rare access patterns into an index can be disproportionately expensive to support.

    And your point is well taken about macros. I do think it should be possible to find a middle ground that offers enough transparency to be useful but enough obscurity to maintain the current balance of power between spam and anti-spam. Part of the challenge of making the case is that we know the spammers will exploit any sliver of signal they get, while the users won’t necessarily take advantage of the transparency. We need to educate users to make this a net win.

    As for the personalization potato metaphor, I suppose it’s a question of the speed and meaningfulness of mutations in microcosms. I don’t know the answer there. But, even assuming that personalization disrupts the monoculture enough to deter spammers, I agree that diversity with transparency would be more satisfying. That said, it’s hard to make consumers eat their greens both literally and metaphorically!

  • 25 jeremy // Dec 15, 2009 at 1:21 pm

    Coupla minor points:

    If it is indeed a question of indexing, then that’s still not computationally complex, especially at query time. You simply build a parallel index, and sort it in reverse order, non? At query time, you can answer millions of “oldest first” queries just as quickly as you answer the “newest first” queries. It’s offline complexity, and therefore totally manageable, I would think.
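A toy version of the parallel-index idea, assuming documents are given as `doc_id -> (doc_date, text)` and that whitespace tokenization is good enough. Both posting lists are built offline, so at query time either order is a prefix read.

```python
from collections import defaultdict
from datetime import date


def build_date_sorted_indexes(docs):
    """Return (newest_first_index, oldest_first_index), each mapping term -> postings."""
    postings_by_term = defaultdict(list)
    for doc_id, (doc_date, text) in docs.items():
        for term in set(text.lower().split()):
            postings_by_term[term].append((doc_date, doc_id))

    newest, oldest = {}, {}
    for term, postings in postings_by_term.items():
        postings.sort(reverse=True)      # newest date first
        newest[term] = postings
        oldest[term] = postings[::-1]    # the parallel, reversed copy
    return newest, oldest


def top_k(index, term, k):
    """Read the first k doc ids from a pre-sorted posting list."""
    return [doc_id for _, doc_id in index.get(term, [])[:k]]
```

The cost is paid in storage and build time: the second index doubles the posting lists, which is the kind of offline expense the comment considers manageable.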

    And yes to user education. I asked Cutts about this at your panel at SIGIR this year, why Google doesn’t do more by way of education. Tool tips that appear on mouseover, explaining what everything is/does, for example. He shrugged.

    Re microcosmic mutation: Yeah, I don’t have a sense, either, about how fast users will mutate into truly heterogeneous subcultures. Actually, more broadly, I’ve been having this discussion w/ Greg Linden over on his blog for about 4 years now on exactly how useful (and/or necessary!) implicit personalization is. What percentage of the average user’s queries even need to be implicitly personalized? The classic use case is always given as one user wanting “Jaguar” the car, another wanting the animal, and another wanting the MacOS. Clarifying ambiguity. And yet when I look at my own query habits, I don’t feel like I issue more than 1-2 ambiguous queries a month. So how often is personalization really even going to kick in?

    I realize that the answer to that question is likely a valuable trade secret, so I don’t expect an answer. But that kinda adds to my frustration as a user: I’d like some sort of awareness tool that told me when implicit personalization was being applied to my query and which results were getting boosted because of this personalization. And I’d like that awareness tool to also be clickable, which would have the function of turning the personalization off.

    Somehow, I don’t think I’m going to get what I want.

  • 26 Daniel Tunkelang // Dec 15, 2009 at 11:29 pm

    Let me turn the question around. To the best of my knowledge, no web-scale search engine has ever offered users the functionality you’re describing–specifically letting the user specify how results are sorted. Resorting the top default-ranked results doesn’t count. I’m curious to hear your thoughts as to why not.

  • 27 jeremy // Dec 16, 2009 at 2:16 am

    I’ve got 4-5 different thoughts. Not sure which one to get into. Perhaps in Feb?

    I think, perhaps, that Whit had it right. You can try and strike balances, but at the end of the day you are beholden to the people who give you money. Not to the people who cause that money to flow, but to the people who actually hand the money over to you. And that ultimately affects how far you can…or more importantly how far you cannot…go with any solutions that you offer:

    http://thenoisychannel.com/2009/08/03/sigir-2009-day-3-industry-track-analyst-panel/

    Not that I agree with what he said. I sided with Sue and Theresa about what “should” be. But Whit was describing what is. And unfortunately I think he might be right.

  • 28 Bharath // May 1, 2010 at 4:54 am

    In the context of Cuil and Kosmix not doing well for queries like “faceted search”, I’d like to introduce an alpha product that’s just released. Nuggetize. I think you’ll find much better facets, organized into relevant categories here. Eg: http://nuggetize.com/nuggetize?q=faceted+search&session_name=

  • 29 Daniel Tunkelang // May 1, 2010 at 5:38 pm

    Bharath, thanks for letting me and readers here know about it. And I saw your blog post about it at http://bharathruminates.blogspot.com/2010/05/nuggetize-faceted-search-for-web.html. If you’re interested in writing a guest post about it here, let me know.
