Let me preface this post with a clear disclaimer: I work at Google, but the views I express on this blog are my own personal views.
Last week, Google head of webspam Matt Cutts posted a full-throated defense of Google’s transparency on Google’s European Policy Blog in response to complaints that a few companies raised to the European Commission. Long-time readers of my blog know that I’m a big fan of search engine transparency and have made my own calls on this blog for Google to be more transparent. The fact that I work at Google now doesn’t change my values. But being on the inside has informed my perspective.
In particular, as Matt elaborates in his post, Google deserves more credit for transparency than it often gets from its critics. For example, Google has published:
- “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, which not only details the formula for PageRank, but also mentions other signals that Google uses to rank search results: anchor text, location of query terms within documents, proximity of query terms, etc.
- details of its key infrastructure innovations: MapReduce, the Google File System, Bigtable, and Protocol Buffers
- hundreds of research papers by Googlers in diverse areas of computer science
He goes on to describe the various webmaster tools and social media resources that Google has made available. The popularity of these tools is a testament to their utility.
Still, as Matt points out:
we don’t think it’s unreasonable for any business to have some trade secrets, not least because we don’t want to help spammers and crackers game our system. If people who are trying to game search rankings knew every single detail about how we rank sites, it would be easier for them to ‘spam’ our results with pages that are not relevant and are frustrating to users — including porn and malware sites.
As I blogged back in 2008, I still hope that someday we won’t have to rely on a relevance analog of security through obscurity in order to deter spam and abusive SEO practices. But I recognize that we haven’t developed such an analog, and hence that complete transparency today for web search ranking algorithms would have a far greater downside than upside for ordinary users.
I suspect that a prerequisite for complete transparency in search is moving from a ranking-based retrieval approach to a set-based approach. For many web search information needs (e.g., navigational queries), it’s hard to see how users would benefit from such a radical change. For queries that represent more exploratory information needs, a set-based approach would be (at least in my view) far preferable to one based on ranking. But there’s a lot of work to do on the content side before such exploratory interfaces for the web are usable.
In summary, I’m happy to see Matt taking a public stand in Google’s defense. I don’t always agree with my employer’s decisions, but I do believe that my colleagues act in good faith and with good intentions. I understand how many people–especially site owners–fixate on whatever Google keeps secret. In a world where so many people compete for attention, information is power. Google tries to provide maximum quality to users while keeping the playing field level for site owners. As Google Fellow Amit Singhal points out, “this stuff is tough”.
17 replies on “Google and Transparency”
While this topic deserves a longer response, let me just quickly say that I don’t think Google should be trying to claim any credit for publishing the “Anatomy of a Large Scale Hypertextual Web Search Engine” paper. Why? Because Google the company never published this paper. Sergey and Larry, the grad students, published this paper. Their affiliations on the paper are “Stanford”, not Google.
So it is disingenuous, at the very least, to give Google credit for this. This information was released (at least submitted for publication with the intention of being released) before Google the company existed. So whatever L&S’s attitudes about transparency were as grad students, I think it changed once they became incorporated as Google.
And I cannot count the number of times I’ve seen Google talks in which the Google speaker has intentionally gone out of his or her way to mention that the form of PageRank that they use in today’s engine is nothing like the one published in the paper. I don’t know if that’s the truth, or a deliberate attempt at obfuscation. But the upshot is that Google is even less transparent than claimed, because it has the feel of trying to throw people off track: now one doesn’t know what to believe anymore, the published paper or the research talks.
So to go back now and try and claim “transparency credit” for this paper… makes me uneasy.
And sure, there are hundreds of papers by Googlers. But how many of those papers were done (primarily) by summer interns, who already came to Google with a solid sense of a research idea? And of the ones that are Google-internal only, how many of those are on production-code systems, vs raw research? Microsoft publishes hundreds of papers as well, the vast majority of which are never incorporated into shipped products. Does that also make Microsoft transparent?
Given that Larry and Sergey are founders and major shareholders, I think Google should get at least a little credit for their actions. 🙂
And yes, the measure doesn’t conform precisely to the paper–and in any case it’s only one factor. But, to put it in the context of Matt’s post: “One of the most widely-discussed parts of Google’s scoring has always been PageRank. That ‘secret ingredient’ is hardly a secret.”
As for the papers, MapReduce, the Google File System, Bigtable, and Protocol Buffers are essential production tools. I agree with Matt that Google deserves credit for enabling Hadoop, thus supporting its largest web search competitor.
In any case, no one at Google denies keeping the precise details of ranking secret. But I’d say that critics who extrapolate that the entire ranking algorithm is a black box overstate their case, given how much has been disclosed. Unless you feel that being transparent is like being pregnant–an all or nothing deal. I don’t believe that.
Of course, the other issue is Google’s motive for what secrecy it does keep. As per an official blog post by Google’s counsel, Google stands accused of demoting the positions of competitive sites. I don’t personally believe this to be the case, nor have I even seen anyone present evidence for it. I do understand how some people–particularly site owners unhappy with their ranking–may feel Google is guilty unless it discloses everything in order to prove itself innocent. Given the stakes, I don’t think it’s reasonable to expect Google to make such a sacrifice–at the expense not only of its own competitive position but also of its users.
Jeremy, I know you have many points of disagreement with Google’s approach, particularly with regard to transparency. Perhaps the extent of Google’s disclosure isn’t enough to earn a passing score in your book. I’m hardly giving it an A+. But I hope we can agree that it isn’t black and white.
It is no wonder that Google’s algorithms get a reputation for being opaque. For a long time, adding a search term to a Google query would give a smaller result set; recently I added a search term and the result set got larger! A long, long time ago, a Google query would give the same results all over the world; now I get wildly different results depending on where I am (being in Belgium, I get mostly Dutch-language pages). And for about a decade connecting search terms with a dash operator meant that the terms should occur as a phrase in the document — not any longer. No wonder I use Google with a feeling that I don’t really know what’s going on under the hood.
Anyway, there are good reasons for obscurity for the commercial and search quality reasons you mentioned. However, in order to advance scientific understanding of search quality, there is also a need for transparency. I mean that it should be possible for IR and HCIR researchers to run different algorithms and approaches on a Google-size index and with real-life queries, and publish the algorithms and results. It would be nice to see Google contribute to such an endeavour, in order to prevent a situation where the knowledge about how to deliver a great search experience is confined to only a handful of big companies.
Gregor, I understand changes in search behavior over time can be confusing, but I think Google has tried to explain many of them. Contextual synonym inference may explain how adding a term makes the result set larger–though I can’t know for sure without seeing your particular example. Localization of results is something Google has publicized with pride. And I’m not sure why you’re using a dash rather than quotation marks to specify phrases–you might want to look at this help page.
In any case, I agree that it would be great for researchers to run different algorithms and approaches on a Google-size index and with real-life queries. I think researchers are in a position to build a sufficiently large index–in part because of the publications I describe above. But what researchers really want–at least from what I have heard–is access to query logs and actual user traffic. That raises major privacy concerns for users, as well as concern about abuse by spammers. Don’t forget what happened a few years ago.
Well, I liked the dash because I am lazy – it was just so easy to replace a blank with a dash — phrase search without any extra keystrokes. But I am not complaining – Google has done a lot to make searching easy and effective.
I agree that this is a really massive subject for a blog post. However, I agree that Google can’t give away the store. I have no problem with not knowing every detail in how Google ranks pages. I’d be silly to presume that I would be able to understand it all anyway.
Given that Larry and Sergey are founders and major shareholders, I think Google should get at least a little credit for their actions.
With all respect, I still strongly disagree about giving even a little credit — not to L&S — but to Google. You have to look at the motivation of Larry and Sergey, and what was going on at the time. At the time that they actually submitted the paper for publication, they were researchers at a university and funded by NSF money. Their job, their primary product, their deliverable, was publications. So when they published PageRank they did so before they ever started Google the company, and did so because it was the mandate of their academic position and government funding to do so. Not because of any attempt to be more or less transparent. Had they not published the paper, they would have not been properly doing their jobs at that time. Transparency had nothing to do with it.
I hear what you’re saying; they are indeed the founders of Google. But I still do not see how you can give Google credit for something that was a requirement of a pre-Google job. Doesn’t make any sense.
But I think Gregor gets to the heart of this issue, above. What Gregor is saying is that it doesn’t really matter what the exact mathematics of the Google ranking algorithm is. What matters is the end-user penetrability of that algorithm. Is Google transparent enough so that a searcher using the Google system can give Google the right signals, in the right way, at the right time, so as to find the information that is needed? And I think the answer to that is still no. It’s changing, and I do see small HCIR signs here and there. But at the end of the day, that’s the question that really matters.
Of course, making Google more HCIR-transparent will most likely involve exposing, in some fashion or other, more of the Google ranking algorithm. It might not be a raw exposure, but there will still have to be an exposure of some sort in order to give the users more control over their experience.
So the question: How transparent is Google in that regard?
From that perspective, all those papers on MapReduce (which I do give transparency credit for) and PageRank (which I don’t give transparency credit for — it was their job at the time!) don’t matter one way or the other. Because the technical details contained in those papers are still not exposed to the end user in a transparent way. There is no way for the end user to say “Stop using the PageRank popularity signal for my query, because I know that what I am looking for is in the tail!”
Transparency is therefore still at a minimum for what really matters: End user experience.
Fair enough, I’ll concede that Google as a company doesn’t get credit for publishing the paper. But it still is a disclosure that sheds light on the black box, and I think that supports Matt’s general argument.
But you raise an interesting point. I’m with you on the HCIR perspective of focusing on end-user penetrability. And Gregor is a refreshing example of a user who cares about the details of how the search engine works.
But I see far greater concern about the ranking details by site owners than by users, e.g. a lot more literature about how site owners can improve their site ranking than on how users can search more effectively.
According to Christensen’s theory of disruptive technology (at least Wikipedia’s explanation), “a disruptive technology may enter the market and provide a product which has lower performance than the incumbent but which exceeds the requirements of certain segments, thereby gaining a foothold in the market.” If users value transparency, then there should be room for someone to offer a product that lags the incumbent on most measures but offers greater transparency. I believe this is already happening in markets like enterprise search, but I don’t see the evidence for it in web search.
And, to be clear, I’ve advocated for it! But that doesn’t make it so.
“I believe this is already happening in markets like enterprise search, but I don’t see the evidence for it in web search.”
How much of this is chicken/egg? I remember Steve Jobs claiming that there was no evidence that users wanted to watch video on an extremely small screen. Six months later Apple released the video iPod, and now lots of people do.
If users don’t know or don’t understand what is possible, they’ll never ask for it. There has to be some education, first. How much education is Google doing?
“But it still is a disclosure that sheds light on the black box, and I think that supports Matt’s general argument.”
Matt’s general argument isn’t that there are things about Google that are open/known. His argument is that Google the company actively has an agenda to make more things transparent, carries through with that agenda, and that Google should get more credit for making things transparent (the active followthrough) than its critics give it.
Because PageRank was disclosed by NSF-funded grad students doing their job of publishing, and not by Google, it doesn’t support Matt’s argument at all.
I’m sorry that I’m being such a stickler about this point, but I see it quite strongly. What we’re talking about is what Google as a company is actively doing.
For that matter, Cutts writes: “Google has continued to publish literally hundreds of research papers over the years. Those papers reveal many of the “secret formulas” for how Google works”
I’ll ask the question again: “And sure, there are hundreds of papers by Googlers. But how many of those papers were done (primarily) by summer interns, who already came to Google with a solid sense of a research idea? And of the ones that are Google-internal only, how many of those are on production-code systems, vs raw research? Microsoft publishes hundreds of papers as well, the vast majority of which are never incorporated into shipped products. Does that also make Microsoft transparent?”
In other words, how much of this is “transparency through obscurity”? I.e., if there are hundreds of papers with secret formulas in them, and only 7 of those formulas actually make it into production code, and it is never made known which 7 those are, then how can this reasonably be called transparency? If there is so much flak thrown up, then any real disclosure is so obfuscated as to not really be transparency at all.
I do agree with you, Daniel, that it’s not a black and white issue. Google doesn’t get an A+ on transparency, nor do they get an F-. But the picture is also not quite as clean as Cutts paints it, either.
Fair point re: user education. But it’s not just a question of what users see as possible–it’s also a question of what search engines can do, and at what cost. I do think it would be interesting to have an independent search engine that was completely transparent and to see if users who tried it stuck with it. Does Wikia Search (R.I.P.) count as such an experiment?
As for the research papers, I think Google has made a point of promoting what it believes to be the most important ones–and they are infrastructure papers, not search papers per se. But the success of Hadoop that I alluded to earlier is evidence that this transparency isn’t completely obscured.
I’m glad we agree that we’re debating shades of gray. I’d like to strive toward a lighter shade than the current one, but I agree with colleagues that there’s a real risk to users in over-disclosing the details of ranking. I hope to see the day where it’s possible to have relevance without obscurity.
“I do think it would be interesting to have an independent search engine that was completely transparent and to see if users who tried it stuck with it.”
Agreed. Caveat, however: transparency alone wouldn’t be enough. Because not all queries (e.g. navigation) need transparency. So the independent search engine that we’d need to build to run this experiment has to be equal in quality to Google for navigational queries, with the ability to switch to transparent interaction for informational/exploratory queries. Given that requirement, Wikia Search does not count as such an experiment.
“But the success of Hadoop that I alluded to earlier is evidence that this transparency isn’t completely obscured.”
Yup, I agreed to the Hadoop/MapReduce benefits in my comment #8, above. Still, that’s not a part of the system (other than query response time) that the user sees. I’m interested in HCIR.
“I’m glad we agree that we’re debating shades of gray.”
Of course. We’re not Republicans, now, are we? 🙂
“I agree with colleagues that there’s a real risk to users in over-disclosing the details of ranking. I hope to see the day where it’s possible to have relevance without obscurity.”
One day when you can finally, legally talk about it, I’d like to know what it is you’ve learned since joining the Big G that seems to have moved you more toward this risk-averse state. Maybe that’ll have to be in 20 years. But I have sheer academic curiosity about the thinking process that is occurring.
[…] they’ve made an impressive attempt to increase the transparency of relevance ranking. But, as I blogged earlier this year, I think that, at least for the time being, Google is making the right decision to keep some of its […]