
Matt Cutts Lays Down The Law

I was just reading an article on TechCrunch about how New York-based advertising firm MediaWhiz has launched a new product today called InLinks for advertisers who want their sites associated with specific keywords. Those words, when they appear in content, will turn into links to the advertisers’ sites. Basically, they are selling “Google juice”.
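
To make “Google juice” concrete: PageRank, at its core, is a power iteration over the web’s link graph, and a paid in-text link simply adds an edge that channels rank toward the advertiser. Here is a toy sketch (it glosses over dangling nodes and other refinements, and the page names are invented):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank. links maps each page to its outlinks."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            for out in outs:
                # Each outlink passes along a share of the linking page's rank.
                new_rank[out] += damping * rank[page] / len(outs)
        rank = new_rank
    return rank

# A paid in-text link from a well-linked page measurably boosts the target:
web = {"site_a": ["popular_blog"], "site_b": ["popular_blog"],
       "popular_blog": ["advertiser"], "advertiser": []}
print(pagerank(web))
```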

Naturally, Google isn’t impressed. Specifically, Google’s Matt Cutts says:

Google has been very clear that selling such links that pass PageRank is a violation of our quality guidelines. Other search engines have said similar things. The Federal Trade Commission (FTC) has also given unambiguous guidance on this subject in the recent PDF at http://www.ftc.gov/os/2008/03/P064101tech.pdf where they said “Consumers who endorse and recommend products on their blogs or other sites for consideration should do so within the boundaries set forth in the FTC Guides Concerning Use of Endorsements and Testimonials in Advertising and the FTC’s guidance on word of mouth marketing,” as well as “To date, in response to this concern, the FTC has advised that search engines need to disclose clearly and conspicuously if the ranking or other presentation of search results is a function of paid placement, and, similarly, that consumers who are paid to engage in word-of-mouth marketing must disclose that fact to recipients of their messages.”

Cutts also cites regulations in the United Kingdom and European Union concerning misleading trade practices that prohibit or at least discourage what MediaWhiz is doing.

Despite my distaste for Google’s black box approach to relevance, it’s pretty easy to see that Google holds the moral high ground over MediaWhiz in this instance. Relevance may be subjective and socially constructed, but no one wants it to be for sale except the people who can make money on selling it.

Still, it’s interesting that Cutts uses the word “violation” to describe the activity of companies that don’t have a contractual relationship with Google. Granted, he’s talking about violations of guidelines, not breach of contract, but it still sounds pretty legalistic. I wish I could take credit for this observation, but that honor goes to Kenneth Miller, who had this to say in a comment:

“Google has been very clear that selling such links that pass PageRank is a violation of our quality guidelines”. The way Matt Cutts phrased this imparts more authority to Google than was probably meant. At first glance it caught me off guard too – since the word violation seems to imply a breach of agreement. Surely, the internet does not exist at Google’s pleasure, and you do not enter into any contractual relationship with Google upon putting content online. Granted, if Google chooses to ignore your content because of its composition, it may very well be the case that nobody will ever find it. I do find it interesting, however, that the company which has littered the internet with contextual text ads would have the gall to be up in arms about this obvious progression of their original idea. If Google wants to play the censorship game, perhaps they should start with the scores of morally questionable material one can easily find by using their search engine. I mean, so long as you want to talk about what is moral rather than what is for all intents and purposes still legal.

Let’s not blow this out of proportion–Google is not pressing charges against MediaWhiz, and no one is suggesting that Google would have any authority to do so. Even if “Google juice” is as valuable as SEO consultants make it out to be (some debate about that here), it’s certainly not a legal entitlement.

Nonetheless, it’s clear that Google’s decisions as a private entity have a dramatic effect on the link economy. An increasing number of folks see this as a problem, though few seem to be making constructive suggestions.

Here is one: make relevance and authority computation transparent. If the ranking of a search result comes with an audit trail, then there will be less value in gaming the ranking algorithm. Moreover, this transparency would be a first step towards making the ordering of results something controlled by users, rather than the search engine.
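
To make the audit-trail idea concrete, here is a minimal sketch of what a transparent scoring function might look like. The features, weights, and URLs are invented for illustration; no real engine works exactly this way:

```python
from dataclasses import dataclass, field

@dataclass
class ScoredResult:
    url: str
    score: float = 0.0
    # Audit trail: (feature, weight, value, contribution) tuples that
    # explain exactly why the document earned its score.
    audit: list = field(default_factory=list)

# Hypothetical weights -- in a transparent engine these would be published
# (or even user-adjustable) rather than hidden in a black box.
WEIGHTS = {"text_match": 0.5, "link_authority": 0.3, "freshness": 0.2}

def rank(documents):
    """Score documents, recording how each feature contributed."""
    results = []
    for doc in documents:
        result = ScoredResult(url=doc["url"])
        for feature, weight in WEIGHTS.items():
            value = doc["features"].get(feature, 0.0)
            contribution = weight * value
            result.score += contribution
            result.audit.append((feature, weight, value, contribution))
        results.append(result)
    return sorted(results, key=lambda r: r.score, reverse=True)

docs = [
    {"url": "http://example.com/a", "features": {"text_match": 0.9, "link_authority": 0.2}},
    {"url": "http://example.com/b", "features": {"text_match": 0.4, "link_authority": 0.8}},
]
for r in rank(docs):
    print(r.url, round(r.score, 2), r.audit)
```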

I know that what I am proposing isn’t the Googley way. But ultimately it is the only way that we will win the arms race against spammers.


Tweet First, Ask Questions Later

If anyone has any doubt as to the real-world impact of social media, consider the recent battle between Motrin and the “mommy bloggers”. Motrin had released an ad, launched to coincide with international baby-wearing week, that presented an irreverent take on “wearing your baby”. Apparently too irreverent: a critical mass of indignant baby-wearing moms used blogs and Twitter to express their outrage, and Johnson & Johnson quickly pulled the ad and apologized prominently on the Motrin home page.

Let’s not waste time debating the ad (full disclosure: my wife, who is a baby-wearing mom, loved it). The real story here is that social media is a game changer for ad hoc protests. In the past, it might have taken weeks to organize a boycott. Now we see coordinated activism–and results–in hours. This is a big deal, and surely a wake-up call to anyone who still believes that social media are a fad.

But the increasing power of social media also raises concerns about information accountability, an issue I’ve discussed before on this blog. What happens if we use the power of social media to get a message out there and it’s wrong? Sure, it’s possible to recant and even issue public apologies (even South Park style), but extensive research shows the lasting power of a first impression, even in the face of contradictory evidence (this is a form of anchoring bias).

Does the immediacy of social media impose new responsibility on publishers because of the potential for harm? Or should non-professional journalists (aka “citizen journalists”) err on the side of “tweeting first and asking questions later”, letting the professionals take care of sorting out the facts?

The laws regarding defamation tend to favor publishers in the United States, in notable contrast to the corresponding laws in the United Kingdom, which essentially place the burden of proof on the publisher rather than on the offended party.

As an American, I can’t help but wonder if our laws reflect a different time, when publishers were scarce and highly conscious of their reputations. In a day when everyone can be a publisher–and an anonymous one no less–the balance of power to influence the public seems to have changed.

Overall, I see this power as a good thing, a triumph of democracy. Nonetheless, with great power comes great responsibility. Perhaps it’s best for the laws to err on the side of protecting free expression rather than protecting people from the harm that can be caused by malicious or reckless expression. That is the American ideal–to promote freedom of expression while recognizing that it comes at a price.

But it would be nice to see publishers, whether professional or amateur, adjust to the new power of the virtual pen. Now that we have power, let’s show the world that we can use it responsibly.


The Difference Between Google and Yahoo

Today, Yahoo announces a new tool that provides keywords describing search results, which it makes available to developers through its public BOSS API.

Meanwhile Google announces a new tool to tell you what keywords you should be paying Google to use in your AdWords campaign to advertise your site.

I imagine that the technology behind both tools isn’t all that different–or at least doesn’t have to be. But, while Yahoo makes friends in the technology community (especially among researchers), Google makes friends in the advertising community–and makes itself oodles of money. It’s all good to have friends, but someone has to pay the bills.
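
As a rough illustration of why the core technology could be shared: something as simple as TF-IDF term scoring could feed either a developer-facing keyword API or an advertiser-facing keyword suggester. This sketch is pure assumption on my part, not a description of either company’s actual system:

```python
import math
from collections import Counter

def top_keywords(doc_tokens, corpus, k=5):
    """Rank a document's terms by TF-IDF against a background corpus.

    Whether the output goes to a developer (BOSS style) or to an
    advertiser (AdWords style) is a business decision, not a technical
    one -- the underlying term scoring can be identical.
    """
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log(len(corpus) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = [{"flu", "symptoms", "aches"}, {"vacation", "flights"}, {"flu", "shot"}]
print(top_keywords(["flu", "symptoms", "flu", "remedies"], corpus))
```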


Faceted Search for the Web: A Grand Challenge?

At the HCIR workshop last month, one of the posters was from Microsoft researchers Jaime Teevan, Susan Dumais, and Zachary Gutt, entitled “Challenges for Supporting Faceted Search in Large, Heterogeneous Corpora like the Web”.

From the abstract:

Those [challenges] that we have identified stem from the fact that such datasets are 1) very large, making it difficult to assign quality meta-data to every document and to retrieve the full set of results and associated metadata at query time, and 2) heterogeneous, making it difficult to apply the same metadata to every result or every query.

Drilling further into their position paper reveals three challenges:

  • A lack of good, automatically generated metadata.
  • Uncertainty as to which facets will be most valuable for particular information needs.
  • The cost of dynamically computing facet distributions for result sets.

While these are all serious challenges, I feel that the position paper overstates them. Not all of the metadata needs to be generated automatically, and in any case there are lots of opportunities to crowd-source metadata creation. Facet selection is more challenging, but the work we’ve done at Endeca on query clarification suggests that this problem is tractable. Finally, the computational challenge strikes me as an artifact of today’s ranked retrieval systems for web search, which are a bad fit for what is essentially a set retrieval problem.
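
On that last point: in a set retrieval engine, computing facet distributions is essentially one counting pass over the result set. A minimal sketch, with an invented schema:

```python
from collections import Counter

def facet_counts(result_ids, metadata, facets):
    """Compute the value distribution of each facet over a result set.

    metadata maps doc id -> {facet: value}. In a set retrieval engine
    this is a single pass over the results; the hard part at web scale
    is materializing the full result set in the first place.
    """
    counts = {facet: Counter() for facet in facets}
    for doc_id in result_ids:
        for facet in facets:
            value = metadata.get(doc_id, {}).get(facet)
            if value is not None:
                counts[facet][value] += 1
    return counts

metadata = {
    1: {"type": "blog", "language": "en"},
    2: {"type": "news", "language": "en"},
    3: {"type": "blog", "language": "de"},
}
print(facet_counts({1, 2, 3}, metadata, ["type", "language"]))
```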

I’m not saying that this problem isn’t hard. In fact, I think that the authors neglect the biggest challenge, which is the adversarial nature of web search. Arguably this problem is an implicit (but unstated) aspect of the metadata problem–there will be too much of an incentive to game it.

Nonetheless, I think the time is ripe to consider faceted search approaches for large, heterogeneous corpora like the web. And perhaps we can work around the adversarial model while we’re at it. But that’s the subject for another post.


Recommending Diversity

Another nice post from Daniel Lemire today, this time about a paper by Mi Zhang and Neil Hurley on “Avoiding monotony: improving the diversity of recommendation lists” (ACM Digital Library subscription required to see full text).

Here’s an abstract of the abstract:

Noting that the retrieval of a set of items matching a user query is a common problem across many applications of information retrieval, we model the competing goals of maximizing the diversity of the retrieved list while maintaining adequate similarity to the user query as a binary optimization problem.

It’s nice to see a similarity vs. diversity trade-off for recommendations analogous to the precision vs. recall trade-off for typical information retrieval evaluation.

Our experience at Endeca is certainly that most of the approaches out there underemphasize diversity, which not only leads to the “monotony” problem but also breaks down when the query does not unambiguously express the user’s intent. Since our approach emphasizes interaction, we leverage the diversity of the options we present to maximize the opportunity for users to make progress in satisfying their information needs.
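
For readers who want something concrete, the classic greedy maximal-marginal-relevance (MMR) re-ranker–not Zhang and Hurley’s binary-optimization formulation, and not Endeca’s approach, just a familiar baseline–exposes the similarity vs. diversity trade-off as a single knob:

```python
def mmr(query_sim, item_sim, candidates, k, lam=0.5):
    """Greedy maximal-marginal-relevance re-ranking.

    query_sim[i]   -- similarity of item i to the query
    item_sim[i][j] -- similarity between items i and j
    lam            -- 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < k:
        def marginal(i):
            # Penalize items too similar to anything already chosen.
            redundancy = max((item_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected

query_sim = {0: 0.9, 1: 0.85, 2: 0.4}
item_sim = {0: {1: 0.95, 2: 0.1}, 1: {0: 0.95, 2: 0.2}, 2: {0: 0.1, 1: 0.2}}
print(mmr(query_sim, item_sim, [0, 1, 2], k=2))  # [0, 2]: near-duplicate 1 is suppressed
```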

I would like to second Daniel Lemire’s suggestion to perform user studies to investigate the optimal balance between diversity and accuracy. They’d make for great papers. Just remember to send him (and me!) copies!


Symposium on Semantic Knowledge Discovery, Organization and Use, Day 2

Day 2 of the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU brought out representatives of the titans of web search:

  • Yahoo: Patrick Pantel (who actually just joined Yahoo) regaled us with entertaining tales “Of Search and Semantics”, whisking us through the history of search, arguing that semantics are making a commercial impact today, and then describing some current research at Yahoo. He elaborated a fair amount on SeeLEx, which stands for seed list expansion and is, in their own words, Yahoo’s version of Google Sets. But, unlike Google Sets, SeeLEx offers at least some transparency into the basis for similarity. I couldn’t find anything published other than the slides in my notebook from the symposium, but this is interesting work.
     
  • Google: Marius Pasca delivered an excellent talk, though its title of “Web Search as an Online Word Game for Knowledge Discovery” was a bit misleading. Unfortunately, neither of the Google speakers provided slides for the symposium notebook–I hope that isn’t a mandate from Google’s corporate policy. In any case, the talk was along the lines of his AAAI 2008 paper, “Turning Web Text and Search Queries into Factual Knowledge: Hierarchical Class Attribute Extraction”. It presented an intriguing approach for inferring class attributes based on a distributional analysis of query space–an approach that reminded me of his earlier CIKM paper on “Acquisition of categorized named entities for web search”. Unfortunately, neither of these papers is available without the appropriate memberships, but perhaps Marius will send them to you if you ask nicely.
     
  • Microsoft: Bill Dolan, Principal Researcher and manager of MSR’s Natural Language Processing Group, asked “Where does NLP stop and AI Begin?” He took us through the history of the MindNet project, and ruefully explained that it “worked beautifully” but “just not very often”. He made a compelling argument that semantic knowledge discovery researchers need to take a step back from AI-hard problems and focus on problems like paraphrase that are more amenable to the kind of progress we’ve seen in areas like machine translation.

Their presentations were followed by a three-hour (!) poster / demo session. As one of the demo presenters, I had 90 seconds in the poster boaster session to pitch my demo, and then attendees could wander around the posters and demos during the two-hour lunch break. I had some great conversations with the folks who did swing by, but I’m not sure this was the ideal format for conducting such a session.

The session after the demos was a bit of a blur for me, but the final discussion session was very engaging. One of the hot topics in the semantic knowledge community, much as it is in the information retrieval community, is the need for query logs–and, more generally, for good data–to conduct research. Having representatives of the three major web search engines there certainly made the conversation more interesting.

All in all, an excellent symposium, and I’m very grateful to Satoshi Sekine for organizing it.

Google Flu Trends: The Privacy Backlash Begins

Don’t say I didn’t tell you so. Declan McCullagh reports at CNET that privacy groups are expressing concern about Google Flu Trends:

The Electronic Privacy Information Center and Patient Privacy Rights sent a letter this week to Google CEO Eric Schmidt saying if the records are “disclosed and linked to a particular user, there could be adverse consequences for education, employment, insurance, and even travel.” It asks for more disclosure about how Google Flu Trends protects privacy.

I agree with Declan that

If you think that knowing that Alaska’s “influenza-like illness” number for the week of November 9 is 2.035 and California’s number is 1.384 is somehow worrisome and can identify you personally, it’s time to break out your tinfoil hat.

But, as the article goes on to discuss, the deeper concern is the one expressed by the Electronic Privacy Information Center:

There are no clear legal or technological privacy safeguards that prevent the disclosure of individual search histories. Without such privacy safeguards Google Flu Trends could be used to reidentify users who search for medical information. Such user-specific investigations could be compelled, even over Google’s objection, by court order or presidential authority.

I’m not paranoid, and I actually think that both privacy advocates and web search companies have often exaggerated privacy issues, especially since the AOL fiasco a couple of years ago. But EPIC is raising a legitimate concern here, and Google doesn’t seem to be providing very reassuring answers.

Specifically, web search companies are very protective of their log data in the name of privacy, much to the chagrin of researchers. And yet those same companies feel that privacy advocates exaggerate their concerns about the data being collected in the first place. Google / Yahoo / Microsoft: you can’t have it both ways!

A final point: Declan comments that “If users don’t like that, nobody’s forcing them to use Google.” I chalk that attitude up to his libertarianism rather than to any partiality he may have towards his wife’s employer. I have a libertarian streak myself, so I’m sympathetic. But, just as a practical matter, this is the sort of behavior that historically attracts regulation. Google and its rivals would do well to acknowledge the legitimacy of their critics’ concerns and regulate themselves first.


Symposium on Semantic Knowledge Discovery, Organization and Use, Day 1

Today was the first day of the two-day NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU.

Here are some highlights:

  • Marti Hearst started us off with a discussion of tricks for statistical semantic knowledge discovery–namely, using “lots o’ text”, unambiguous cues, and “rewrite and verify”.
     
  • Dekang Lin showed off the power of “lots o’ text”, demonstrating how the Google n-gram data could be used to perform various semantic discovery tasks.
     
  • Peter Turney argued that we need to combine symbolic representations for episodic information (i.e., what we obtain from information extraction) with spatial (i.e., vector space model) representations for semantic information.

There were a bunch of other talks that focused on the details of building and using semantic knowledge bases, but I’ll freely admit that I’m a bit of an outsider in this world. Nonetheless, I find the participation impressive in both quality and quantity.

I’ll post more notes tomorrow. And, if you’re in New York, I encourage you to attend tomorrow. They are letting people walk in, even if they haven’t registered in advance.

Big Google can be Benign

An article in today’s New York Times reports on Google Flu Trends, which aspires to detect regional outbreaks of the flu before they are reported by the Centers for Disease Control and Prevention. As reported in the article:

Google Flu Trends is based on the simple idea that people who are feeling sick will probably turn to the Web for information, typing things like “flu symptoms” or “muscle aches” into Google. The service tracks such queries and charts their ebb and flow, broken down by regions and states.

It’s a clever idea, though obviously it raises privacy concerns. Google mitigates those concerns by “relying only on aggregated data that cannot be used to identify individual searchers.”
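
As a sketch of what that kind of aggregation might look like–the term list, normalization, and regions here are invented, since Google hasn’t published its methodology–only per-region rates survive the pass over the log:

```python
from collections import defaultdict

# Illustrative only: a handful of flu-related search terms.
FLU_TERMS = {"flu symptoms", "muscle aches", "fever", "flu remedies"}

def regional_flu_index(query_log):
    """Aggregate (region, query) pairs into per-region flu-query rates.

    No individual query or user identifier survives the aggregation;
    only the rate per region leaves this function.
    """
    flu_counts = defaultdict(int)
    totals = defaultdict(int)
    for region, query in query_log:
        totals[region] += 1
        if query.lower() in FLU_TERMS:
            flu_counts[region] += 1
    return {region: flu_counts[region] / totals[region] for region in totals}

log = [("AK", "flu symptoms"), ("AK", "pizza near me"),
       ("CA", "fever"), ("CA", "flights to nyc"), ("CA", "weather")]
print(regional_flu_index(log))  # {'AK': 0.5, 'CA': 0.333...}
```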

It will be interesting to see popular reaction to this offering in the United States and in more privacy-conscious Western Europe. On one hand, health-related search logs are the bête noire of privacy activists–and with good reason, since people are terrified of losing their health insurance. On the other hand, Google seems to have only the best intentions here, and the service they provide may do a lot of good.

I personally hope we can see efforts like these succeed. Of course, it’s essential that Google and anyone else who pursues such efforts  be transparent about what data they collect and how they protect individuals from inadvertent disclosure. Ideally, they don’t collect more data than is needed–especially when that data is dangerous in the wrong hands.

Though I have to wonder, might anyone try to game such a system? Maybe I have an over-active imagination, but systems like these seem to be ripe targets for denial-of-insight attacks. Whit, another one for your files?


Should We Build Task-Centric Search Engines?

Greg blogged today about a video of a DEMOfall08 panel on “Where the Web is Going” where Peter Norvig from Google and Prabhakar Raghavan from Yahoo both advocated that, rather than supporting only one search at a time, search engines should focus on helping people accomplish larger tasks, such as booking a vacation or finding a job. I won’t be so vain as to assume they read my blog, as these are canonical examples of tasks that current search engines don’t address very well.

I do see value in building task-specific applications that encapsulate the process of accomplishing particular classes of tasks–including any information seeking necessary towards that end. But I’m not convinced that such applications live inside search engines. Rather, I think that a search engine (if that is still the right word for it) should be adaptive enough for task-centric applications to leverage it as a tool.

Perhaps it’s natural that leading researchers from Google and Yahoo have a search-centric view of the world. Given my daily work, I sometimes lapse into that view myself. But it’s important for us to realize that search–or, more broadly speaking, information seeking–is a means to an end. At least in the future I envision, information seeking support tools will be so well embedded in task-centric applications that we will almost never be conscious of information seeking as a distinct activity.