The Difference Between Google and Yahoo

Today, Yahoo announces a new tool that provides keywords describing search results, made available to developers through its public BOSS API.

Meanwhile Google announces a new tool to tell you what keywords you should be paying Google to use in your AdWords campaign to advertise your site.

I imagine that the technology behind both tools isn’t all that different–or at least doesn’t have to be. But, while Yahoo makes friends in the technology community (especially among researchers), Google makes friends in the advertising community–and makes itself oodles of money. It’s all good to have friends, but someone has to pay the bills.

Yahoo BOSS, Now With Key Terms

I’d just hit “publish” on my last post about the challenges of faceted search for the web when I saw this post from Jeff reporting that Yahoo has announced an extension to their public BOSS API that provides “key terms” for search results.

Jeff quotes this excerpt from their description:

Key Terms is derived from a Yahoo! Search capability we refer to internally as “Prisma.”… Key Terms is an ordered terminological representation of what a document is about. The ordering of terms is based on each term’s frequency and its positional and contextual heuristics…Each result contains up to 20 terms describing the document.

Yes, I know, key terms aren’t a faceted classification. And I don’t know what quality or consistency this feature provides. Still, it’s a step towards addressing the first and most serious challenge raised in the Microsoft researchers’ position paper. And it’s nice to see news about Yahoo beyond the saturation coverage of Jerry Yang stepping down.
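
For the curious, here is roughly what pulling key terms out of BOSS might look like in Python. Treat it as a minimal sketch rather than working sample code: I am assuming the v1 web search endpoint, a keyterms view parameter, and the response field names based on Yahoo’s announcement, and the app id is a placeholder, so check the BOSS documentation for the authoritative names.

    import json
    import urllib.parse
    import urllib.request

    # Placeholder credentials and query -- substitute your own BOSS app id.
    APP_ID = "YOUR_BOSS_APP_ID"
    QUERY = "faceted search"

    # Assumption: the v1 web endpoint accepts a keyterms view, as described
    # in Yahoo's announcement; the exact parameter names may differ.
    url = (
        "http://boss.yahooapis.com/ysearch/web/v1/"
        + urllib.parse.quote(QUERY)
        + "?"
        + urllib.parse.urlencode(
            {"appid": APP_ID, "view": "keyterms", "format": "json"}
        )
    )

    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    # Assumption: each result carries a list of up to 20 key terms.
    for result in data["ysearchresponse"]["resultset_web"]:
        print(result["title"])
        print("  key terms:", ", ".join(result.get("keyterms", [])))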

Faceted Search for the Web: A Grand Challenge?

At the HCIR workshop last month, one of the posters was from Microsoft researchers Jaime Teevan, Susan Dumais, and Zachary Gutt, entitled “Challenges for Supporting Faceted Search in Large, Heterogeneous Corpora like the Web”.

From the abstract:

Those [challenges] that we have identified stem from the fact that such datasets are 1) very large, making it difficult to assign quality meta-data to every document and to retrieve the full set of results and associated metadata at query time, and 2) heterogeneous, making it difficult to apply the same metadata to every result or every query.

Drilling further into their position paper reveals three challenges:

  • A lack of good, automatically generated metadata.
  • Uncertainty as to which facets will be most valuable for particular information needs.
  • The cost of dynamically computing facet distributions for result sets.

While these are all serious challenges, I feel that the position paper overstates them. Not all of the metadata needs to be generated automatically, and in any case there are lots of opportunities to crowd-source metadata creation. Facet selection is more challenging, but the work we’ve done at Endeca on query clarification suggests that this problem is tractable. Finally, the computational challenge strikes me as an artifact of today’s ranked retrieval systems for web search, which are a bad fit for what is essentially a set retrieval problem.
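
To make that last point concrete, here is a toy illustration in Python of what computing facet distributions over a result set involves. It’s a sketch over a handful of in-memory documents with made-up metadata, not how Endeca or any web engine actually does it, but it shows why the problem feels natural once you frame it as set retrieval: given the full matching set, the facet counts are a single pass over its metadata.

    from collections import Counter, defaultdict

    # Toy corpus: each document carries some (facet, value) metadata.
    documents = [
        {"text": "faceted search for the web",
         "facets": {"type": "paper", "year": "2008"}},
        {"text": "web search ranking",
         "facets": {"type": "blog", "year": "2008"}},
        {"text": "faceted navigation for ecommerce search",
         "facets": {"type": "paper", "year": "2007"}},
    ]

    def matching_set(query_terms):
        """Set retrieval: every document containing all of the query terms."""
        return [doc for doc in documents
                if all(term in doc["text"].split() for term in query_terms)]

    def facet_distributions(results):
        """Count, for each facet, how often each value occurs in the result set."""
        counts = defaultdict(Counter)
        for doc in results:
            for facet, value in doc["facets"].items():
                counts[facet][value] += 1
        return counts

    results = matching_set(["search"])
    for facet, values in facet_distributions(results).items():
        print(facet, dict(values))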

I’m not saying that this problem isn’t hard. In fact, I think that the authors neglect the biggest challenge, which is the adversarial nature of web search. Arguably this problem is an implicit (but unstated) aspect of the metadata problem–there will be too much of an incentive to game it.

Nonetheless, I think the time is ripe to consider faceted search approaches for large, heterogeneous corpora like the web. And perhaps we can work around the adversarial model while we’re at it. But that’s the subject for another post.

Reporting from the Anti-Spam Front

Surprisingly good reporting in the mainstream media (specifically, the Washington Post) about how shutting down McColo reduced worldwide spam volume by 65%.

Here’s a choice quote about why spammers prefer to host their servers in the United States:

What’s more, dependability and server uptime are important in cutthroat businesses for which an outage of a few hours can staunch the flow of spam and cost thousands of dollars. 

Hey, business is business, even for evil spammers.

Blogging…Now 99.6% Safer Than Surfing!

OK, this is an oldie but goodie from xkcd, but I saw it in a recent presentation and couldn’t resist sharing.

Of course, you’d never know that from the sensationalist press.

Recommending Diversity

Another nice post from Daniel Lemire today, this time about a paper by Mi Zhang and Neil Hurley on “Avoiding monotony: improving the diversity of recommendation lists” (ACM Digital Library subscription required to see full text).

Here’s an abstract of the abstract:

Noting that the retrieval of a set of items matching a user query is a common problem across many applications of information retrieval, we model the competing goals of maximizing the diversity of the retrieved list while maintaining adequate similarity to the user query as a binary optimization problem.

It’s nice to see a similarity vs. diversity trade-off for recommendations analogous to the precision vs. recall trade-off in typical information retrieval evaluation.

Our experience at Endeca is certainly that most of the approaches out there underemphasize diversity, which not only leads to the “monotony” problem but also breaks down when the query does not unambiguously express the user’s intent. Since our approach emphasizes interaction, we leverage the diversity of the options we present to maximize the opportunity for users to make progress in satisfying their information needs.
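
To make the trade-off concrete, here is a small Python sketch of the kind of greedy re-ranking that balances relevance against diversity. To be clear, this is not Zhang and Hurley’s binary optimization formulation, nor Endeca’s approach; it’s the familiar maximal-marginal-relevance idea, with a lambda parameter playing the role of the similarity vs. diversity dial.

    def diversify(candidates, relevance, similarity, k, lam=0.7):
        """Greedily pick k items, each time choosing the one that best balances
        relevance to the query against similarity to items already chosen.
        lam=1.0 is pure relevance ranking; lam=0.0 is pure diversity."""
        selected = []
        remaining = list(candidates)
        while remaining and len(selected) < k:
            def score(item):
                redundancy = max((similarity(item, s) for s in selected), default=0.0)
                return lam * relevance(item) - (1 - lam) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy example: items with a relevance score and a topic; items that share
    # a topic are treated as redundant.
    items = {"a": (0.90, "jazz"), "b": (0.85, "jazz"),
             "c": (0.60, "classical"), "d": (0.50, "rock")}
    picked = diversify(
        items,
        relevance=lambda x: items[x][0],
        similarity=lambda x, y: 1.0 if items[x][1] == items[y][1] else 0.0,
        k=3,
    )
    print(picked)  # covers jazz, classical, and rock rather than two jazz items

Slide lam toward 1.0 and you get back exactly the monotony the paper is trying to avoid.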

I would like to second Daniel Lemire’s suggestion to perform user studies to investigate the optimal balance between diversity and accuracy. They’d make for great papers. Just remember to send him (and me!) copies!

Learning by Analogy

Thanks to Daniel Lemire for pointing to this recent paper by Peter Turney on “A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations”.

Daniel Lemire considers this paper an example of the “more data beats better algorithms” principle most famously espoused by Google Director of Research Peter Norvig.

My take is a bit different. One message I heard repeatedly at the recent NSF Symposium on Semantic Knowledge Discovery, Organization and Use is that semantic researchers need to reduce their problem space to make progress. Peter is doing exactly that in his own work by taking what are perceived as distinct problems and generalizing them in order to treat them uniformly. Perhaps the broader community could profit from his approach and, um, learn by analogy.
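
A toy Python sketch of that uniformity idea follows. It is emphatically not Turney’s algorithm (which learns from pattern statistics gathered over a large corpus); the pattern counts below are invented. The point is only that once a word pair is represented by the contextual patterns connecting its words, analogies, synonyms, antonyms, and associations all reduce to the same question: how similar are two pair representations?

    from math import sqrt

    # Invented pattern counts: how often each connecting pattern links a word
    # pair in some corpus. In Turney's setting these come from real text.
    pair_patterns = {
        ("mason", "stone"):    {"X works with Y": 5, "X carves Y": 3},
        ("carpenter", "wood"): {"X works with Y": 6, "X carves Y": 2},
        ("hot", "cold"):       {"X, the opposite of Y": 7},
        ("large", "small"):    {"X, the opposite of Y": 4, "X rather than Y": 2},
    }

    def cosine(u, v):
        """Cosine similarity between two sparse pattern-count vectors."""
        dot = sum(count * v.get(pattern, 0) for pattern, count in u.items())
        norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
        return dot / norm if norm else 0.0

    def pair_similarity(p, q):
        return cosine(pair_patterns[p], pair_patterns[q])

    # One similarity function answers an analogy question...
    print(pair_similarity(("mason", "stone"), ("carpenter", "wood")))
    # ...an antonymy question...
    print(pair_similarity(("large", "small"), ("hot", "cold")))
    # ...and tells us when two relations have nothing in common.
    print(pair_similarity(("mason", "stone"), ("hot", "cold")))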

Symposium on Semantic Knowledge Discovery, Organization and Use, Day 2

Day 2 of the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU brought out representatives of the titans of web search:

  • Yahoo: Patrick Pantel (who actually just joined Yahoo) regaled us with entertaining tales “Of Search and Semantics”, whisking us through the history of search, arguing that semantics are making a commercial impact today, and then describing some current research at Yahoo. He elaborated a fair amount on SeeLEx, which stands for seed list expansion and is, in his own words, Yahoo’s version of Google Sets. But, unlike Google Sets, SeeLEx offers at least some transparency into the basis for similarity (a rough sketch of the general idea follows this list). I couldn’t find anything published other than the slides in my notebook from the symposium, but this is interesting work.
     
  • Google: Marius Pasca delivered an excellent talk, though its title of “Web Search as an Online Word Game for Knowledge Discovery” was a bit misleading. Unfortunately, neither of the Google speakers provided slides for the symposium notebook; I hope that isn’t a mandate from Google’s corporate policy. In any case, the talk was along the lines of his AAAI 2008 paper, “Turning Web Text and Search Queries into Factual Knowledge: Hierarchical Class Attribute Extraction”. It presented an intriguing approach for inferring class attributes based on a distributional analysis of query space, an approach that reminded me of his earlier CIKM paper on “Acquisition of categorized named entities for web search”. Neither of these papers is available without the appropriate memberships, but perhaps Marius will send them to you if you ask nicely.
     
  • Microsoft: Bill Dolan, Principal Researcher and manager of MSR’s Natural Language Processing Group, asked “Where does NLP stop and AI Begin?” He took us through the history of the MindNet project, and ruefully explained that it “worked beautifully” but “just not very often”. He made a compelling argument that semantic knowledge discovery researchers need to take a step back from AI-hard problems and focus on problems like paraphrase that are more amenable to the kind of progress we’ve seen in areas like machine translation.
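
As promised above, here is my guess in Python at the general shape of seed list expansion. Nothing about SeeLEx seems to be published, so this is not Yahoo’s method, just the textbook idea: score every candidate term by how much its distributional features overlap with the seeds’, and let those shared features double as the transparency Patrick described. The feature sets below are invented.

    # Toy seed-list expansion: rank candidates by how many distributional
    # features they share with the seed set. The data is made up, and this
    # is a guess at the general technique, not a description of SeeLEx.
    candidate_features = {
        "boston":  {"city of", "mayor of", "flights to"},
        "chicago": {"city of", "mayor of", "flights to", "band named"},
        "seattle": {"city of", "mayor of"},
        "denver":  {"city of", "flights to"},
        "python":  {"programming in", "snake species"},
    }

    def expand(seeds, k=3):
        """Return up to k non-seed terms, each with the shared features that
        explain why it was suggested."""
        seed_features = set().union(*(candidate_features[s] for s in seeds))
        scored = []
        for term, features in candidate_features.items():
            if term in seeds:
                continue
            shared = features & seed_features
            if shared:
                scored.append((len(shared) / len(features), term, shared))
        scored.sort(key=lambda entry: entry[0], reverse=True)
        return [(term, shared) for _, term, shared in scored[:k]]

    for term, why in expand({"boston", "chicago"}):
        print(term, "because it shares", sorted(why))
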
Their presentations were followed by a three-hour (!) poster / demo session. As one of the demo presenters, I had 90 seconds in the poster boaster session to pitch my demo, and then attendees could wander around the posters and demos during the two-hour lunch break. I had some great conversations with the folks who did swing by, but I’m not sure this was the ideal format for conducting such a session.

The session after the demos was a bit of a blur for me, but the final discussion session was very engaging. One of the hot topics in the semantic knowledge community, much as it is in the information retrieval community, is the need for query logs (and, more generally, for good data) to conduct research. Having representatives of the three major web search engines there certainly made the conversation more interesting.

All in all, an excellent symposium, and I’m very grateful to Satoshi Sekine for organizing it.

Fight the Spammers that Be!

Misery loves company, so I’m reassured to know that I’m not the only recent victim of the uptick in spam. Matt Hurst reports that Political Streams has recently been hit with a lot of LiveJournal spam. A current look at the site shows that the problem persists, at least for blogs.

Matt says that a “minor modification to our spam filter should take care of it.” I hope so. But it’s clear that, as social media become increasingly important, spammers are taking note.

p.s. For folks too young to remember the 80s, the title is a play on this Public Enemy song.

Google Flu Trends: The Privacy Backlash Begins

Don’t say I didn’t tell you so. Declan McCullagh reports at CNET that privacy groups are expressing concern about Google Flu Trends:

The Electronic Privacy Information Center and Patient Privacy Rights sent a letter this week to Google CEO Eric Schmidt saying if the records are “disclosed and linked to a particular user, there could be adverse consequences for education, employment, insurance, and even travel.” It asks for more disclosure about how Google Flu Trends protects privacy.

I agree with Declan that

If you think that knowing that Alaska’s “influenza-like illness” number for the week of November 9 is 2.035 and California’s number is 1.384 is somehow worrisome and can identify you personally, it’s time to break out your tinfoil hat.

But, as the article goes on to note, the deeper concern is the one expressed by the Electronic Privacy Information Center:

There are no clear legal or technological privacy safeguards that prevent the disclosure of individual search histories. Without such privacy safeguards Google Flu Trends could be used to reidentify users who search for medical information. Such user-specific investigations could be compelled, even over Google’s objection, by court order or presidential authority.

I’m not paranoid, and I actually think that both privacy advocates and web search companies have often exaggerated privacy issues, especially since the AOL fiasco a couple of years ago. But what EPIC is raising here is a legitimate concern, and Google doesn’t seem to be providing very reassuring answers.

Specifically, web search companies are very protective of their log data in the name of privacy, much to the chagrin of researchers. And yet those same companies feel that privacy advocates exaggerate their concerns about the data being collected in the first place. Google / Yahoo / Microsoft: you can’t have it both ways!

A final point: Declan comments that “If users don’t like that, nobody’s forcing them to use Google.” I chalk that attitude up to his libertarianism rather than to any partiality he may have towards his wife’s employer. I have a libertarian streak myself, so I’m sympathetic. But, just as a practical matter, this is the sort of behavior that historically attracts regulation. Google and its rivals would do well to acknowledge the legitimacy of their critics’ concerns and regulate themselves first.