Categories
General

Faceted Search for the Web: A Grand Challenge?

At the HCIR workshop last month, one of the posters was from Microsoft researchers Jaime Teevan, Susan Dumais, and Zachary Gutt, entitled “Challenges for Supporting Faceted Search in Large, Heterogeneous Corpora like the Web”.

From the abstract:

Those [challenges] that we have identified stem from the fact that such datasets are 1) very large, making it difficult to assign quality meta-data to every document and to retrieve the full set of results and associated metadata at query time, and 2) heterogeneous, making it difficult to apply the same metadata to every result or every query.

Drilling further into their position paper reveals three challenges:

  • A lack of good, automatically generated metadata.
  • Uncertainty as to which facets will be most valuable for particular information needs.
  • The cost of dynamically computing facet distributions for result sets.

While these are all serious challenges, I feel that the position paper overstates them. Not all of the metadata needs to be generated automatically, and in any case there are lots of opportunities to crowd-source metadata creation. Facet selection is more challenging, but the work we’ve done at Endeca on query clarification suggests that this problem is tractable. Finally, the computational challenge strikes me as an artifact of today’s ranked retrieval systems for web search, which are a bad fit for what is essentially a set retrieval problem.
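To make the third challenge concrete, here is a minimal Python sketch of what "computing facet distributions for a result set" means. It is only an illustration: the function name `facet_distributions`, the facet names, and the toy documents are all hypothetical, and nothing here reflects how the paper's authors or any search engine actually does it.

```python
from collections import Counter, defaultdict


def facet_distributions(results, facets):
    """Count how many documents in the result set carry each facet value.

    `results` is a list of dicts (hypothetical document records) and
    `facets` is the list of metadata fields to summarize.
    """
    counts = defaultdict(Counter)
    for doc in results:
        for facet in facets:
            value = doc.get(facet)
            # Heterogeneous corpora: many documents simply lack a given field.
            if value is not None:
                counts[facet][value] += 1
    return counts


# Toy result set; note that document "C" has no "year" metadata.
docs = [
    {"title": "A", "type": "paper", "year": 2008},
    {"title": "B", "type": "blog", "year": 2008},
    {"title": "C", "type": "paper"},
]
dist = facet_distributions(docs, ["type", "year"])
```

The loop touches every document in the result set once per facet. That is trivial for three documents, but it hints at why the counts are expensive when the full result set has millions of members and the engine is built to return only the top few ranked hits.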

I’m not saying that this problem isn’t hard. In fact, I think that the authors neglect the biggest challenge, which is the adversarial nature of web search. Arguably this problem is an implicit (but unstated) aspect of the metadata problem–there will be too much of an incentive to game it.

Nonetheless, I think the time is ripe to consider faceted search approaches for large, heterogeneous corpora like the web. And perhaps we can work around the adversarial model while we’re at it. But that’s the subject for another post.

Categories
Uncategorized

Reporting from the Anti-Spam Front

Surprisingly good reporting in the mainstream media (specifically, the Washington Post) about how shutting down McColo reduced worldwide spam volume by 65%.

Here’s a choice quote about why spammers prefer to host their servers in the United States:

What’s more, dependability and server uptime are important in cutthroat businesses for which an outage of a few hours can staunch the flow of spam and cost thousands of dollars. 

Hey, business is business, even for evil spammers.

Categories
Uncategorized

Blogging…Now 99.6% Safer Than Surfing!

OK, this is an oldie but goodie from xkcd, but I saw it in a recent presentation and couldn’t resist sharing.

Of course, you’d never know that from the sensationalist press.

Categories
General

Recommending Diversity

Another nice post from Daniel Lemire today, this time about a paper by Mi Zhang and Neil Hurley on “Avoiding monotony: improving the diversity of recommendation lists” (ACM Digital Library subscription required to see full text).

Here’s an abstract of the abstract:

Noting that the retrieval of a set of items matching a user query is a common problem across many applications of information retrieval, we model the competing goals of maximizing the diversity of the retrieved list while maintaining adequate similarity to the user query as a binary optimization problem.

It’s nice to see a similarity vs. diversity trade-off for recommendations analogous to the precision vs. recall trade-off for typical information retrieval evaluation.
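For readers who want to see the trade-off in action, here is a small Python sketch. To be clear, this is not Zhang and Hurley's binary optimization formulation. It is a simple greedy heuristic in the spirit of maximal marginal relevance, and the names `jaccard`, `diversify`, and `trade_off`, along with the toy items and scores, are my own assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversify(candidates, relevance, k, trade_off=0.5):
    """Greedily pick k items, balancing relevance against redundancy.

    candidates: dict mapping item -> set of features (hypothetical data)
    relevance:  dict mapping item -> similarity to the query, in [0, 1]
    trade_off:  1.0 means pure relevance; lower values reward diversity
    """
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < k:
        def score(item):
            # Redundancy = similarity to the closest item already chosen.
            redundancy = max(
                (jaccard(candidates[item], candidates[s]) for s in selected),
                default=0.0,
            )
            return trade_off * relevance[item] - (1 - trade_off) * redundancy

        # Sort first so that ties are broken deterministically by name.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


items = {
    "jazz_a": {"jazz", "sax"},
    "jazz_b": {"jazz", "sax"},
    "rock_a": {"rock", "guitar"},
}
relevance = {"jazz_a": 0.9, "jazz_b": 0.85, "rock_a": 0.6}

by_relevance = diversify(items, relevance, 2, trade_off=1.0)
balanced = diversify(items, relevance, 2, trade_off=0.5)
```

With `trade_off=1.0` the list is the two near-identical jazz items. With `trade_off=0.5` the second slot goes to the less relevant but different rock item, which is the "monotony" fix in miniature.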

Our experience at Endeca is certainly that most of the approaches out there underemphasize diversity, which not only leads to the “monotony” problem but also breaks down when the query does not unambiguously express the user’s intent. Since our approach emphasizes interaction, we leverage the diversity of the options we present to maximize the opportunity for users to make progress in satisfying their information needs.

I would like to second Daniel Lemire’s suggestion to perform user studies to investigate the optimal balance between diversity and accuracy. They’d make for great papers. Just remember to send him (and me!) copies!

Categories
Uncategorized

Learning by Analogy

Thanks to Daniel Lemire for pointing to this recent paper by Peter Turney on “A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations”.

Daniel Lemire considers this paper an example of the “more data beats better algorithms” principle most famously espoused by Google Director of Research Peter Norvig.

My take is a bit different. One message I heard repeatedly at the recent NSF Symposium on Semantic Knowledge Discovery, Organization and Use is that semantic researchers need to reduce their problem space to make progress. Peter is doing exactly that in his own work by taking what are perceived as distinct problems and generalizing them in order to treat them uniformly. Perhaps the broader community could profit from his approach and, um, learn by analogy.

Categories
General

Symposium on Semantic Knowledge Discovery, Organization and Use, Day 2

Day 2 of the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU brought out representatives of the titans of web search:

  • Yahoo: Patrick Pantel (who actually just joined Yahoo) regaled us with entertaining tales “Of Search and Semantics”, whisking us through the history of search, arguing that semantics are making a commercial impact today, and then describing some current research at Yahoo. He elaborated a fair amount on SeeLEx, which stands for seed list expansion and is, in his own words, Yahoo’s version of Google Sets. But, unlike Google Sets, SeeLEx offers at least some transparency into the basis for similarity. I couldn’t find anything published other than the slides in my notebook from the symposium, but this is interesting work.
     
  • Google: Marius Pasca delivered an excellent talk, though its title of “Web Search as an Online Word Game for Knowledge Discovery” was a bit misleading. Unfortunately, neither of the Google speakers provided slides for the symposium notebook–I hope that isn’t a mandate from Google’s corporate policy. In any case, the talk was along the lines of his AAAI 2008 paper, “Turning Web Text and Search Queries into Factual Knowledge: Hierarchical Class Attribute Extraction”. It presented an intriguing approach for inferring class attributes based on a distributional analysis of query space–an approach that reminded me of his earlier CIKM paper on “Acquisition of categorized named entities for web search”. Unfortunately neither of these papers is available without the appropriate memberships, but perhaps Marius will send them to you if you ask nicely.
     
  • Microsoft: Bill Dolan, Principal Researcher and manager of MSR’s Natural Language Processing Group, asked “Where does NLP stop and AI Begin?” He took us through the history of the MindNet project, and ruefully explained that it “worked beautifully” but “just not very often”. He made a compelling argument that semantic knowledge discovery researchers need to take a step back from AI-hard problems and focus on problems like paraphrase that are more amenable to the kind of progress we’ve seen in areas like machine translation.

Their presentations were followed by a three-hour (!) poster / demo session. As one of the demo presenters, I had 90 seconds in the poster boaster session to pitch my demo, and then attendees could wander around the posters and demos during the two-hour lunch break. I had some great conversations with the folks who did swing by, but I’m not sure this was the ideal format for conducting such a session.

The session after the demos was a bit of a blur for me, but the final discussion session was very engaging. One of the hot topics in the semantic knowledge community, much as it is in the information retrieval community, is the need for query logs–and, more generally, for good data–to conduct research. Having representatives of the three major web search engines there certainly made the conversation more interesting.

All in all, an excellent symposium, and I’m very grateful to Satoshi Sekine for organizing it.

Categories
Uncategorized

Fight the Spammers that Be!

Misery loves company, so I’m reassured to know that I’m not the only recent victim of the uptick in spam. Matt Hurst reports that Political Streams has recently been hit with a lot of LiveJournal spam. A current look at the site shows that the problem still persists, at least for blogs.

Matt says that “minor modification to our spam filter should take care of it.” I hope so. But it’s clear that, as social media become increasingly important, spammers are taking note.

p.s. For folks too young to remember the 80s, the title is a play on this Public Enemy song.

Categories
General

Google Flu Trends: The Privacy Backlash Begins

Don’t say I didn’t tell you so. Declan McCullagh reports at CNET that privacy groups are expressing concern about Google Flu Trends:

The Electronic Privacy Information Center and Patient Privacy Rights sent a letter this week to Google CEO Eric Schmidt saying if the records are “disclosed and linked to a particular user, there could be adverse consequences for education, employment, insurance, and even travel.” It asks for more disclosure about how Google Flu Trends protects privacy.

I agree with Declan that

If you think that knowing that Alaska’s “influenza-like illness” number for the week of November 9 is 2.035 and California’s number is 1.384 is somehow worrisome and can identify you personally, it’s time to break out your tinfoil hat.

But, as the article gets to, the deeper concern is the one expressed by the Electronic Privacy Information Center:

There are no clear legal or technological privacy safeguards that prevent the disclosure of individual search histories. Without such privacy safeguards Google Flu Trends could be used to reidentify users who search for medical information. Such user-specific investigations could be compelled, even over Google’s objection, by court order or presidential authority.

I’m not paranoid, and I actually think that both privacy advocates and web search companies have often exaggerated privacy issues, especially since the AOL fiasco a couple of years ago. But EPIC is raising a legitimate concern, and Google doesn’t seem to be providing very reassuring answers.

Specifically, web search companies are very protective of their log data in the name of privacy, much to the chagrin of researchers. And yet those same companies feel that privacy advocates exaggerate their concerns about the data being collected in the first place. Google / Yahoo / Microsoft: you can’t have it both ways!

A final point: Declan comments that “If users don’t like that, nobody’s forcing them to use Google.” I chalk that attitude up to his libertarianism rather than to any partiality he may have towards his wife’s employer. I have a libertarian streak myself, so I’m sympathetic. But, just as a practical matter, this is the sort of behavior that historically attracts regulation. Google and its rivals would do well to acknowledge the legitimacy of their critics’ concerns and regulate themselves first.

Categories
Uncategorized

Calling All New York Area CTY Alumni

My apologies to regular readers for this completely off-topic post. If you’ve never heard of CTY, feel free to get back to your regularly scheduled reading.

But if you are a CTY alum in the New York area and are interested in meeting your peers, please keep reading. CTY alumni coordinator Sarah Shelfer and alum Matt Mochary organized a gathering at the Pegu Club for 1980s CTY alumni in New York. We barely managed to fit around the table (new folks arrived as early birds rotated out), and all of us were excited at the prospect of renewing our connection to CTY and to one another. We’re still figuring out next steps, but the first one is to start finding more of one another. I’m hoping that this blog post helps spread the word.

If you are a CTY alum, even if you’re not in New York, and you’re interested in renewing your connection to CTY and the people who shared this formative experience with you, please contact Sarah at ctyalumni@jhu.edu. And, if you are in New York–or if you remember me from my three summers at Dickinson and Franklin & Marshall–please give me a shout!

Categories
Uncategorized

To Advertise Or Not To Advertise

More from Greg on gems from CIKM:

Andrei Broder and a large crew from Yahoo Research had a paper at CIKM 2008, “To Swing or not to Swing: Learning when (not) to Advertise” (PDF), that is a joy to see for those of us that are hoping to make advertising more useful and less annoying.

Of course, folks like me dislike advertising enough that we install plug-ins like Adblock Plus and CustomizeGoogle to avoid ads entirely. I wonder if a good learning algorithm would spare me the trouble. But, more importantly, I wonder how far an ad-supported industry wants to go in making it easy for people to opt out of advertising.