Categories
Uncategorized

Off To DC

I’m heading to Washington, DC tomorrow morning, a couple of days before the HCIR ’09 workshop. I’m not sure I’ll have any opportunities to blog while I’m in the nation’s capital, but of course I’ll post a write-up about the workshop when I’m back! Meanwhile, if you need your blog fix, I encourage you to check out some of the blogs I read.


Third Annual Workshop on Search in Social Media (SSM 2010)

I’m proud to announce that Eugene Agichtein, Marti Hearst, and Ian Soboroff have invited me to help organize the upcoming Workshop on Search in Social Media (SSM 2010). The workshop will take place in conjunction with the ACM International Conference on Web Search and Data Mining (WSDM 2010), a young conference that has quickly become a top-tier forum for work in these areas. The conference and workshop will take place in my home town of New York–Brooklyn, to be precise!

Here’s the key information from the workshop web site:

Overview

Social applications are the fastest growing segment of the web. They establish new forums for content creation, allow people to connect to each other and share information, and permit novel applications at the intersection of people and information. However, to date, social media has been primarily popular for connecting people, not for finding information. While there has been progress on searching particular kinds of social media, such as blogs, search in others (e.g., Facebook, MySpace, or Flickr) is not as well understood.

The purpose of the 3rd Annual Workshop on Search in Social Media (SSM 2010) is to bring together information retrieval and social media researchers to consider the following questions: How should we search in social media? What are the needs of users, and models of those needs, specific to social media search? What models make the most sense? How does search interact with existing uses of social media? How can social media search complement traditional web search? What new search paradigms for information finding can be facilitated by social media?

SSM 2010 follows up on the highly successful SSM 2009 and SSM 2008 workshops held at SIGIR 2009 and CIKM 2008 respectively. We are looking forward to an equally exciting workshop at WSDM 2010 in New York!

Format and Topics

We are planning for a full-day workshop consisting of invited speakers, organized in both plenary and panel sessions, and a contributed poster/demo session.

We solicit short (under 2 pages) position papers, posters or demo proposals to be presented as part of a poster session, describing late-breaking and novel research results or demonstrations of prototypes or working systems. All topics at the intersection of information finding and social media are of interest, including, but not limited to:

  • Searching blogs, tweets, and other textual social media.
  • Searching within social networks, including expert finding.
  • Searching Wikipedia discussions and revision histories.
  • Searching online discussions, mailing lists, forums, and community question answering sites.
  • The role of human-powered and community question answering.
  • Novel models of information finding and new search applications for social media.
  • The role of timeliness, authority, and accuracy in social media search.
  • Interaction between traditional web search and social media search.
  • User needs assessments and task analysis for social media search.
  • Interactions between searching and browsing in social media.
  • Searching and exploiting folksonomies, tags, and tagged data.
  • Spam and adversarial interactions in social media.

Ideal papers may include late-breaking and novel research results, position and vision papers discussing the role of search in social media, and demonstrations of prototypes or working systems. Note that the workshop proceedings will not be archived or considered as formal publication, to encourage the informal atmosphere and to allow the authors to publish expanded versions of the work elsewhere.

The poster/demo proposals should be in standard ACM SIG format; more details will be posted soon.

Submissions are due on December 15th. I hope to see some of you there! Meanwhile, feel free to suggest ideas for invited speakers who have done interesting work at the intersection of social media and search, and I’ll share your suggestions with my co-organizers.


Go Shopping, Be Social


If you’re into search startups, then today’s a great day to check out what a couple of them are up to.

TheFind just launched (or relaunched?) a “buying engine” that aspires “to help every shopper find exactly what they want to buy, and to help every merchant, large and small, to reach those shoppers.” It has some nice interface elements, but I can’t say I’m sold on the overall user experience.

Meanwhile, Aardvark just launched a web-based version of its social search application. The site urges users to “ask any question in plain English, and Aardvark will discover the perfect person in your network to answer…in under 5 minutes!” As I’ve commented before, I think they need to embrace the philosophy of “when in doubt, make it public”. But hey, they made Time’s list of the top 50 websites for 2009, so perhaps they are right to ignore my advice.


Faceted Search Book: Now At Half Price!

Not sure when (or why) this happened, but I just noticed that my Faceted Search book is now almost half off at Amazon, selling for $12.94. Not that it was ever that extravagant a purchase, but at this price you have 48% fewer excuses not to buy your own copy! And, speaking of Amazon, I would appreciate if folks who have read the book could take a moment to post a review there.


Google Meets The Press

I enjoyed my proverbial fifteen seconds of fame on CNN yesterday, and I even enjoyed lunch at the New York Times cafeteria today. But for a prime-time media show check out the live blogging of a chat that Google co-founder Sergey Brin and CEO Eric Schmidt are having with reporters at the Google New York Office.

Here’s an excerpt (via TechCrunch) to pique your interest:

Q: Do you think Bing is something different or a rebranding?

Sergey Brin: I don’t want to speak about our competitors.

Eric Schmidt: Better for you to judge. We like to focus on our customers.


The Noisy Channel, Live On CNN!

[Embedded CNN video: dcl.blog.ftc.blogs.cnn]

For anyone who’s ever wondered what it would be like to see me live on CNN, this is your chance! Sorry that it isn’t my most telegenic moment. Still, it was a nice opportunity to share my perspective on the new FTC regulations facing bloggers.


In the ASIS&T Bulletin: Reconsidering Relevance and Embracing Interaction

Just thought I’d alert readers to an article I published in the current issue of the ASIS&T Bulletin entitled “Reconsidering Relevance and Embracing Interaction”. Of course, it’s all about trying to usher in a brave new world of human-computer information retrieval. If you’re not already sick of reading about HCIR, check it out!


HCIR 2009 Proceedings Now Available

The HCIR 2009 proceedings are now available on the workshop web site. We’re planning to save trees and money by asking attendees to download the proceedings rather than printing them out. And, of course, we’re delighted to circulate the proceedings to those who won’t be fortunate enough to spend the day at the workshop.


Google Updates Search Refinement Options

Google announced today that its Search Options feature, which allows users to progressively refine search results, now includes new refinement options: past hour, specific date range, more shopping sites, fewer shopping sites, visited pages, not yet visited, books, blogs and news. Of course, you could do some of this already with clever hackery. In any case, it’s great to see Google slouching towards HCIR on its most visible search property. Perhaps I was too quick to write off their interest in faceted search. Meanwhile, I’m staying tuned for Bing 2.0.


Guest Post: A Plan For Abusiveness

The following is a guest post by Jeff Revesz and Elena Haliczer, co-founders of Adaptive Semantics. Adaptive Semantics specializes in sentiment analysis, in particular using machine learning to help automate comment moderation. They’ve been quite successful at the Huffington Post, which is also an investor. Intrigued by their approach, I reached out to them to solicit this post. I encourage you to respond publicly in the comment thread, or to contact them personally (first name @ adaptivesemantics.com).

Seven years ago, Paul Graham famously stated:

“I think it’s possible to stop spam, and that content-based filters are the way to do it.”

Well, seven years of innovation and research have brought about some great advances in the field of text classification, so perhaps it’s time to raise the stakes a little. In short, we think it’s possible to stop abusiveness in user-generated content, and that content-based filters are the way to do it.

The Problem with UGC

Publishers these days are in a tight spot with user-generated content (UGC). The promise of UGC in terms of engagement and overall stickiness is hard to pass up, but along with the benefits come some headaches as well. Comment spam is less of an issue than it once was, thanks to services such as Akismet, but the problem of trolling and outright abuse is as bad as it ever was. Any publisher venturing into UGC is stuck with the question of how to keep comments in line with their editorial standards while at the same time avoiding accusations of censorship. The solution employed thus far has mainly been a combination of keyword filters and human moderators. Unfortunately for publishers, there are serious problems with both, so let’s look at each more closely.

The main problem with human moderators is the cost involved. They’re expensive, hard to outsource, and they don’t scale. The average human has a maximum capacity of about 250 comments per hour, which is a generous estimate. At minimum wage this works out to about $0.03 per comment, which seems reasonable until you consider that a typical online publisher like the Huffington Post receives about 2 million comments per month site-wide. Add in overhead costs like hiring, training, and auditing, and it quickly starts to get out of control. On top of this is the issue of moderator bias. Is it possible that your Democratic moderator is simply deleting every post that disagrees with President Obama, regardless of content?
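
The arithmetic above is easy to check; in this sketch, the $7.25/hour wage is an assumption (the US federal minimum wage at the time), while the 250 comments/hour and 2 million comments/month figures come from the text:

```python
# Back-of-envelope cost of all-human comment moderation.
WAGE_PER_HOUR = 7.25           # assumed: US federal minimum wage, USD
COMMENTS_PER_HOUR = 250        # generous per-moderator throughput (from the text)
COMMENTS_PER_MONTH = 2_000_000 # site-wide volume (from the text)

cost_per_comment = WAGE_PER_HOUR / COMMENTS_PER_HOUR
monthly_cost = cost_per_comment * COMMENTS_PER_MONTH
moderator_hours = COMMENTS_PER_MONTH / COMMENTS_PER_HOUR

print(f"${cost_per_comment:.3f} per comment")               # → $0.029 per comment
print(f"${monthly_cost:,.0f} per month in wages alone")     # → $58,000 per month in wages alone
print(f"{moderator_hours:,.0f} moderator-hours per month")  # → 8,000 moderator-hours per month
```

And that $58,000/month is before any of the hiring, training, and auditing overhead.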

To mitigate the costs involved, many publishers add in a layer of non-human filtering, such as a keyword list. While this may seem like a good idea at first, all it really does is offer you the worst of both worlds. Now you have an expensive, non-scalable solution that also gives bad results. Keyword lists can be easily beaten by the simplest obfuscation, such as breaking up bad words or simply replacing a letter with a symbol. In addition, it is impossible for keyword filters to catch anything but the crudest type of abusiveness. A great example is the recent Facebook poll “Should Obama Be Killed?” which would likely pass right through a keyword filter but is quite obviously abusive content.
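
To see why, here’s a minimal sketch of the kind of keyword filter described above; the word list, function name, and example comments are all made up for illustration:

```python
# A toy keyword filter: flag a comment if any word is on the blocklist.
BLOCKLIST = {"kill", "idiot"}

def keyword_filter_flags(comment: str) -> bool:
    words = comment.lower().split()
    return any(word.strip(".,!?") in BLOCKLIST for word in words)

print(keyword_filter_flags("You are an idiot"))         # True: exact match caught
print(keyword_filter_flags("You are an id1ot"))         # False: one swapped letter evades it
print(keyword_filter_flags("You are an i d i o t"))     # False: a broken-up word evades it
print(keyword_filter_flags("Should Obama Be Killed?"))  # False: no blocklisted word at all
```

The last case is the crucial one: the poll contains no obscenity for a word list to catch, yet it is obviously abusive.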

The Solution:  Sentiment Classifiers

The idea of using a machine-learning classifier to identify text-based semantics is not a new one. Vladimir Vapnik introduced the original theory of support vector machines (SVMs) in 1995, and in 1998 Thorsten Joachims argued that the algorithm was perfectly suited for textual data. Finally, in 2002 Lillian Lee and colleagues showed that not only are SVMs well suited for identifying sentiment, but they also consistently dominate keyword-based filters. When applied to the problem of comment moderation, SVMs can mimic human moderation decisions with an accuracy of about 85%. That raises the question: is 85% good enough? How can we push the accuracy higher?

We have some proprietary answers to that question over at Adaptive Semantics, but a less controversial one arises from a well-documented property of classifier output known as the hyperplane distance (labeled vk in the diagram below).

[Diagram: optimal-margin classifier, with the hyperplane distance labeled vk]

If the separating hyperplane can be viewed as the dividing line between abusive and non-abusive content, the hyperplane distance of any individual test comment can be interpreted as the classifier’s confidence in its own answer. If a test comment turns out to be very far from the dividing line, we can say that it lies deeply in “abusive space” (or in “non-abusive space” depending on the polarity). Now let’s imagine an SVM pre-filter that only makes auto-publish or auto-delete decisions on comments that have a large hyperplane distance, and sends all other comments to the human moderator staff. Such a classifier would have a guaranteed accuracy above 85%, and would progressively reduce the reliance on human moderators as it is re-trained over time. Even a conservatively tuned model can reduce the human moderation load by about 50% while keeping comment quality roughly the same. That’s a pretty good start.
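
The pre-filter described above can be sketched in a few lines. Everything here is hypothetical: the scores, the threshold value, and the function names are invented, and in practice each score would be the signed hyperplane distance produced by a trained classifier:

```python
# Triage comments by classifier confidence: act automatically only when the
# comment is far from the separating hyperplane, otherwise defer to humans.
# Convention (assumed): score > 0 lies on the abusive side; |score| is the
# hyperplane distance vk.
TAU = 1.0  # assumed confidence threshold; tune conservatively in practice

def triage(score: float) -> str:
    if score >= TAU:
        return "auto-delete"   # deep in "abusive space"
    if score <= -TAU:
        return "auto-publish"  # deep in "non-abusive space"
    return "human-review"      # near the hyperplane: low confidence

scored_comments = [("c1", 2.3), ("c2", -1.7), ("c3", 0.4), ("c4", -0.2)]
for comment_id, score in scored_comments:
    print(comment_id, triage(score))
# → c1 auto-delete, c2 auto-publish, c3 human-review, c4 human-review
```

Raising TAU trades automation for accuracy: a higher threshold sends more comments to humans but makes the automatic decisions more reliable.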

In addition to high accuracy, a content-based classifier does not share the vocabulary limitations of a keyword filter. Since the classifier is trained by feeding it thousands of real-world examples, it will learn to identify all of the typical types of obfuscation such as broken words, netspeak, slang, euphemisms, etc. And since the entire content of the comment is used as input, the classifier will implicitly take context into account. So the comment “Should Obama Be Killed?” would likely be flagged for deletion, but a comment like “A defeat on healthcare may kill Obama’s chances at re-election.” would be left alone.

So is the abusiveness problem licked? Not quite yet, but the use of linear classifiers would be a huge step in the right direction. You could imagine further advances such as aggregating comment scores by user to quickly identify trolls, and maybe even using those scores as input for another classifier. Or how about training more classifiers to identify quality submissions and pick out the experts in your community? The possibilities are definitely exciting, and they raise another question: why are publishers not using these techniques? That one we don’t have a good answer for, so we founded a company in response.
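
The per-user aggregation idea floated above can be sketched like so; the user names, scores, and flagging threshold are invented for illustration, and the scores stand in for per-comment classifier output (positive = abusive side of the hyperplane):

```python
from collections import defaultdict
from statistics import mean

# Aggregate per-comment abusiveness scores by user to surface likely trolls.
comment_scores = [
    ("alice", -1.2), ("alice", -0.8), ("alice", 0.1),
    ("bob",    1.9), ("bob",    2.4), ("bob",   1.1),
]

by_user = defaultdict(list)
for user, score in comment_scores:
    by_user[user].append(score)

TROLL_THRESHOLD = 1.0  # assumed: a mean score above this flags a likely troll
trolls = sorted(u for u, scores in by_user.items() if mean(scores) > TROLL_THRESHOLD)
print(trolls)  # → ['bob']
```

The same aggregation run with the sign flipped would surface a community’s most consistently constructive voices, which is exactly the expert-finding flavor of the idea.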