
Software Patents: A Personal Story

Given the radioactive nature of this post’s subject matter, I feel the need to remind readers that this is not a corporate blog, and that the opinions expressed within are my personal opinions, not those of my employer. Also, please understand that I cannot comment on any intellectual property issues specifically related to my employer.

With that preamble out of the way, let me tell you a true story. The other day, I received a phone call from a friend who has been building a kick-ass startup. That friend had been contacted by a much larger competitor with what amounted to an ultimatum: shut down and come work for us, or we’ll crush you with a patent infringement suit. My friend’s startup didn’t cave in–in fact, my friend even went to the trouble of sharing a pile of incontrovertible prior art with the competitor. The competitor was unimpressed, and my friend’s startup is now facing a potentially ruinous lawsuit.

If you know any of the characters in this story, I beg you to keep that information to yourself–at least for now. I’d like my friend to have a chance of getting his company out of this predicament, and premature publicity might hurt his case.

But back to the case: let me give you an idea of how a story like this can play out. At a high level, the startup can choose to fight or not fight.

Not fighting means that the entrepreneurs write off their startup, but it allows them to move on and try something new. It might be the best career move for the entrepreneurs, but it means that the world loses a promising startup, and the surrender rewards bad behavior, reinforcing a regime where innovators can’t afford to compete with more established players.

Fighting means mounting a non-infringement defense, an invalidation defense, or both.

A non-infringement argument asserts that, regardless of the validity of the patent, its claims don’t cover what the startup is doing. Since patents carry a presumption of validity, the non-infringement route is appealing–there’s no need to slog through the much longer invalidation process. Leaving a bad patent alive may be a worse outcome for the rest of the world, but entrepreneurs don’t have the luxury of taking the weight of the world onto their own shoulders.

Unfortunately, the very characteristics of a bad patent make it hard for an accused infringer to succeed in a non-infringement argument. If a patent is overly broad, then it’s more likely that the infringement argument will be valid (but not sound, since the patent itself is–or should be–invalid). Vaguely worded claims are also a problem–while a patent examiner may have granted a patent based on one interpretation of the claim language, the patent holder may now be asserting infringement under a different (and typically broader) interpretation of that same language.

As a result, a non-infringement argument often depends almost entirely on the result of a Markman hearing, more formally known as a claim construction hearing. In such a hearing, a judge decides how to interpret any language in the claim whose meaning is contested by the opposing parties in the suit. Such a hearing is often a crap shoot for the accused infringer. An unfavorable result which supports the infringement accusation may ultimately help invalidate the patent, but the results are likely to come too late–justice delayed for a startup is often an extreme case of justice denied.

Which brings us to the invalidation route. In theory, invalidation is the right approach to take when confronted with an invalid patent. Ideally, the accused infringer presents prior art to the patent office to reexamine the patent, resulting in the patent either being invalidated or rewritten to have a much narrower scope. In practice, however, this approach requires significant effort, time, and money–especially if you depend on lawyers to do the heavy lifting–and luck. The best hope is to rapidly request and obtain a reexamination, and then to request and obtain a stay of the infringement suit pending reexamination. Needless to say, the patent holder will fight tooth and nail to avoid this outcome.

I don’t know how my friend’s story will end. But, as the above analysis should make clear, he’s between a rock and a hard place. Whether or not you believe that there should be software patents–and there is room for reasonable people to debate this question–I hope you agree that the situation my friend is facing amounts to legalized extortion. I understand that no system is perfect, and that our legal system requires compromises that have inevitable casualties.

Nonetheless, my friend’s story does not feel like an isolated incident, but rather evidence of a systemic problem. There are a lot of software patents floating around right now of dubious validity, many of them granted to companies that have since folded and have unloaded their assets in fire sales. It would be sad for this supply of ersatz intellectual property to impede the real innovation that the patent system was intended to protect.

Update: this post has been picked up by Y Combinator’s Hacker News.


Google Updates Search Refinement Options

Google announced today that its Search Options feature, which allows users to progressively refine search results, now includes new refinement options: past hour, specific date range, more shopping sites, fewer shopping sites, visited pages, not yet visited, books, blogs and news. Of course, you could do some of this already with clever hackery. In any case, it’s great to see Google slouching towards HCIR on its most visible search property. Perhaps I was too quick to write off their interest in faceted search. Meanwhile, I’m staying tuned for Bing 2.0.


Guest Post: A Plan For Abusiveness

The following is a guest post by Jeff Revesz and Elena Haliczer, co-founders of Adaptive Semantics. Adaptive Semantics specializes in sentiment analysis, in particular using machine learning to help automate comment moderation. They’ve been quite successful at the Huffington Post, which is also an investor. Intrigued by their approach, I reached out to them to solicit this post. I encourage you to respond publicly in the comment thread, or to contact them personally (first name @ adaptivesemantics.com).

Seven years ago, Paul Graham famously stated:

“I think it’s possible to stop spam, and that content-based filters are the way to do it.”

Well, seven years of innovation and research have brought about some great advances in the field of text classification, so perhaps it’s time to raise the stakes a little. In short, we think it’s possible to stop abusiveness in user-generated content, and that content-based filters are the way to do it.

The Problem with UGC

Publishers these days are in a tight spot with user-generated content (UGC). The promise of UGC in terms of engagement and overall stickiness is hard to pass up, but along with the benefits come some headaches as well. Comment spam is less of an issue than it once was, thanks to services such as Akismet, but the problem of trolling and outright abuse is as bad as it ever was. Any publisher venturing into UGC is stuck with the question of how to keep comments in line with their editorial standards while at the same time avoiding accusations of censorship. The solution employed thus far has mainly been a combination of keyword filters and human moderators. Unfortunately for publishers, there are serious problems with both, so let’s look at each more closely.

The main problem with human moderators is the cost involved. They’re expensive, hard to outsource, and they don’t scale. The average human has a maximum capacity of about 250 comments per hour, which is a generous estimate. At minimum wage this works out to about $0.03 per comment, which seems reasonable until you consider that a typical online publisher like the Huffington Post receives about 2 million comments per month site-wide. Add in overhead costs like hiring, training, auditing, etc., and it quickly starts to get out of control. On top of this is the issue of moderator bias. Is it possible that your Democratic moderator is simply deleting every post that disagrees with President Obama, regardless of whether it’s actually abusive?
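The arithmetic above is easy to check. A quick sketch, assuming the post’s figures of 250 comments per hour and 2 million comments per month, and taking “minimum wage” to be the US federal rate of $7.25/hour:

```python
# Back-of-the-envelope moderation cost, using the figures from the post.
comments_per_hour = 250          # generous estimate for one moderator
hourly_wage = 7.25               # US federal minimum wage (assumption)
comments_per_month = 2_000_000   # approximate site-wide comment volume

cost_per_comment = hourly_wage / comments_per_hour
monthly_cost = cost_per_comment * comments_per_month

print(f"${cost_per_comment:.3f} per comment")  # ~ $0.029
print(f"${monthly_cost:,.0f} per month")       # ~ $58,000 before overhead
```

Roughly $58,000 per month in raw labor, before hiring, training, and auditing overhead.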

To mitigate the costs involved, many publishers add in a layer of non-human filtering, such as a keyword list. While this may seem like a good idea at first, all it really does is offer you the worst of both worlds. Now you have an expensive, non-scalable solution that also gives bad results. Keyword lists can be easily beaten by the simplest obfuscation, such as breaking up bad words or simply replacing a letter with a symbol. In addition, it is impossible for keyword filters to catch anything but the crudest type of abusiveness. A great example is the recent Facebook poll “Should Obama Be Killed?” which would likely pass right through a keyword filter but is quite obviously abusive content.
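To make the failure mode concrete, here is a minimal (hypothetical) keyword filter and the trivial obfuscations that defeat it — the blocked words and sample comments are illustrative, not from any real deployment:

```python
# A minimal keyword filter, and how trivially obfuscation defeats it.
BLOCKED = {"idiot", "moron"}

def keyword_filter(comment: str) -> bool:
    """Return True if the comment should be blocked."""
    words = comment.lower().split()
    return any(w in BLOCKED for w in words)

print(keyword_filter("you are an idiot"))        # True  -- caught
print(keyword_filter("you are an id1ot"))        # False -- symbol substitution
print(keyword_filter("you are an i d i o t"))    # False -- broken-up word
print(keyword_filter("Should Obama Be Killed?")) # False -- no bad words at all
```

The last case is the most damning: the most abusive comment contains no blocked word at all, because the abusiveness lives in the meaning, not the vocabulary.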

The Solution: Sentiment Classifiers

The idea of using a machine-learning classifier to identify text-based semantics is not a new one. Vladimir Vapnik introduced the original theory of support vector machines (SVMs) in 1995, and in 1998 Thorsten Joachims argued that the algorithm was perfectly suited for textual data. Finally, in 2002 Lillian Lee and colleagues showed that not only are SVMs well suited for identifying sentiment, but they also consistently outperform keyword-based filters. When applied to the problem of comment moderation, SVMs can mimic human moderation decisions with an accuracy of about 85%. That raises the question: is 85% good enough, and how can we push the accuracy higher?
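The basic setup is straightforward to sketch with off-the-shelf tools. The following is a toy illustration using scikit-learn (not the authors’ proprietary system); the training comments and labels are made up, and a real deployment would train on thousands of human-moderated examples:

```python
# Sketch of a linear-SVM abusiveness classifier over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training set (hypothetical); a real system would
# use thousands of comments labeled by human moderators.
comments = [
    "great article, thanks for sharing",
    "I disagree, but this is a fair point",
    "you are a complete idiot",
    "go away, nobody wants you here",
]
labels = ["ok", "ok", "abusive", "abusive"]

# Bag-of-words TF-IDF features fed into a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(comments, labels)

print(model.predict(["what an idiot"]))
```

Because the classifier learns from whole comments rather than a word list, it picks up whatever lexical and contextual cues separate the two classes in the training data.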

We have some proprietary answers to that question over at Adaptive Semantics, but a less controversial one arises from a well-documented property of classifier output known as the hyperplane distance (labeled vk in the diagram below).

[Figure: optimal-margin classifier, showing the separating hyperplane and the margin distance vk]

If the separating hyperplane can be viewed as the dividing line between abusive and non-abusive content, the hyperplane distance of any individual test comment can be interpreted as the classifier’s confidence in its own answer. If a test comment turns out to be very far from the dividing line, we can say that it lies deeply in “abusive space” (or in “non-abusive space” depending on the polarity). Now let’s imagine an SVM pre-filter that only makes auto-publish or auto-delete decisions on comments that have a large hyperplane distance, and sends all other comments to the human moderator staff. Such a classifier would have a guaranteed accuracy above 85%, and would progressively reduce the reliance on human moderators as it is re-trained over time. Even a conservatively tuned model can reduce the human moderation load by about 50% while keeping comment quality roughly the same. That’s a pretty good start.
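The routing logic described above can be sketched as follows. The threshold value is a hypothetical tuning choice; in scikit-learn terms, the score would come from `model.decision_function([comment])[0]`, the signed distance from the hyperplane:

```python
# Routing comments by classifier confidence (signed hyperplane distance,
# with positive scores on the abusive side). The threshold is a tuning
# parameter chosen to trade human workload against error rate.
THRESHOLD = 1.0

def route(score: float) -> str:
    """Decide what to do with a comment given its hyperplane distance."""
    if score > THRESHOLD:
        return "auto-delete"    # deep in abusive space
    if score < -THRESHOLD:
        return "auto-publish"   # deep in non-abusive space
    return "human-review"       # too close to the boundary to trust

print(route(2.7))   # auto-delete
print(route(-1.8))  # auto-publish
print(route(0.3))   # human-review
```

Raising the threshold sends more comments to humans but makes the automated decisions safer; lowering it does the reverse. That is the knob behind the “conservatively tuned model” mentioned above.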

In addition to high accuracy, a content-based classifier does not have the vocabulary limitations of a keyword filter. Since the classifier is trained by feeding it thousands of real-world examples, it will learn to identify all of the typical types of obfuscation such as broken words, netspeak, slang, euphemisms, etc. And since the entire content of the comment is used as an input, the classifier will implicitly take context into account. So the comment “Should Obama Be Killed?” would likely be flagged for deletion, but a comment like “A defeat on healthcare may kill Obama’s chances at re-election.” would be left alone.

So is the abusiveness problem licked? Not quite yet, but the use of linear classifiers would be a huge step in the right direction. You could imagine further advances such as aggregating comment scores by user to quickly identify trolls, and maybe even using those scores as input for another classifier. Or how about training more classifiers to identify quality submissions and pick out the experts in your community? The possibilities are definitely exciting, and they raise another question: why are publishers not using these techniques? That one we don’t have a good answer for, so we founded a company in response.