Categories
General

Is Twitter Planning To Monetize The Firehose?

A few months ago, I wrote in “The Twouble with Twitter Search”:

But the trickle that Twitter returns is hardly enough.

I believe this limitation is by design–that Twitter knows the value of such access and isn’t about to give it away. I just hope Twitter will figure out a way to provide this access for a price, and that an ecology of information access providers develops around it. Of course, if Google or Microsoft buys Twitter first, that probably won’t happen.

Now that Twitter has raised $100M at a valuation of $1B, I doubt any acquisition will happen anytime soon. But, according to Kara Swisher’s unnamed sources:

Twitter is in advanced talks with Microsoft and Google separately about striking data-mining deals, in which the companies would license a full feed from the microblogging service that could then be integrated into the results of their competing search engines.

If so, then it’s about time! How much either Microsoft or Google would pay for this feed is an interesting question. It’s probably not a coincidence that Twitter raised its last round of funding before pursuing this path–the revenue they obtain this way could be significant, but is unlikely to justify a $1B valuation.

In any case, I’m excited as a consumer that Twitter may finally allow Google and Microsoft to better expose the value of its content. But I’m also curious what my friends on the Twitter Search team think of the potential competition from the web search titans. Until now, no one has been able to compete effectively with Twitter’s native search because they lacked access to the firehose. Having such access would give Google and Microsoft more than a fighting chance. Given the centrality of search to Twitter’s user experience, it’s an interesting corporate strategy.

Categories
Uncategorized

Google Meets The Press

I enjoyed my proverbial fifteen seconds of fame on CNN yesterday, and I even enjoyed lunch at the New York Times cafeteria today. But for a prime-time media show check out the live blogging of a chat that Google co-founder Sergey Brin and CEO Eric Schmidt are having with reporters at the Google New York Office.

Here’s an excerpt (via TechCrunch) to pique your interest:

Q: Do you think Bing is something different or a rebranding?

Sergey Brin: I don’t want to speak about our competitors.

Eric Schmidt: Better for you to judge. We like to focus on our customers.


Categories
Uncategorized

The Noisy Channel, Live On CNN!

[Embedded CNN video clip: business/2009/10/06/dcl.blog.ftc.blogs.cnn]

For anyone who’s ever wondered what it would be like to see me live on CNN, this is your chance! Sorry that it isn’t my most telegenic moment. Still, it was a nice opportunity to share my perspective on the new FTC regulations facing bloggers.

Categories
Uncategorized

In the ASIS&T Bulletin: Reconsidering Relevance and Embracing Interaction

Just a quick note to alert readers to an article of mine, “Reconsidering Relevance and Embracing Interaction”, published in the current issue of the ASIS&T Bulletin. Of course, it’s all about trying to usher in a brave new world of human-computer information retrieval. If you’re not already sick of reading about HCIR, check it out!

Categories
Uncategorized

HCIR 2009 Proceedings Now Available

The HCIR 2009 proceedings are now available on the workshop web site. We’re planning to save trees and money by asking attendees to download the proceedings rather than print them out. And, of course, we’re delighted to circulate the proceedings to those who won’t be fortunate enough to spend the day at the workshop.

Categories
General

Jeff Jarvis and Matt Cutts on the New FTC Blog Regulations

As has been anticipated for a while–and discussed during the Ethics of Blogging panel–the United States Federal Trade Commission (FTC) has published explicit guidelines regarding how bloggers (at least within its jurisdiction) must disclose any “material connections” they have to the companies they endorse. The full details are available here.

There have been a number of reactions across the blogosphere, but I’d like to home in on two opposing views: those of What Would Google Do author (and blogger) Jeff Jarvis and Googler Matt Cutts.

Jarvis describes the regulations as “a monument to unintended consequence, hidden dangers, and dangerous assumptions…the greatest myth embedded within the FTC’s rules [is] that the government can and should sanitize the internet for our protection.”

Commenting on Jarvis’s post, Cutts replies:

As a Google engineer who has seen the damage done by fake blogs, sock puppets, and endless scams on the internet, I’m happy to take the opposite position: I think the FTC guidelines will make the web more useful and more trustworthy for consumers. Consumers don’t want to be shilled and they don’t want payola; they want a web that they can trust. The FTC guidelines just say that material connections should be disclosed. From having dealt with these issues over several years, I believe that will be a good thing for the web.

It’s a fascinating debate, and I can see merit in both sides. Like the folks at Reason, I lean libertarian (at least on issues of freedom of expression) and am not eager to see more government regulation of online speech. That said, I see the value of laws requiring truth in advertising, and I don’t see why pay-for-play bloggers should get a free pass if they are acting as advertisers. Interestingly, Jarvis’s response to Cutts is: “I trust you to regulate spam more than the FTC. You are better at it and have more impact.” That’s probably true today, but I wouldn’t want to vest that responsibility in a company that makes 99% of its revenue from advertising.

Everyone in this discussion sees the value of transparency–the question is whether it should be a legal norm enforced through FTC regulation or a social norm enforced by the marketplace. Despite my general skepticism about regulation of expression, I temper my libertarianism with a dose of pragmatism. For example, I’m glad that the Food and Drug Administration (FDA) at least tries to regulate health claims–its efforts may not eliminate quackery, but they surely reduce the problem.

Do we need FTC regulation in order to tame the jungle of social media? For that matter, will the regulations have a positive effect, or will sploggers and other scammers simply ignore them–and perhaps even move offshore? I share Jarvis’s fear that the regulation will cause more harm than good–perhaps even having a chilling effect on would-be bloggers. Certainly the FTC will have to use its new power wisely–both to avoid trampling the existing blogosphere and to avoid scaring off newcomers. Still, if the FTC shows that it is only out to get true scammers, it may help establish, in Cutts’s words, a web we can trust.

I’m Daniel Tunkelang, and I endorse this blog post.

Categories
General

Software Patents: A Personal Story

Given the radioactive nature of this post’s subject matter, I feel the need to remind readers that this is not a corporate blog, and that the opinions expressed within are my personal opinions, not those of my employer. Also, please understand that I cannot comment on any intellectual property issues specifically related to my employer.

With that preamble out of the way, let me tell you a true story. The other day, I received a phone call from a friend who has been building a kick-ass startup. That friend had been contacted by a much larger competitor with what amounted to an ultimatum: shut down and come work for us, or we’ll crush you with a patent infringement suit. My friend’s startup didn’t cave in–in fact, my friend even went through the trouble of sharing a pile of incontrovertible prior art with the competitor. The competitor was unimpressed, and my friend’s startup is now facing a potentially ruinous lawsuit.

If you know any of the characters in this story, I beg you to keep that information to yourself–at least for now. I’d like my friend to have a chance of getting his company out of this predicament, and premature publicity might hurt his case.

But back to the case: let me give you an idea of how a story like this can play out. At a high level, the startup can choose to fight or not fight.

Not fighting means that the entrepreneurs write off their startup, but it allows them to move on and try something new. It might be the best career move for the entrepreneurs, but it means that the world loses a promising startup, and the surrender rewards bad behavior, reinforcing a regime where innovators can’t afford to compete with more established players.

Fighting means mounting a non-infringement defense, an invalidation defense, or both.

A non-infringement argument asserts that, regardless of the validity of the patent, its claims don’t cover what the startup is doing. Since patents carry a presumption of validity, the non-infringement route is appealing–there’s no need to slog through the much longer invalidation process. Leaving a bad patent alive may be a worse outcome for the rest of the world, but entrepreneurs don’t have the luxury of taking the weight of the world onto their own shoulders.

Unfortunately, the very characteristics of a bad patent make it hard for an accused infringer to succeed in a non-infringement argument. If a patent is overly broad, then it’s more likely that the infringement argument will be valid (but not sound, since the patent itself is–or should be–invalid). Vaguely worded claims are also a problem–while a patent examiner may have granted a patent based on one interpretation of the claim language, the patent holder may now be asserting infringement under a different (and typically broader) interpretation of that same language.

As a result, a non-infringement argument often depends almost entirely on the result of a Markman hearing, more formally known as a claim construction hearing. In such a hearing, a judge decides how to interpret any language in the claim whose meaning is contested by the opposing parties in the suit. Such a hearing is often a crap shoot for the accused infringer. An unfavorable result which supports the infringement accusation may ultimately help invalidate the patent, but the results are likely to come too late–justice delayed for a startup is often an extreme case of justice denied.

Which brings us to the invalidation route. In theory, invalidation is the right approach to take when confronted with an invalid patent. Ideally, the accused infringer presents prior art to the patent office to reexamine the patent, resulting in the patent either being invalidated or rewritten to have a much narrower scope. In practice, however, this approach requires significant effort, time, money–especially if you depend on lawyers to do the heavy lifting–and luck. The best hope is to rapidly request and obtain a reexamination, and then to request and obtain a stay of the infringement suit pending reexamination. Needless to say, the patent holder will fight tooth and nail to avoid this outcome.

I don’t know how my friend’s story will end. But, as the above analysis should make clear, he’s between a rock and a hard place. Whether or not you believe that there should be software patents–and there is room for reasonable people to debate this question–I hope you agree that the situation my friend is facing amounts to legalized extortion. I understand that no system is perfect, and that our legal system requires compromises that have inevitable casualties.

Nonetheless, my friend’s story does not feel like an isolated incident, but rather evidence of a systemic problem. There are a lot of software patents floating around right now of dubious validity, many of them granted to companies that have since folded and have unloaded their assets in fire sales. It would be sad for this supply of ersatz intellectual property to impede the real innovation that the patent system was intended to protect.

Update: this post has been picked up by Y Combinator’s Hacker News.

Categories
Uncategorized

Google Updates Search Refinement Options

Google announced today that its Search Options feature, which allows users to progressively refine search results, now includes new refinement options: past hour, specific date range, more shopping sites, fewer shopping sites, visited pages, not yet visited, books, blogs and news. Of course, you could do some of this already with clever hackery. In any case, it’s great to see Google slouching towards HCIR on its most visible search property. Perhaps I was too quick to write off their interest in faceted search. Meanwhile, I’m staying tuned for Bing 2.0.

Categories
Uncategorized

Guest Post: A Plan For Abusiveness

The following is a guest post by Jeff Revesz and Elena Haliczer, co-founders of Adaptive Semantics. Adaptive Semantics specializes in sentiment analysis, in particular using machine learning to help automate comment moderation. They’ve been quite successful at the Huffington Post, which is also an investor. Intrigued by their approach, I reached out to them to solicit this post. I encourage you to respond publicly in the comment thread, or to contact them personally (first name @ adaptivesemantics.com).

Seven years ago, Paul Graham famously stated:

“I think it’s possible to stop spam, and that content-based filters are the way to do it.”

Well, seven years of innovation and research have brought about some great advances in the field of text classification, so perhaps it’s time to raise the stakes a little. In short, we think it’s possible to stop abusiveness in user-generated content, and that content-based filters are the way to do it.

The Problem with UGC

Publishers these days are in a tight spot with user-generated content (UGC). The promise of UGC in terms of engagement and overall stickiness is hard to pass up, but along with the benefits come some headaches as well. Comment spam is less of an issue than it once was, thanks to services such as Akismet, but the problem of trolling and outright abuse is as bad as it ever was. Any publisher venturing into UGC is stuck with the question of how to keep comments in line with their editorial standards while at the same time avoiding accusations of censorship. The solution employed thus far has mainly been a combination of keyword filters and human moderators. Unfortunately for publishers, there are serious problems with both of those, so let’s look at that more closely.

The main problem with human moderators is the cost involved. They’re expensive, hard to outsource, and they don’t scale. The average human moderator can handle at most about 250 comments per hour, and that’s a generous estimate. At minimum wage this works out to about $0.03 per comment, which seems reasonable until you consider that a typical online publisher like the Huffington Post receives about 2 million comments per month site-wide. Add in overhead costs like hiring, training, and auditing, and it quickly gets out of control. On top of this is the issue of moderator bias: is it possible that your Democratic moderator is simply deleting every post that disagrees with President Obama, regardless of content?
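
A back-of-envelope sketch makes the scale concrete; the throughput and volume figures are the estimates above, and the hourly wage is an assumed circa-2009 US federal minimum:

    # Back-of-envelope moderation cost using the estimates above.
    # The wage is an assumed circa-2009 US federal minimum; all figures are rough.
    comments_per_month = 2_000_000   # typical large publisher, per the estimate above
    comments_per_hour = 250          # generous per-moderator throughput
    hourly_wage = 7.25               # assumed US federal minimum wage (2009)

    moderator_hours = comments_per_month / comments_per_hour   # 8,000 hours/month
    labor_cost = moderator_hours * hourly_wage                  # ~$58,000/month
    cost_per_comment = hourly_wage / comments_per_hour          # ~$0.03

    print(f"{moderator_hours:,.0f} hours, ${labor_cost:,.0f}/month, "
          f"${cost_per_comment:.3f} per comment")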

To mitigate the costs involved, many publishers add a layer of non-human filtering, such as a keyword list. While this may seem like a good idea at first, all it really does is offer you the worst of both worlds: an expensive, non-scalable solution that also gives bad results. Keyword lists can be easily beaten by the simplest obfuscation, such as breaking up bad words or replacing a letter with a symbol. In addition, it is impossible for keyword filters to catch anything but the crudest type of abusiveness. A great example is the recent Facebook poll “Should Obama Be Killed?”, which would likely pass right through a keyword filter but is quite obviously abusive content.
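
To make that weakness concrete, here is a minimal sketch of an exact-match keyword filter; the banned list and example comments are invented for illustration:

    # Minimal exact-match keyword filter; the banned list and examples are invented.
    BANNED = {"idiot", "moron"}

    def keyword_filter_flags(comment: str) -> bool:
        words = comment.lower().split()
        return any(w.strip(".,!?") in BANNED for w in words)

    print(keyword_filter_flags("What an idiot."))   # True  -- exact match is caught
    print(keyword_filter_flags("What an id iot."))  # False -- broken-up word slips through
    print(keyword_filter_flags("What an 1diot."))   # False -- symbol substitution slips through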

The Solution:  Sentiment Classifiers

The idea of using a machine-learning classifier to identify text-based semantics is not a new one. Vladimir Vapnik introduced the original theory of support vector machines (SVMs) in 1995, and in 1998 Thorsten Joachims argued that the algorithm was perfectly suited for textual data. Finally, in 2002 Lillian Lee and colleagues showed that not only are SVMs well suited for identifying sentiment, but they also dominate keyword-based filters consistently. When applied to the problem of comment moderation, SVMs can mimic human moderation decisions with an accuracy of about 85%. That raises the question: is 85% good enough? How can we push the accuracy higher?
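
As a rough sketch of the general approach (not Adaptive Semantics’ actual system), here is how a linear SVM over bag-of-words features might be assembled with scikit-learn; the tiny training set is invented, and a real model would be trained on thousands of human moderation decisions:

    # Illustrative linear-SVM abusiveness classifier (not a production system).
    # The training comments are invented; a real model would use thousands of
    # human moderation decisions as labeled examples.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_comments = [
        "You are a worthless idiot",               # abusive
        "Go back to your cave, troll",             # abusive
        "I disagree with this healthcare bill",    # non-abusive
        "Great article, thanks for posting",       # non-abusive
    ]
    train_labels = [1, 1, 0, 0]  # 1 = abusive, 0 = non-abusive

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(train_comments, train_labels)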

We have some proprietary answers to that question over at Adaptive Semantics, but a less controversial one arises from a well-documented property of classifier output known as the hyperplane distance (labeled vk in the diagram below).

[Figure: optimal-margin classifier diagram]

If the separating hyperplane can be viewed as the dividing line between abusive and non-abusive content, the hyperplane distance of any individual test comment can be interpreted as the classifier’s confidence in its own answer. If a test comment turns out to be very far from the dividing line, we can say that it lies deeply in “abusive space” (or in “non-abusive space” depending on the polarity). Now let’s imagine an SVM pre-filter that only makes auto-publish or auto-delete decisions on comments that have a large hyperplane distance, and sends all other comments to the human moderator staff. Such a classifier would have a guaranteed accuracy above 85%, and would progressively reduce the reliance on human moderators as it is re-trained over time. Even a conservatively tuned model can reduce the human moderation load by about 50% while keeping comment quality roughly the same. That’s a pretty good start.
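
In code, such a pre-filter might look like the sketch below, reusing the model from the previous sketch; the margin thresholds are illustrative and would be tuned against held-out, human-moderated data:

    # Confidence-based triage using the signed distance from the hyperplane.
    # `model` is the pipeline from the previous sketch; thresholds are illustrative.
    AUTO_DELETE_MARGIN = 1.0
    AUTO_PUBLISH_MARGIN = -1.0

    def triage(comment: str) -> str:
        # decision_function returns the signed distance from the separating
        # hyperplane: positive means the abusive side, and the magnitude is
        # the classifier's confidence in its own answer.
        distance = model.decision_function([comment])[0]
        if distance >= AUTO_DELETE_MARGIN:
            return "auto-delete"              # deep in "abusive space"
        if distance <= AUTO_PUBLISH_MARGIN:
            return "auto-publish"             # deep in "non-abusive space"
        return "send to human moderator"      # low confidence: defer to a person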

In addition to high accuracy, a content-based classifier does not have the same vocabulary limitations as a keyword filter. Since the classifier is trained by feeding it thousands of real-world examples, it will learn to identify all of the typical types of obfuscation, such as broken words, netspeak, slang, euphemisms, etc. And since the entire content of the comment is used as an input, the classifier implicitly takes context into account. So the comment “Should Obama Be Killed?” would likely be flagged for deletion, but a comment like “A defeat on healthcare may kill Obama’s chances at re-election.” would be left alone.

So is the abusiveness problem licked? Not quite yet, but the use of linear classifiers would be a huge step in the right direction. You could imagine further advances such as aggregating comment scores by user to quickly identify trolls, and maybe even using those scores as input for another classifier. Or how about training more classifiers to identify quality submissions and pick out the experts in your community? The possibilities are definitely exciting, and they raise another question: why are publishers not using these techniques? That one we don’t have a good answer for, so we founded a company in response.

Categories
Uncategorized

A Museum of Mathematics

Mathematics illuminates the patterns that abound in our world. The Math Factory strives to enhance public understanding and perception of mathematics. Its dynamic exhibits and programs will stimulate inquiry, spark curiosity, and reveal the wonders of mathematics. The museum’s activities will lead a broad and diverse audience to understand the evolving, creative, human, and aesthetic nature of mathematics.

The above is the mission statement of The Math Factory, an organization headed by former Renaissance Technologies analyst (and CTY alumnus) Glen Whitney that aspires to build a national museum of mathematics in New York. The effort is well underway–the organization has raised $4M to date, attracted an impressive group of trustees and advisors, and has obtained quite a bit of enthusiastic press coverage. No wonder–the Math Midway it exhibited at the World Science Festival this past June was a wild success. I’d gone there to offer moral support, only to find that I was lucky to get close enough to see the exhibits!

Last night, I was fortunate enough to attend a gala at the Urban Academy and actually play with the exhibits–from riding a tricycle with square wheels to walking through a maze without making left turns. It was a blast! And, while I’ll admit to being favorably predisposed towards math, the exhibits hardly required such a predisposition–any more than the Exploratorium in San Francisco requires a predisposition towards science. Rather, experiences like these create excitement, overcoming the negative preconceptions that too many children (and adults!) have about the subject.

While I suspect that many Noisy Channel readers are already sold on both the enjoyment and core societal value of mathematics, I encourage you to think about how much better a world we would have if this appreciation were more widely shared. For those who have to think about large numbers just to manage their assets, I encourage you to think of The Math Factory as worthy of your philanthropy. I encourage everyone to contribute your ideas and endorsements to this visionary effort.