Categories
General

CFP: IEEE Internet Computing Special Issue on Context-Aware Computing

Pankaj Mehra and I are guest editors for an upcoming special issue of IEEE Internet Computing with the topic “Beyond Search: Context-Aware Computing”.

Here is a copy of the call for papers:

Context is the unstated actor in human communications, actions, and situations. It makes our communication efficient, our commands actionable, and our situations understandable to the people, organizations, and devices that provide us with content or services. The increased embedding of technology into our personal and social environments drives a need for context-aware computing.

Context-aware computing offers mobile Internet users an experience that goes beyond user-initiated search and location-­based services. Context awareness sharpens relevance when responding to user-initiated actions (such as product search and support calls). It also enables proactive communications through analysis of a user’s behavior and environment, thereby forming the basis for key business imperatives targeting customer-engagement systems. Even greater opportunity arises from context use in systems that can make sense of and engage in customer dialogs and forums.

This special issue seeks original articles that support and illustrate context use in creating enhanced user experiences. Sample topics include

  • proactive, contextualized delivery of information, alerts, and advertisements;
  • context-mediated Web service orchestration, yielding actionable interpretation of spoken high-level commands;
  • system architecture, economics, and ecosystems for comprehensively capturing, representing, communicating, gathering, and brokering the larger user context;
  • systems of engagement that treat discourse as text plus context and process textual communication as an event in which linguistic, cognitive, and social actions converge; and
  • reasoning and knowledge representation mechanisms that use context in selecting the body of knowledge to use, the level of detail to model, and the point of view with which to communicate and interpret text and data.

All submissions must be original manuscripts of fewer than 5,000 words, focused on Internet technologies and implementations. All manuscripts are subject to peer review on both technical merit and relevance to IC’s international readership—primarily system and software design engineers. We do not accept white papers, and we discourage strictly theoretical or mathematical papers. To submit a manuscript, please log on to ScholarOne (https://mc.manuscriptcentral.com:443/ic-cs) to create or access an account, which you can use to log on to IC’s Author Center and upload your submission.

I hope some of you will submit articles in time for the June 15 deadline, and Pankaj and I look forward to reviewing them.

Categories
General

Identifying Influencers on Twitter


One of the perks of working at LinkedIn is being surrounded by intellectually curious colleagues. I recently joined a reading group and signed up to lead our discussion of a WSDM 2011 paper on “Identifying ‘Influencers’ on Twitter” by Eytan Bakshy, Jake Hofman, Winter Mason, and Duncan Watts. It’s great to see the folks at Yahoo! Research doing cutting-edge work in this space.

I thought I’d prepare for the discussion by sharing my thoughts here. Perhaps some of you will even be kind enough to add your own ideas, which I promise to share with the reading group.

I encourage you to read the paper, but here’s a summary of its results:

  • A user’s influence on Twitter is the extent to which that user can cause diffusion of a posted URL, as measured by reposts propagated through follower edges in Twitter’s directed social graph.
  • The best predictors of future total influence are follower count and past local influence, where local influence refers to the average number of reposts by that user’s immediate followers, and total influence refers to average total cascade size.
  • The content features of individual posts do not have identifiable predictive value.
  • Barring a high per-influencer acquisition cost, the most cost-effective strategy for buying influence is to target users of average influence.

Let’s dive in a bit deeper.

The definitions of influence and influencers are, by the authors’ own admission, narrow and arbitrary. There are many ways one could define influence, even within the context of Twitter use. But I agree with the authors that these definitions have enough verisimilitude to be useful, and their simplicity facilitates quantitative analysis.

It’s hardly surprising that past influence is a strong predictor of future influence. But it might seem counterintuitive that, for predicting future total influence, past local influence is more informative than past total influence. The authors suggest the explanation that most non-trivial cascades are of depth 1 — i.e., total influence is mostly local influence. But at most that would make the two features equally informative, and total influence should still be a mildly better predictor.

I suspect that another factor is in play — namely, that the difference between local influence and total influence reflects the unpredictable and rare virality of the content (e.g., a random Facebook Question generated 4M votes). If this hypothesis is correct, then past local influence factors out this unpredictable factor and is thus a better predictor of both future local influence and future total influence.
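To make that hypothesis concrete, here is a toy simulation sketch (entirely my own, not the authors’) in which each user has a stable local influence while total cascade size occasionally picks up a rare viral bonus. Under those assumptions, past local influence correlates better with future total influence than past total influence does:

```python
import random

random.seed(42)

def simulate_period(local_rate, viral_prob=0.01, viral_size=500):
    """Return (local, total) influence observed over one period."""
    local = max(random.gauss(local_rate, local_rate * 0.1), 0.0)   # stable, user-specific component
    viral = viral_size * random.random() if random.random() < viral_prob else 0.0
    return local, local + viral                                    # total = local + rare viral tail

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rates = [random.uniform(1, 50) for _ in range(20000)]              # each user's underlying local influence
past = [simulate_period(r) for r in rates]
future = [simulate_period(r) for r in rates]

past_local, past_total = zip(*past)
future_total = [total for _, total in future]

# If the hypothesis holds, the first correlation should beat the second.
print("past local vs. future total:", round(pearson(past_local, future_total), 3))
print("past total vs. future total:", round(pearson(past_total, future_total), 3))
```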

I’m a bit surprised that follower count supplies additional predictive value beyond past local influence; after all, local influence should already reflect the extent to which the followers are being influenced. It’s possible that past influence lags the follower count, since it does not sufficiently weigh the potential contributions of more recent followers. But another possibility is one analogous to the predictive value of past local vs. total influence: past local influence may include an unpredictable content factor that follower count factors out.

Of course, I can’t help suggesting that TunkRank might be a more useful indicator than follower count. Unfortunately the authors don’t seem to be aware of the TunkRank work — or perhaps they preferred to restrict their attention to basic features.
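For readers unfamiliar with TunkRank: it is an attention-based influence measure I proposed a while back, in which a user’s influence is, roughly, the sum over followers of the attention each follower can spare (diluted by how many accounts that follower follows), plus a bonus when the follower retweets. Here is a minimal fixed-point sketch over a toy graph, illustrative only and not a reference implementation:

```python
# A minimal fixed-point sketch of the TunkRank recurrence (illustrative only):
# a user's influence is the sum, over each follower, of the attention that
# follower can give (diluted by how many accounts the follower follows),
# plus a p-weighted share of the follower's own influence (the retweet case).
def tunkrank(followers, following_count, p=0.05, iterations=50):
    """followers: user -> set of users who follow that user.
       following_count: user -> number of accounts that user follows."""
    influence = {u: 1.0 for u in followers}
    for _ in range(iterations):
        influence = {
            u: sum((1.0 + p * influence[f]) / max(following_count[f], 1)
                   for f in followers[u])
            for u in followers
        }
    return influence

# Toy graph: b and c follow a; c also follows b; nobody follows c.
followers = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
following_count = {"a": 0, "b": 1, "c": 2}
print(tunkrank(followers, following_count))
```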

I’m not surprised by the inability to exploit content features to predict influence. If it were easy to generate viral content, everyone would do it. Granted, a deeper analysis might squeeze out a few features (like those suggested in the Buddy Media report), but I don’t think there are any silver bullets here.

Finally, the authors consider the question of designing a cost-effective strategy to buy influence. The authors assume that the cost of buying influence can be modeled in terms of two parameters: a per-influencer acquisition cost (which is the same for each influencer) and a per-follower cost for each influencer. They conclude that, until the acquisition cost is extremely high (i.e., over 10,000 times the per-follower cost), the most cost-efficient influencers are those of average influence. In other words, there’s no reason to target the small number of highly influential users.
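Here is a back-of-the-envelope sketch of that cost model as I read it (the numbers are invented for illustration, not taken from the paper): each influencer costs a fixed acquisition fee plus a per-follower fee, and cost-efficiency is dollars spent per expected repost. With a modest acquisition cost the average user wins; crank the acquisition cost up far enough and the celebrity does.

```python
# A back-of-the-envelope sketch of the cost model described above (my
# reading of it, not the authors' code). All numbers are invented.
def cost(followers, acquisition_cost, per_follower_cost):
    """Total price of hiring one influencer."""
    return acquisition_cost + per_follower_cost * followers

def cost_per_repost(expected_reposts, followers, acquisition_cost, per_follower_cost):
    """Dollars spent per expected repost: lower is more cost-effective."""
    return cost(followers, acquisition_cost, per_follower_cost) / max(expected_reposts, 1e-9)

candidates = [("celebrity", 1_000_000, 2_000),   # (label, followers, expected reposts)
              ("average user", 500, 5)]

for label, followers, reposts in candidates:
    for acquisition_cost in (10.0, 100_000.0):   # cheap vs. very expensive to sign up
        cpr = cost_per_repost(reposts, followers, acquisition_cost, per_follower_cost=0.01)
        print(f"{label:12s} acquisition=${acquisition_cost:>9,.0f} -> ${cpr:,.2f} per repost")
```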

The authors may be arriving at the right conclusion (Watts’s earlier work with Peter Dodds, which the paper cites, questions the “influentials” hypothesis), but I’m not convinced by their economic model of an influence market. It may be the case that professional influencers are trying to peddle their followers’ attention on a per-follower basis — there are sites that offer this model.

But why should anyone believe that an influencer’s value is proportional to his or her number of followers? The authors’ own work suggests that past local influence is a more valuable predictor than follower count, and again they might want to look at TunkRank.

Regardless, I’m not surprised that a fixed per-follower cost makes users with high follower counts less cost-effective, as I subscribe to its corollary: as a user’s follower count goes up, the per-follower value diminishes. I haven’t done the analysis, but I believe that the ratio of a user’s TunkRank to the user’s follower count tends to go down as a user’s follower count goes up. A more interesting research (and practical) question would be to establish a correctly calibrated model of influencer value and then explore portfolio strategies.

In any case, it’s an interesting paper, and I look forward to discussing it with my colleagues next week. Of course, I’m happy to discuss it here in the meantime. If you’re in my reading group, feel free to chime in. And if you’re not in my reading group, consider joining. We do have openings. 🙂

Categories
General

Social Utility, +/- 25%

I like Google…

I’ve been a regular Google user since the day I first discovered its existence in 1999. Indeed, I’ve consistently found Google to be the most useful service on the web. That’s not love, but it’s a very strong +1.

Moreover, I’d say that my preference for Google is an informed one. I’ve given all of the major search engines a fair chance, and even tried a fair number of obscure ones. They all have their strengths, but none have delivered enough utility to me to justify the cognitive load of using more than one search engine for the open web.

…but I don’t need Google.

Nonetheless, I know that, if Google disappeared tomorrow or became inconvenient to access, I’d be content with one of its competitors. I have no particular investment in Google beyond brand loyalty.

Actually, that’s not entirely true. I could easily walk away from Google search, but I’d be apoplectic if I suddenly lost access to my Gmail account — much as if I lost access to my LinkedIn or Twitter accounts. Indeed, Gmail is the only way in which Google has me locked in, but I don’t see my Gmail account as entangled with my access to Google’s other services.

Perhaps that’s not a bug but a feature: after all, Google trumpets the virtues of “open” and the portability of user data (including Gmail) through the Data Liberation Front. Nonetheless, it’s no secret that Google has a major case of Facebook envy. And if rumors hold, Google is now making the success of its social strategy a major component in all employee compensation.

Social is Give to Get.

Google critics often assert that Google doesn’t get social. But I think the problem isn’t so much with what Google gets as what it gives. When it comes to social, you have to give to get. That is, to get data and engagement, you have to provide social utility.

To start off, Google would love to know who you are. That’s why it developed Google Profiles in 2007. People are more than willing to provide data about who they are, as proven by the hundreds of millions of people who create profiles on Facebook and LinkedIn. Perhaps Google was a little bit late to the game. More likely, people didn’t see enough utility in creating Google profiles. Facebook, on the other hand, helps people be found by their friends and family in a context designed for social interaction. LinkedIn offers people the opportunity to be found by people who can help them professionally: colleagues, classmates, potential employers, etc. Google didn’t give people much reason to invest effort — in fact, it seems to treat Profiles as a dumping ground populated by Google’s other products, rather than a valuable piece of online real estate embedded in a living social context. Not surprisingly, users invest their efforts elsewhere.

Google would also love to know where you are and where you’ve been — that’s why Google created Latitude in 2009. Moreover, Google developed this pioneering location-based service as a complement to Google Maps, perhaps the best product Google has produced outside of search. Given its dominance in mapping services, directions, and local search, Google should be the leader of all things local. And yet, while Latitude has flopped, Foursquare — which launched in the same year as a tiny startup after Google acquired and shut down its previous incarnation — succeeded in defining location-based services as a category. Before Foursquare, the idea of a service tracking your location was one that most of us associated with Lo-Jack and Big Brother — if not with modern totalitarian regimes. Yet, by making a game out of “checking in” to venues, Foursquare inspired its users to willingly — and eagerly! — share and publish their whereabouts. It’s unclear whether this model will create sustained interest (cf. Mark Watkins’s analysis at ReadWriteWeb), but Foursquare’s success thus far is predicated on its offering social utility in exchange for data and attention.

Of course, Google also wants to know what you like. That’s why Google developed SearchWiki (RIP), Hotpot (now merged into Places), and most recently +1. As Amazon, Facebook, Netflix, and Yelp have demonstrated, people aren’t shy about sharing their opinions publicly, given the right social context and utility. Unfortunately, Google seems to struggle with that last part. Google embedded SearchWiki in the non-social context of search — and has launched +1 the same way. It’s not at all clear what users would gain by going out of their flow to annotate search results. Hotpot may simply be a case of too little, too late — people are already trained to go to Yelp and Facebook Fan pages for subjective information about service businesses. Overall, Google has not given users a reason to believe there is significant return on their investment in sharing opinions.

Collecting Data Doesn’t Count.

Of course Google is able to collect a significant amount of data about users’ identities through their search history, cookies, browser toolbars, and purchase history (if they use Google Checkout). Indeed, it is Google’s inference of user intent in search queries that has allowed Google to become the poster child of online advertising.

But collecting data is not the same as having the user volunteer it. Most users have a transactional relationship with Google, tolerating data collection and advertising in exchange for a free service. Google wants more — it wants users to invest in identities associated with their Google accounts. But Google doesn’t seem to understand that users don’t make these investments unless they receive some social or professional utility in return.

If it’s true that Larry Page is making “social” Google’s top OKR, then I hope for the sake of my former colleagues that Google has learned from its past experiments.

Categories
General

Guest Blog: Data 2.0 Conference Report


Note: This post was written by Scott Nicholson, a Senior Data Scientist at LinkedIn. Scott is a data and modeling geek with a passion for startups, product, and user experience. His work at LinkedIn focuses on analyzing and improving user engagement and monetization.

I’m happy to report back on my experience at the Data 2.0 conference, an event organized by midVentures and targeted at entrepreneurs building products to leverage the dramatic increase in publicly and privately collected data. The conference had four main themes: what data is available, how to obtain data, how to store and access data, and how to create value from data products. For data nerds or hackers, the conference offered a delightful stream of “you know what would be cool…” ideas.

The morning started off on a strong foot with a talk by Vivek Wadhwa on how data is going to define the next generation of successful startups in a new information age. He observed the increasing online access to data that has previously been restricted to offline access (or no access at all). He also emphasized the importance of new sources of data, such as medical records and genome data. We need to think of social use of data beyond Twitter, Facebook, and LinkedIn: for example, genome data will allow us to connect to each other in ways that help us better understand our similarities and differences. Meanwhile, some existing data sources will become increasingly open and available to all. Wadhwa stressed the importance of leveraging the open sources of federal, state, and local government data to come up with solutions to the existing closed and clunky legacy systems that governments use to generate data reports (a pity that data.gov and related programs may be defunded — DT).

The morning keynote segued nicely into the panel on open data sources. Jay Nath, Director of CRM for the city of San Francisco, noted that, while many applications are using government data and APIs, they mostly address consumer convenience (e.g., public transit apps) rather than government efficiency. Panelists agreed that government employees have few incentives to take risks by using new technology: legacy systems might be expensive, inflexible and inefficient, but they do perform their limited function. Alluding to Eric Ries’s idea of a “lean startup”, Nath suggested the concept of a “lean government” that lowered costs, sped up its operations, and avoided procurement processes by using open source technology — all in the context of providing services to its citizens.

The inspiring mid-day keynote by former Amazon Chief Scientist Andreas Weigend took a different perspective from the morning sessions: he focused on how data sharing can provide tangible value to end-users, even resulting in significant behavior change. He cited products like tweeting weight scales, FitBit, and Nike+ that allow people to share data about their fitness efforts, thus leading to social reinforcement for positive behaviors. I personally see this area as a great example of where data scientists and engineers can create enormous economic value and increase people’s welfare.

The day also featured various product launches and presentations. Here are a few that caught my attention:

  • Micello: Google maps for indoors. They won the startup competition that was held in conjunction with the conference.
  • Tropo: API for voice calls and SMS
  • DataStax Brisk: Technology unifying Hadoop, Hive & Cassandra. A new Hadoop distribution powered by Cassandra.
  • Neer: always-on location awareness app from Qualcomm. Privately share location with groups and families.
  • Heritage Health Prize: $3MM prize for predictive modeling around who will require hospitalization (a follow-up on their announcing the prize at Strata)

Overall, it was great to see hundreds of people exploring innovations and opportunities to use data to improve business, technology and society.

Categories
General

Steal These Ideas!

Talk is cheap, as the saying goes. That’s a good thing, since I am always overflowing with ideas that I have neither the time (I love my day job!) nor the money to advance. What I do have is a blog that I hope inspires readers to turn some of these ideas into reality.

My ideas are somewhat predictable, in that they all address user-centric information-seeking problems. Working for over a decade in this space has focused my intellectual curiosity somewhat — and of course I work on a number of these problems at LinkedIn. But there are many information-seeking problems that are outside of my present or foreseeable scope.

Here are two ideas that I’m hoping someone will execute on so I don’t have to:

1. Shopping: Help Me Figure Out What I Want

We’ve come a long way to improve the shopping experience, at least for utilitarian shoppers like yours truly. If I know exactly what I want, I usually find it by using Google as a gateway to Amazon, taking a bit more time if I’m feeling price-sensitive. I’d happily install a browser extension that could automatically detect product search queries and take them to my preferred shopping sites, bypassing the search results page, but that’s a minor detail of convenience (though probably not such a minor detail for the search engine companies). In any case, known-item search for online shopping is hardly inspiring as an open problem.

Exploratory search is another story entirely. For all the work that’s been done on faceted search, it is used almost exclusively to help people narrow search results. Progressive narrowing is great if you have a pre-established information need, but it is not the best interface if you’re hoping to evolve your information need through exploration. Instead of just “help me find what I’m looking for”, I’d also like to see more “help me figure out what I want”. I’d like to see an innovator applying faceted search to broaden queries, not just to narrow them, as well as going beyond collaborative filtering and “related items” to create a compelling browsing experience based on semantic and social navigation.
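To illustrate what I mean by broadening, here is a toy sketch of my own (not a description of any particular product): the usual faceted move adds a constraint and shrinks the result set, while a broadening move relaxes constraints and surfaces items that share any facet value with the current selection.

```python
# A toy sketch of narrowing vs. broadening over a tiny faceted catalog
# (my own illustration; the item data is invented).
CATALOG = [
    {"name": "trail runner",  "brand": "Acme", "color": "blue", "use": "running"},
    {"name": "road runner",   "brand": "Acme", "color": "red",  "use": "running"},
    {"name": "day hiker",     "brand": "Peak", "color": "blue", "use": "hiking"},
    {"name": "approach shoe", "brand": "Peak", "color": "gray", "use": "climbing"},
]

def narrow(items, **facets):
    """Classic faceted refinement: keep only items matching every constraint."""
    return [item for item in items
            if all(item.get(k) == v for k, v in facets.items())]

def broaden(items, selection):
    """Exploratory move: rank items by how many facet values they share
    with the current selection, so near-misses come back into view."""
    shared = {(k, v) for item in selection for k, v in item.items() if k != "name"}
    scored = [(sum((k, item[k]) in shared for k in item if k != "name"), item)
              for item in items]
    return [item for score, item in sorted(scored, key=lambda pair: -pair[0]) if score > 0]

selection = narrow(CATALOG, brand="Acme", use="running")
print([item["name"] for item in selection])                    # the two Acme runners
print([item["name"] for item in broaden(CATALOG, selection)])  # plus the related hiker
```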

2. Organizing the World’s Information: Beyond Wikipedia and Navigational Queries

If shopping online often reduces to using Google to find product pages on Amazon, then informational queries similarly reduce to using Google to find Wikipedia entries. Nothing against Wikipedia — I think it is one of the most extraordinary achievements of our generation — but I think of the web as a library and Wikipedia as its encyclopedia section. Google’s mission statement notwithstanding, web search engines do a poor job of organizing the rest of the world’s information, instead choosing to optimize for known-item search.

There are countless opportunities for improvement here. Imagine if there were an interface for books, scholarly articles, patents, music, or videos that supported browsing and exploration of their content and meta-data. We’ve seen the beginnings of such an approach for individual libraries (e.g., the Triangle Research Libraries Network), but there is so much more to do in this space. Perhaps it’s a space that is hard to monetize, but even then I’d expect philanthropists to take an interest in making the world’s knowledge and creative artifacts more accessible.

If you are pursuing either of these areas, I’d love to hear about it. I’m sure readers here would too. I’m also curious to learn more about innovation in the travel and personals spaces, as those are both areas that could benefit from supporting exploratory search. And if you have work in progress, please present it at the HCIR workshop!

Categories
General

LinkedIn: HCIR for Fun and Profit

This afternoon, I met with a couple of Stanford seniors to advise them on a startup they’ve been developing, targeted at mid-sized online retailers. I’d expected to spend most of the time talking about their technology and customer development strategy — and we did indeed talk about these things. But we spent most of the time brainstorming about who in my network could best help them achieve the key milestone of landing their first customer.

Not surprisingly, my first step was to open up my laptop and head straight to LinkedIn (I’m not only a data scientist — I’m also a member!) to see who in my network might be most helpful to them at this critical stage. The students were openly impressed: despite being sharp, energetic, and remarkably business-savvy for a couple of guys not old enough to legally buy beer, they had never seen someone use LinkedIn the way I was doing in front of them — not for hiring or recruiting, but as an exploratory search tool to find useful professional connections.

I started with a search for online retail, then restricted to directors and VPs, narrowing down further to first-degree and second-degree connections. I vetted second-degree connections by looking at my paths to them, determining who would be likely to be most helpful either because they owed me a favor or because they might have their own interest in the startup’s success.

We then browsed through the list of top online retailers, identifying plausible companies for them to target and then looking for my first-degree and second-degree connections not only at those companies but also at other companies in the same space. We spent over an hour fluidly going back and forth between talking and exploring on LinkedIn. In the course of this exploration, we not only produced a list of people to contact, but also arrived at a better understanding of the business strategy.
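For the curious, here is a toy sketch of the kind of filtering I was doing by hand. The data set is made up and this is not the LinkedIn API; it just mimics the moves: search by industry, restrict by seniority, and prefer first- and second-degree connections, using the connecting path to judge how warm an introduction would be.

```python
# A toy sketch of exploratory people-search over a made-up connection list.
# This is NOT the LinkedIn API; it just mimics the moves described above.
CONNECTIONS = [
    {"name": "A. Buyer", "title": "VP eCommerce",      "industry": "online retail", "degree": 1, "via": []},
    {"name": "B. Ops",   "title": "Director, Online",  "industry": "online retail", "degree": 2, "via": ["A. Buyer"]},
    {"name": "C. Dev",   "title": "Software Engineer", "industry": "online retail", "degree": 2, "via": ["A. Buyer"]},
    {"name": "D. Exec",  "title": "VP Merchandising",  "industry": "apparel",       "degree": 3, "via": []},
]

SENIOR_PREFIXES = ("vp", "director")

def find_warm_intros(people, industry_keyword):
    hits = [p for p in people
            if industry_keyword in p["industry"]                 # topical match
            and p["title"].lower().startswith(SENIOR_PREFIXES)   # seniority filter
            and p["degree"] <= 2]                                # first- or second-degree only
    return sorted(hits, key=lambda p: (p["degree"], p["name"]))  # closest connections first

for person in find_warm_intros(CONNECTIONS, "online retail"):
    path = " -> ".join(person["via"]) or "direct connection"
    print(f'{person["name"]}: {person["title"]} (via {path})')
```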

I’m always happy to help young entrepreneurs who represent the future of our economy, and even happier to do so using the tools my colleagues and I are constantly working to improve. But I’m surprised and a bit disheartened that the methods I used are not common knowledge, especially among people who stand to gain the most benefit from them. Perhaps, as someone who has been using LinkedIn since 2004, I take for granted that people know how to take advantage of it for professional networking. I hope that the company’s increasing visibility will make more people aware that LinkedIn is not *just* the best thing that has ever happened to recruiting.

Also, as an HCIR advocate, I’d like to see these kinds of information-seeking tasks receive more attention from researchers and practitioners. I’ve been saying for a while that these and similar tasks are neglected by the information retrieval community and not adequately addressed by Google. For example, while there has been significant research effort in the area of expert finding, I’d like to see more effort devoted to improving the interactive process of finding experts and expertise. And not just from LinkedIn!

If you are doing work in this space, I hope you’ll participate in the upcoming HCIR workshop and show off your stuff.

In the meantime, I hope you make the most of LinkedIn, for fun and for profit. As a mentor of mine told me in my first job, it’s “network or not work”.

 

Categories
General

A Practical Rant about Software Patents

Given the controversial content of this post, I’d like to remind readers upfront that this post, like all of the contents of this blog, represents my personal opinions, and in particular does not represent the opinions of my present or former employers. I am not a lawyer, nor do I claim to have read any of the patents to which I directly or indirectly allude in this post. None of the below should be construed as legal advice. Finally, the material is US-centric — your national software patent policy may vary.

My feelings about software patents are a matter of public record (e.g., this open letter to the USPTO). As things stand today, software patents act as an innovation tax rather than as a catalyst for innovation. It may be possible to resolve the problems of software patents through aggressive reform, but it would be better to abolish software patents than to maintain the status quo.

My personal feelings notwithstanding, I acknowledge the reality that today’s software companies need to have defensive patent strategies. In a previous job, one of my key accomplishments was to hire a director of intellectual property. It was a difficult hire, but it happened just in time to defend against a particularly noxious patent troll. I am not at liberty to spell out the details, but I can say that we responded with a long, expensive fight that effectively quashed the patent and the lawsuit.

Beware Of Trolls


Patent trolls, known less pejoratively as non-practicing entities (NPEs) because they do not actually sell products or services that implement the systems or methods in the patents they own, take advantage of asymmetric risk. On one hand, an NPE does not need much money to bankroll (or at least initiate) a patent infringement suit — in fact, there are law firms that will take such cases on contingency. On the other hand, the company being sued faces potentially ruinous costs. Moreover, even if a company feels certain that a lawsuit against it is baseless, the company cannot count on the imperfect and inefficient legal system to reach a fair outcome. As a result, the company has to choose between spending heavily in its own defense or settling with the NPE. Most companies opt for the less risky route and negotiate settlements, providing funds that the NPEs use to sue more companies.

Some people have a name for this style of asymmetric warfare — namely, terrorism. I suppose that the word terrorist is loaded enough without increasing its breadth to include patent trolls — not to mention that trolls have their defenders. But the metaphor is a useful one. A terrorist attack inflicts an amount of damage that is much greater than the absolute cost to the terrorist, e.g., a suicide bomber who inflicts mass murder. Moreover, the threat of terrorism puts the object of that threat in the position of choosing between settling (aka negotiating with terrorists) and spending heavily on counter-terrorism efforts. As Peter Neumann notes in a Foreign Affairs article:

The argument against negotiating with terrorists is simple: Democracies must never give in to violence, and terrorists must never be rewarded for using it. Negotiations give legitimacy to terrorists and their methods…

Yet in practice, democratic governments often negotiate with terrorists.

There have been various attempts to address the threat of patent trolls.

Google litigation director Catherine Lacavera has gone on record saying that Google intends to fight rather than settle patent infringement lawsuits in order to deter patent trolls. We’ll see if Google can sustain this “we don’t negotiate with terrorists” approach; I admire the resolve, but like Neumann I’m skeptical.

Article One Partners has built a business around crowd-sourcing patent invalidation. Clients pay for research to invalidate patents, and Article One offers bounties to anyone who contributes valuable evidence. In theory, companies can request validity analysis of their own patents to test them for robustness, but I assume that the primary application of this service is the invalidation of patents that a company sees as threats.

Rational Patent (RPX) has created a defensive patent pool, purchasing a large portfolio of patents and then licensing them to its member companies. Some have questioned whether this approach is “patent extortion by another name”, and indeed paying RPX for a blanket license does feel a bit like preemptively settling in bulk. But I’d be more concerned that the “over 1,500 US and international patent assets” that RPX claims to have acquired are a drop in the bucket compared to the vast number of patents that the USPTO has granted, many of dubious merit.

Meanwhile, patent trolldom is serious business. Former Microsoft CTO Nathan Myhrvold created Intellectual Ventures to “invest both expertise and capital in the development and monetization of inventions and patent portfolios.” The company has only filed one lawsuit so far, but Mike Masnick claims that it has used over a thousand shell companies to conduct stealth lawsuits.

Unfortunately, the proliferation of lawsuits by software patent trolls suggests that the economic incentives encourage such suits. If every company could sustain a “Never give up, never surrender!” approach, patent trolls would eventually go away, but it is unlikely that companies would be willing to assume the short-term risks that such an approach entails.

Moreover, this approach only works if everyone participates, requiring that every company forgo the competitive advantage it could enjoy from being the only company among its competitors to appease the trolls. This is a classic tragedy of the commons. I’m hopeful that we’ll eventually implement sensible patent reform in the United States, but I expect it will take a long time to overcome the entrenched interests that support the status quo.

It’s Not Just The Trolls

But NPEs are not the only cause for concern. Many established companies, including some technology leaders, are not averse to using patent lawsuits as part of their business strategy. The mobile device and software space is a particularly popular arena for patent litigation, the most notable being Oracle’s lawsuit against Google claiming that Android infringes on patents related to Java. The stakes are extraordinary, dwarfing even the $612.5M that RIM paid NTP in order to avoid a complete shutdown of the BlackBerry service (ironically, at least some of the patents involved have since been rejected by the patent office after re-examination).

Patent lawsuits can also be a way for larger companies to bully smaller ones. For example, a couple of entrepreneurs at visual search startup Modista were forced to shut down their company because of a lawsuit by Like.com, a more established player in the space that was ultimately acquired by Google. Note: although I was at Google at the time, I have no inside knowledge of the acquisition, nor of whether there is any truth to the speculation that Google acquired the company for its patents.

Defensive Patenting

Moral considerations aside, the above stories make it clear that defensive patent strategy isn’t just about NPEs. In fact, the approach many software companies take to defensive patenting is to assemble a trove of patents that are useful for countersuits and thus serve as a deterrent. Back to military metaphors: it’s similar to countries developing nuclear weapons (a popular metaphor for patents in general) in accordance with the doctrine of mutual assured destruction.

Companies that follow a defensive patent strategy typically implement a process for capturing intellectual property. Scientists and engineers file invention disclosures, a committee reviews these for patentability, and a law firm translates the invention disclosures into patent filings. The filings then go through the meat grinder of patent prosecution and eventually are extruded as patents.

It all sounds great in theory — indeed, I have seen executives who mostly worry about educating scientists and engineers about patents and providing the right incentives to encourage them to write and submit invention disclosures. Granted, it can be difficult to integrate intellectual property capture into the process and culture of a software company. But I think there are two much bigger issues.

First, it takes several years to obtain a patent. Indeed, the USPTO dashboard shows that it takes two years just to get an initial response from the patent office. Thus a defensive patenting strategy requires significant advance planning: any patents filed today are unlikely to be useful deterrents until at least 2014. Given the rapid pace of the software industry, this delay is very significant. Moreover, startups are especially vulnerable in their first few years.

Second, intellectual property capture processes are inherently optimized for offensive (i.e., don’t copy my invention or I’ll sue you) rather than defensive (i.e., don’t sue me) patent strategy. Consider Google’s defensive position with respect to Oracle in the aforementioned lawsuit. Google has a relatively small patent portfolio, but it has obtained patents for some of its major innovations, such as MapReduce. Let’s put aside questions about the validity of the MapReduce patent — especially since patents enjoy the presumption of validity. The bigger question is to whom such a patent serves as a deterrent against patent lawsuits. It may very well deter Hadoop users, which include Google arch-rival Facebook. But, as far as I know, Oracle is not vulnerable on this front. FOSS Patents blogger Florian Mueller did an analysis and concluded that Google’s patents are not an effective deterrent. Indeed, the fact that Google has not counter-sued Oracle using its own patents is at least consistent with this analysis.

What if Google were to invest in obtaining (i.e., purchasing) a collection of broad patents that had to do with relational databases? Such patents could have nothing to do with Google’s areas of innovation and nonetheless serve as an effective deterrent against lawsuits from relational database companies like Oracle. Even if the patents were not robust, they would still have some value as deterrents because of their presumption of validity and the aforementioned inefficiency of the legal system.

In general, the most valuable defensive patents are those that you believe your competitors (or anyone else who might have an interest in suing you) are already infringing. Even if those patents would be unlikely to survive re-examination, the re-examination process is long and expensive, and even the most outrageous of patents enjoys the presumption of validity.

Everybody Into The Pool

A patent pool is a consortium of companies that agree to cross-license each other’s patents — a sort of mutual non-aggression pact. But perhaps companies that believe only in the defensive use of patents should take a more aggressive approach to patent pooling. Following the example of NATO, they could create an alliance in which they agree to mutual defense in response to an attack by any external party. I don’t know if such an approach would be viewed as anti-competitive, but it does strike me as a cost-effective alternative to the current approach for defensive patenting.

And, as with most ideas, this one is hardly original. In 1993, Autodesk founder John Walker published “PATO: Collective Security In the Age of Software Patents”, in which he proposed:

The basic principle of NATO is that an attack on any member is considered an attack on all members. In PATO it works like this–if any member of PATO is alleged with infringement of a software patent by a non-member, then that member may counter-sue the attacker based on infringement of any patent in the PATO cross-licensing pool, regardless of what member contributed it. Once a load of companies and patents are in the pool, this will be a deterrent equivalent to a couple thousand MIRVs in silos–odds are that any potential plaintiff will be more vulnerable to 10 or 20 PATO patents than the PATO member is to one patent from the aggressor. Perhaps the suit will just be dropped and the bad guy will decide to join PATO….

Since PATO is chartered to promote the free exchange and licensing of software patents, members do not seek revenue from their software patents–only mutual security. Thus, anybody can join PATO, even individual programmers who do not have a patent to contribute to the pool–they need only pay the nominal yearly dues and adhere to the treaty–that any software patents they are granted will go in the pool and that they will not sue any other PATO member for infringement of a software patent.

It’s been almost two decades, but perhaps PATO is an idea whose time has come. And, even if a collective effort fails, individual companies might do well to focus less on intellectual property capture and more on collecting the kinds of nuisance patents currently favored by trolls. After all, the best defense is the credible threat of a good offense.

Conclusion

Even if you hate software patents, you can’t afford to ignore them if you are in the software industry. And I’m well aware that not everyone shares my view of software patents. But I hope those who do find useful advice in the above discussion. I’d love to see the software industry move beyond this innovation tax.

Categories
General

Life, the Universe, and SEO Revisited

A couple of years ago, I wrote a post entitled “Life, the Universe, and SEO” in which I considered Google’s relationship with the search engine optimization (SEO) industry. Specifically, I compared it to the relationship that Deep Thought, the computer in Douglas Adams’s Hitchhiker’s Guide to the Galaxy, has with the Amalgamated Union of Philosophers, Sages, Luminaries and Other Thinking Persons.

Interestingly, both SEO and union protests have been front-page news of late. I’ll focus on the former.

Three recent incidents brought mainstream attention to the SEO industry:

  • Two weeks ago, Google head of web spam Matt Cutts told the New York Times that Google was engaging in a “corrective action” that penalized retailer J. C. Penney’s search results because the company had engaged in SEO practices that violated Google’s guidelines. For months before the action (which included the holiday season), J. C. Penney was performing exceptionally well in a broad collection of Google searches, including such queries as [dresses], [bedding], [area rugs], [skinny jeans], [home decor], [comforter sets], [furniture], [tablecloths], and [grommet top curtains]. As I write this blog post, I do not see results from jcpenney.com on the first result page for any of these search queries.
  • This past Thursday, online retailer Overstock.com reported to the Wall Street Journal that Google was penalizing them because of Overstock’s now discontinued practice of rewarding students and faculty with discounts in exchange for linking to Overstock pages from their university web pages. Before the penalty, these links were helping Overstock show up at the top of result sets for queries like [bunk beds] and [gift baskets]. As I write this blog post, I do not see results from overstock.com on the first result page for either of these search queries.
  • That same day, Google announced, via an official blog post by Amit Singhal (Google’s head of core ranking) and Matt Cutts, a change that, according to their analysis, noticeably impacts 11.8% of Google search queries. In their words: “This update is designed to reduce rankings for low-quality sites—sites which are low-value add for users, copy content from other websites or sites that are just not very useful. At the same time, it will provide better rankings for high-quality sites—sites with original content and information such as research, in-depth reports, thoughtful analysis and so on.”

Of course, Google is always working to improve search quality and stay at least one step ahead of those who attempt to reverse-engineer and game its ranking of results. But it’s quite unusual to see so much public discussion of ranking changes in such a short time period.

Granted, there is a growing chorus in the blogosphere bemoaning the decline of Google’s search quality. Much of it is focused on “content farms” that seem to be the target of Google’s latest update. Perhaps Google’s new public assertiveness is a reaction to what it sees as unfair press. Indeed, Google’s recent public spat with Bing would be consistent with a more assertive PR stance.

But what I find most encouraging is Google’s recent release of a Chrome browser extension that allows users to create personal site blocklists that are reported to Google. Some may see this as a reincarnation of SearchWiki, an ill-conceived and short-lived feature that allowed searchers to annotate and re-order results. But filtering out entire sites for all searches offers users a much greater return on investment than demoting individual results for specific searches.
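Here is a trivial sketch of why a domain blocklist has so much leverage (my own illustration, not Google’s implementation): a single preference applies to every future query, whereas a SearchWiki-style vote demoted one result for one query.

```python
# A trivial sketch of post-filtering ranked results against a personal
# domain blocklist (my own illustration, not Google's implementation).
from urllib.parse import urlparse

BLOCKLIST = {"contentfarm.example.com"}

def apply_blocklist(ranked_urls, blocklist=BLOCKLIST):
    """Drop any result whose host the user has blocked, for every query."""
    return [url for url in ranked_urls
            if urlparse(url).hostname not in blocklist]

ranked = [
    "http://contentfarm.example.com/how-to-boil-water",
    "http://goodsite.example.org/boiling-water-guide",
]
print(apply_blocklist(ranked))   # only the non-blocked result survives
```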

Of course, I’d love to see user control taken much further. And I wonder if efforts like personal blocklists are the beginning of Amit offering me a more positive answer to the question I asked him back in 2008 about relevance approaches that relied on transparent design rather than obscurity.

I’m a realist: I recognize that many site owners are competing for users’ attention, that most users are lazy, and that Google wants to optimize search quality subject to these constraints. I also don’t think that anyone today threatens Google with the promise of better search quality (and yes, I’ve tried Blekko).

Perhaps the day is in sight when human-computer information retrieval (HCIR) offers a better alternative to the organization of web search results than the black-box ranking that fuels the SEO industry. But I’ve been waiting for that long enough not to hold my breath. Instead, I’m encouraged to see a growing recognition that today’s approaches are an endless game of Whac-A-Mole, and I’m delighted that at least one of the improvements on the table takes a realistic approach to putting more power in the hands of users.

Categories
General

Life’s a Beach

Heading to Punta Cana for a week. Feel free to keep writing great comments — will catch up when I get back!

Categories
General

Google vs. Bing: A Tweetle Beetle Battle Muddle

Unless you’ve been living in a cone of silence, you’ve probably heard about the epic war of words between Google and Bing. But just in case, here’s a quick summary:

Amit Singhal, Google Fellow: “Microsoft’s Bing uses Google search results—and denies it”:

Bing is using some combination of:

  • Internet Explorer 8, which can send data to Microsoft via its Suggested Sites feature, and
  • the Bing Toolbar, which can send data via Microsoft’s Customer Experience Improvement Program,

or possibly some other means to send data to Bing on what people search for on Google and the Google search results they click. Those results from Google are then more likely to show up on Bing. Put another way, some Bing results increasingly look like an incomplete, stale version of Google results—a cheap imitation.

Harry Shum, Corporate Vice President, Bing: “Thoughts on search quality”:

We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.

Yusuf Mehdi, Senior Vice President, Online Services Division, Bing: “Setting the record straight”:

Google engaged in a “honeypot” attack to trick Bing. In simple terms, Google’s “experiment” was rigged to manipulate Bing search results through a type of attack also known as “click fraud.” That’s right, the same type of attack employed by spammers on the web to trick consumers and produce bogus search results. What does all this cloak and dagger click fraud prove? Nothing anyone in the industry doesn’t already know. As we have said before and again in this post, we use click stream optionally provided by consumers in an anonymous fashion as one of 1,000 signals to try and determine whether a site might make sense to be in our index.

Matt Cutts, Head of Webspam, Google: “My thoughts on this week’s debate”:

Something I’ve heard smart people say is that this could be due to generalized clickstream processing rather than code that targets Google specifically. I’d love if Microsoft would clarify that, but at least one example has surfaced in which Microsoft was targeting Google’s urls specifically. The paper is titled Learning Phrase-Based Spelling Error Models from Clickthrough Data and here’s some of the relevant parts:

The clickthrough data of the second type consists of a set of query reformulation sessions extracted from 3 months of log files from a commercial Web browser [I assume this is Internet Explorer. –Matt] …. In our experiments, we “reverse-engineer” the parameters from the URLs of these [query formulation] sessions, and deduce how each search engine encodes both a query and the fact that a user arrived at a URL by clicking on the spelling suggestion of the query – an important indication that the spelling suggestion is desired. From these three months of query reformulation sessions, we extracted about 3 million query-correction pairs.”

This paper very much sounds like Microsoft reverse engineered which specific url parameters on Google corresponded to a spelling correction. Figure 1 of that paper looks like Microsoft used specific Google url parameters such as “&spell=1” to extract spell corrections from Google. Targeting Google deliberately is quite different than using lots of clicks from different places.
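For readers who want the gist of that technique, here is a rough reconstruction of the idea in a few lines of Python. It is my own sketch, not Microsoft’s code: scan consecutive search URLs in a browsing session and, when the later one carries the spelling-suggestion parameter (e.g., “spell=1”), emit the (misspelled query, corrected query) pair.

```python
# A rough reconstruction of the signal the paper describes (my own sketch,
# not Microsoft's code): walk consecutive search URLs in a session and emit
# (original query, corrected query) pairs whenever the later URL carries a
# spelling-suggestion marker such as the "spell=1" parameter cited above.
from urllib.parse import urlparse, parse_qs

def extract_correction_pairs(session_urls):
    pairs = []
    for prev, curr in zip(session_urls, session_urls[1:]):
        prev_q = parse_qs(urlparse(prev).query)
        curr_q = parse_qs(urlparse(curr).query)
        if curr_q.get("spell") == ["1"] and "q" in prev_q and "q" in curr_q:
            pairs.append((prev_q["q"][0], curr_q["q"][0]))
    return pairs

session = [
    "http://www.google.com/search?q=machne+learning",
    "http://www.google.com/search?q=machine+learning&spell=1",
]
print(extract_correction_pairs(session))   # [('machne learning', 'machine learning')]
```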

Let me start by saying that these are very serious words from very serious people.

Amit and Matt, both of whom I know personally, are not just two of the most prominent Google employees — they have a deep personal investment in Google’s search quality. Amit is personally responsible for much of Google’s web search ranking algorithm, and Matt is surely the person whom spammers (and many SEO consultants) most love to hate. There is no question in my mind that the emotion both of them are expressing is sincere.

I haven’t met Harry or Yusuf, but I have no reason to doubt their own sincerity — especially since everything they are saying seems consistent with the facts — in fact, consistent with the substantive parts of Google’s allegations. Indeed, the facts don’t really seem to be in dispute. And more generally, I’ve met some of the folks who lead the Bing team (like Jan Pedersen), and, like Matt, I believe they are thoughtful and sincere and are devoted to building a great search engine of their own.

The debate is not about the facts. Rather, it’s about what is right and wrong. I will try to summarize the two sides’ positions without editorializing.

Bing is claiming that:

  • Users have a right to do as they please with their own clickthrough data, which includes data from Google search sessions.
  • Bing toolbar users opted in to share this clickthrough data with Bing.
  • By using this clickthrough data, Bing creates value for users.

Google is claiming that:

  • Bing’s specific targeting of Google clickthrough data amounts to copying Google and is wrong.
  • Bing toolbar users are not necessarily aware that they are complicit in this behavior.
  • Bing is disingenuous in understating how much it benefits from Google as a signal.

What do I think?

I agree with Bing that users have the right to do as they please with clickthrough data. I’d think Google would agree too, given that Google wrote the sermon on “the meaning of open”:

Open information means that when we have information about users we use it to provide something that is valuable to them, we are transparent about what information we have about them, and we give them ultimate control over their information.

I agree with all three of the points I listed as Google’s claims, except for the part about Bing’s behavior being wrong. It’s up to users if they want to help Bing compete with Google. Do users know that they’re doing so? Probably not. But would they stop doing so if they did? I doubt it. I can’t see why most users would have a dog in this fight — and in fact, it may be in users’ interest to help Bing be more competitive.

I do think Bing should be forthright about what it is doing — and how much this user-provided data from Google search sessions is contributing to its own quality improvements. Bing can, of course, keep this information secret, but I’d think that Bing would want to defend its reputation as an innovator — especially as the David in a David vs. Goliath fight.

But I also think that Google should be careful with its accusations. Accusing Bing of not being innovative is one thing, and that accusation, backed by concrete examples, is probably enough to score points. But implying that Google owns its users’ clickthrough data and that Bing has no right to solicit that data from users is another thing entirely.

I’m curious to hear what others here think. It’s been a while since I could freely express opinions about Google and Bing, so I’m delighted to have such a hot controversy to incite discussion. Because everyone enjoys a muddle puddle tweetle poodle beetle noodle bottle paddle battle!