Categories
General

The Secret May Be To Keep Fewer Secrets

In light of the recent WikiLeaks saga and the various leaks that have plagued my former employer, I was musing the other day about whether leaks are inevitable as an organization grows.

I started off by considering a model where each individual in an organization leaks a particular piece of sensitive information with a constant probability p, and where acts of leakage are independent and identically distributed events. Now let’s consider what value of p leads to a 99% probability of leakage in an organization of n = 20,000 people. It’s less than 1/4000. In other words, even if each person in an organization can keep a secret with 99.98% reliability, almost all secrets will be leaked.

Using this same value of p with n = 900 (roughly the size of my current employer) yields less than a 20% chance of leakage — certainly not a zero probability, but much closer to zero than to one. And at n = 90 — the upper end of what I’d consider a startup — the probability of leakage drops to 2%. Based on this crude analysis, the ability to keep secrets drops very rapidly as organizations enjoy the growth that comes with success.
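The arithmetic behind these numbers is easy to check. Here is a quick sketch (Python used purely for illustration) of the independent-leakers model:

```python
# Probability that at least one of n independent keepers leaks,
# given a per-person leak probability p: 1 - (1 - p)^n
def leak_probability(n, p):
    return 1 - (1 - p) ** n

# Invert 1 - (1 - p)^n = 0.99 to find the p that yields a 99%
# chance of leakage at n = 20,000.
p = 1 - (1 - 0.99) ** (1 / 20000)
print(round(p, 6))                         # ≈ 0.00023, i.e. less than 1/4000
print(round(leak_probability(900, p), 3))  # ≈ 0.187, under 20%
print(round(leak_probability(90, p), 3))   # ≈ 0.021, about 2%
```

The same per-person reliability of roughly 99.98% thus yields wildly different outcomes at different organization sizes.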

Moreover, p is likely to be positively correlated with n — that is, individuals in larger organizations are more likely to leak sensitive information. Many people in larger organizations have less actual and perceived stake in the organization’s success than those in smaller ones. Also, it is difficult to sustain grueling hiring standards — particularly cultural ones — as an organization grows.

So what is an organization to do? If the above model is even close to accurate, then I can see four options:

1) Don’t grow.

Yes, I’m serious. Not every idea inspires a billion-dollar business, and not every company should grow beyond a hundred people. Growth has costs that offset its benefits, and the inability to keep secrets may be a significant cost for organizations whose competitive advantage depends on proprietary intellectual property. The largest hedge funds each have about 1,000 employees, and most are much smaller. Secrecy is not the only consideration, but it’s certainly a consideration.

2) Share less with your employees.

If you can’t reduce p, you can at least reduce n by sharing secrets less widely. Traditional organizations only share sensitive information within a tight inner circle. Even Google, known for sharing almost everything with its employees, keeps tighter control over the details of search result ranking. This approach, however, comes at a cost: it signals to employees that they cannot be trusted. Moreover, if employees discover secret information through rumor, they may feel less responsible for maintaining secrecy than if they had been entrusted with that information.

3) Investigate leaks and punish leakers.

Some organizations succeed better than others at rooting out leakers and punishing them. In economic terms, it makes sense to discourage undesirable behavior through strong disincentives. Note, however, that leakers rarely gain anything tangible in exchange for their leaks and indeed are often acting irrationally in strictly economic terms. People in general have been known to act irrationally. So I’d caution against any approach that assumes human rationality. A better approach may be to detect or prevent leaks through technology (e.g., packet analyzers), but see the previous comment about making employees feel they cannot be trusted.

4) Keep fewer secrets.

A prominent CEO recently said, “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place”. Yes, I’m taking the quotation out of context, but I’d like to offer a variant: if your organization’s success depends on something that you don’t want anyone to know, maybe you should reconsider your business model. Less glibly, you should avoid unnecessary dependence on secrecy, and you should avoid labeling all corporate information as secret, since that desensitizes employees to the risks of disclosure.

Conclusion? As Ben Franklin said, “Three may keep a secret, if two of them are dead.” Organizations can and do manage to keep secrets. But it’s hard to fight human nature, and better not to rely on winning that fight.

CIKM 2011 Industry Event

CIKM 2011 is nearly a year away, but I wanted to give folks a heads up about the Industry Event there that I am organizing with Tony Russell-Rose. These events have become an increasingly important part of the annual CIKM and SIGIR conferences, and I believe they are helping to bridge the gap between scholarship and practice. When I organized the SIGIR 2009 Industry Event, it was almost too popular — I felt bad for the parallel research presentations that had to compete with Matt Cutts and danah boyd for attendees!

But not so bad that I wouldn’t do it again! We have an outstanding line-up of invited talks for the CIKM 2011 Industry Event, featuring:

For those not familiar with industry luminaries, that list includes one of the world’s most prominent information retrieval researchers, the founder of Metaweb (which created Freebase), the person who built Facebook’s data team (which developed Hive and Cassandra), and one of the leading industrial researchers on natural language processing. To borrow a sports metaphor, these were our first-round draft picks, and we are delighted that they all agreed to participate.

And those are just the keynotes! We’re also going to put out a call for participation soon, so watch this space!

First Week

It’s hardly surprising, at least in retrospect, that location-based social networking company Foursquare was founded (twice!) in New York City. Where else (at least in the United States) are there so many people with so many places to go and so many ways to get there? I’m not a social or environmental determinist, but clearly a startup needs hospitable conditions to thrive.

Having just started my new life as a citizen of Silicon Valley, I’ve quickly comprehended how it is the perfect birthplace for LinkedIn. Every introduction has been an exercise in triadic closure. Indeed, while most people know that the Bay Area is the world’s leading hub for technology startups, perhaps not everyone realizes that the foundation for this environment is the professional network that binds it. I’ve only been here for a week, and yet my world seems smaller by the day as I keep discovering new connections among my colleagues. It’s a lot of fun, if a bit overwhelming!

And fun but overwhelming is a great way to describe LinkedIn itself. It’s only been a few days since I updated my profile, but I already feel immersed in LinkedIn’s vibrant culture. I sit in an open office, surrounded by people I work with — data scientists, software engineers, product managers, designers, and more. And I’m already interviewing folks I might be working with soon — in a company growing as quickly as LinkedIn, it is everyone’s job to grow the team. I’ve joked to friends that moving west gave me three more hours to get work done — but I’m using them all and they’re not enough!

But despite this explosive growth, LinkedIn’s vision is shared and tight. We all know that our goal is to connect the world’s professionals to make them more productive and successful. Having such a clear-cut mission enables us to directly relate all of our efforts and ambitions to the concrete value they create. It’s a great feeling, and it helps me keep my sanity as I observe the size of my ever-increasing to-do list.

To say that I’m still adjusting is an understatement. I haven’t made a change like this in over a decade, and this adventure feels even more immersive. But a big difference between now and 1999 is that I arrive in my new world with a network of people there to welcome me. I have LinkedIn to thank for helping me develop that network, and it’s great to finally have the opportunity to give back.

Follow The Data

Today is my last day at Google. I have enjoyed an incredible year there, during which I’ve had the privilege to work with some of the smartest engineers on the planet. Working at Google taught me how much impact a handful of dedicated people can have on the lives of billions of users. Not that long ago, I compared Google to McDonald’s. Having spent time on the inside, I can attest that Google is a marvel of scale orchestration. Moreover, the Google New York office represents an impressive concentration of Google’s talent in the greatest city in the world.

But I am leaving Google to pursue the opportunity of a lifetime. On Monday, I will start a new chapter of my life. I am joining the data scientist team at LinkedIn, where I’ll be working with DJ Patil and his world-class team to build products and discover insights from a data collection that I have coveted for years. I’ll get to work with folks like Pete Skomoroch and Monica Rogati. And I’ll get to tackle challenges in my favorite areas of computer science: information extraction, matching, recommendation, social network analysis, and network visualization. Not to mention working with one of the largest faceted search deployments on the web!

It was an agonizing decision to leave Google and New York City. But, when LinkedIn reached out to me a couple of months ago, I was reminded of a fateful email from Steve Papa in July 1999 that led me to pack a bag two months later and begin the adventure that is now Endeca. LinkedIn is hardly a startup — it has over 600 employees and over 80 million members. But I see boundless opportunities to create new value from the great data and talent that LinkedIn has assembled. So, when I received that note from LinkedIn, I didn’t really have a choice.

This Monday, I begin a new adventure. Data, here I come!

Giving Thanks as an Information Scientist

As a first-generation American who is married to a card-carrying Native American, I celebrate Thanksgiving the traditional way: a day of gluttony followed by yummy leftovers. But, trite as it may be, I do like to take the time to reflect on the countless things for which I am thankful. A wonderful family, of course, but also the great fortune to live in an age where some of the subjects that I find most intellectually stimulating have become highly relevant to our practical daily lives.

Consider information retrieval. Perhaps I’m dating myself, but as an undergraduate computer science major, I hardly imagined that information retrieval would have much significance outside of academia. Sure, there were commercial IR systems being built in the 1980s, but it wasn’t until the late 1990s that web search brought IR to the mainstream. Today, it’s hard to imagine studying computer science without learning about IR. Sure, my career makes me a tad biased, but it is undeniable that information retrieval is one of the defining problems of our generation.

And then there are social networks. When I studied graph drawing in the 1990s, the canonical example of a social network was “Six Degrees of Kevin Bacon”. Sure, many of my peers would talk about their Erdős numbers (they were more discreet about their placement in the Tarjan graph), but the study of social networks was surely an academic pursuit. Who would have imagined that, barely a decade later, a movie entitled The Social Network would be a blockbuster grossing $175M? Leaving aside Hollywood, social networks have become a significant part of our daily lives. Not only do Facebook, Twitter, and LinkedIn account for a large fraction of our time online, but they also affect our offline personal and professional lives.

From childhood, I’ve been interested in mathematics, computer science, and psychology. Living in an age of information retrieval and social networks means that I can apply these interests in my daily work. Today I give thanks for being born at the right place and right time, blessed with a lifetime of interesting and practical problems to solve. Happy Thanksgiving to all, and enjoy the leftovers!

An Information Cascade

I’ve been reading Networks, Crowds, and Markets, a great textbook by David Easley and Jon Kleinberg. I’m very grateful to Cambridge University Press for surprising me with an unsolicited review copy. I’m more than halfway through its 700+ pages. Much of the material in this “interdisciplinary look at economics, sociology, computing and information science, and applied mathematics to understand networks and behavior” is familiar. But I’m delighted by much that is new to me, including a particularly elegant description of an information cascade.

I excerpt the following example from section 16.2, which the authors in turn borrow from Lisa Anderson and Charles Holt:

The experimenter puts an urn at the front of the room with three marbles in it; she announces that there is a 50% chance that the urn contains two red marbles and one blue marble, and a 50% chance that the urn contains two blue marbles and one red marble…one by one, each student comes to the front of the room and draws a marble from the urn; he looks at the color and then places it back in the urn without showing it to the rest of the class. The student then guesses whether the urn is majority-red or majority-blue and publicly announces this guess to the class.

Let’s simulate how a set of rational students would perform in this experiment.

The first student has it easy: if he selects a blue marble, he guesses blue; if he selects a red marble, he guesses red. Either way, his guess publicly discloses the first marble’s color.

Thus the second student knows exactly the colors of the first two selected marbles. If he selects the same color as the first student, he will make the same guess. If, however, he selects a marble of the other color, he has no reason to prefer one color over the other. Let’s assume that, when the odds are 50/50, an indifferent student breaks symmetry by guessing the color in his hand. That way, we guarantee that the second student also discloses the color of the marble he selects.

Things get interesting with the third student’s selection. What happens if the first two students have both guessed red, but the third student selects a blue marble? Rationally, the third student will guess red, since he knows that two of the first three selected marbles were red. In fact, if the first two students select red marbles, *every* subsequent student will ignore his own selection and guess red. Of course, analogous reasoning applies if we reverse the colors.

Generalizing from this case, we can see that the sequence of guesses locks in on a single color as soon as the count for one color is ahead of the other by two. I leave it as an exercise to the reader to determine that, if the urn is majority-red, there is a 4/5 probability that the sequence will converge to red and a 1/5 probability that it will converge to blue.

A 1/5 probability of arriving at the wrong answer may not seem so bad. But imagine if you could see the actual marbles sampled and not just the guesses (i.e., each student provides an independent signal). The law of large numbers kicks in quickly, and the probability of the sample majority color being different from the true majority converges to 0.
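A quick simulation makes both points concrete: the sequence of guesses converges to red about 4/5 of the time, while the majority of the raw draws (which only an omniscient observer sees) is almost always red. This is a sketch of the experiment as described above, not anything from the textbook itself:

```python
import random

def simulate(n_students=50, p_red=2/3, seed=0):
    """One run of the urn experiment with a majority-red urn.

    Returns (final_guess, majority_of_raw_draws). Before a cascade,
    each guess discloses the student's draw (indifferent students
    break ties with the marble in hand); once one color leads by
    two, everyone ignores their own draw and follows the crowd.
    """
    rng = random.Random(seed)
    lead = 0   # net public evidence: (# red) - (# blue) among disclosed draws
    reds = 0   # raw red draws, never visible to the students
    guess = None
    for _ in range(n_students):
        draw_red = rng.random() < p_red
        reds += draw_red
        if abs(lead) >= 2:                      # cascade: private draw ignored
            guess = 'red' if lead > 0 else 'blue'
        else:                                   # guess discloses the draw
            guess = 'red' if draw_red else 'blue'
            lead += 1 if draw_red else -1
    return guess, ('red' if reds * 2 > n_students else 'blue')
```

Over many trials, the cascade reaches the correct answer roughly 80% of the time, while the raw-draw majority is correct over 99% of the time — exactly the gap between dependent guesses and independent signals.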

This example of an information cascade is unrealistically simple, but is eerily suggestive of the way many sequential decision processes work. I hope we all see it as a cautionary tale. The wisdom of the crowd breaks down when we throw away the independent signals of its participants.

The Element of Surprise

Surprise is not a word that user interface designers typically like to hear. Indeed, the principle of least surprise (also called the principle of least astonishment) holds that a system should always act in the way that least surprises the user.

Like many interface design principles, the principle of least surprise reflects the premise that software applications exist to be useful. In utility-oriented applications, surprise means distraction and delay — negatives that good designers work to avoid.

But we increasingly see applications whose main value to the user is not utility, but entertainment. Indeed, a recent Nielsen report claims that the top two online activities for Americans are social networks / blogs and games. I take the report with a grain of salt, but it seems safe to argue that people have come to expect the internet to be at least as fun as it is useful.

Even search, which would seem to be the poster child for the utility of online services, is being pressed into the service of entertainment. Max Wilson and David Elsweiler argued as much in their HCIR 2010 presentation about “casual leisure searching”. They mined Twitter to analyze a variety of scenarios where search isn’t about the user finding something, but rather about enjoying the experience. Indeed, their controversial definition of search is broad enough to include the possibility that the user does not have an information need.

Like the businessman in Antoine de St. Exupery’s Le Petit Prince, I’ve long felt that, as “un homme sérieux”, my job is delivering utility to users. Users already have lots of ways to waste time; I focus on making their productivity-oriented time more effective and efficient. I’m glad there are folks who devote their lives to making the rest of us have more fun (especially all the computer scientists who left academia for Pixar), but entertainment simply isn’t a vocation for me.

However, I’ve been coming around to the realization that fun and utility are not mutually exclusive. For example, news serves the utilitarian ideal of informing the citizenry, but many (most?) of us read news as a pleasant way to pass the time. Social networks are another example serving a similar function–perhaps with a balance tilted more toward the entertainment end of the spectrum, but still providing genuine social utility.

A common feature of both of these examples is that users regularly return to the same site expecting the unexpected. The transient nature of news and social news feeds promises an endless supply of fresh content, produced more quickly than users can consume it. This situation is in stark contrast to those of typical web search queries, for which the results are expected to be largely static. Indeed, we may set up alerts to inform us of novel search results, but we are unlikely to regularly visit a bookmarked search results page the way we regularly visit a news or social network site.

Is novelty the only source of surprise? Novelty certainly helps, but it is not a necessity. An alternative source is randomness. I’ve known people to use Wikipedia’s “random article” feature. But a more plausible place to introduce randomness is in recommendations — whether for products or content. Since recommendations are good guesses at best, a bit of randomness can help ensure that the guesses stay interesting. Indeed, a SIGIR 2010 paper by Neal Lathia, Stephen Hailes, Licia Capra, and Xavier Amatriain on “Temporal Diversity in Recommender Systems” explored the use of randomness to induce diversity in recommendations and arrived at the conclusion that people don’t like being recommended the same things over and over again.

Can we generalize from these examples? I think so. For utility-oriented information needs, it is important to provide users with accurate, predictable, and efficient tools. But we can’t dismiss everything else as frivolous. Sometimes we just need to offer our users a little bit of surprise to keep it interesting.

Or, as Mary Poppins tells us: “In every job that must be done, there is an element of fun. You find the fun, and – SNAP – the job’s a game!”

A Question of User Expectations

Ideally, a search engine would read the user’s mind. Shy of that, a search engine should provide the user with an efficient process for expressing an information need and then provide the user with results relevant to that need.

From an information scientist’s perspective, these are two distinct problems to solve in the information seeking process: establishing the user’s information need (query elaboration) and retrieving relevant information (information retrieval).

When open-domain search engines (i.e., web search engines) went mainstream in the late 1990s, they did so by glossing over the problem of query elaboration and focusing almost entirely on information retrieval. More precisely, they addressed the query elaboration problem by requiring users to provide reasonable queries and search engines to infer information needs from those queries. In recent years, there has been more explicit support for query elaboration–most notably in the form of type-ahead query suggestions (e.g., Google Instant). There have also been a variety of efforts to offer related queries as refinements.

But even with such support, query elaboration typically yields an informal, free-text string. All vocabularies have their flaws, but search engines compound the inherent imprecision of language by not even trying to guide users to a common standard. At best, query suggestion nudges users towards more popular–and hopefully more effective–queries.

In contrast, consider closed-domain search engines that operate on curated collections, e.g., the catalog search for an ecommerce site. These search engines often provide users with the opportunity to express precise queries, e.g., black digital cameras for under $250. Moreover, well-designed sites offer users faceted search interfaces that support progressive query elaboration through guided refinements.
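The difference is easiest to see in code. Here is a minimal sketch of what such a precise, structured query looks like when applied to a curated catalog — the schema and product names are invented for illustration, not any particular engine’s API:

```python
from dataclasses import dataclass

@dataclass
class Product:
    title: str
    color: str
    category: str
    price: float

# A precise, structured query: explicit facet constraints instead of
# an informal free-text string.
query = {"color": "black", "category": "digital camera", "max_price": 250.0}

def matches(p, q):
    """A product matches when it satisfies every facet constraint."""
    return (p.color == q["color"]
            and p.category == q["category"]
            and p.price <= q["max_price"])

catalog = [
    Product("Acme Z100", "black", "digital camera", 199.99),
    Product("Acme Z900", "silver", "digital camera", 449.00),
    Product("Lumo Pocket", "black", "digital camera", 289.00),
]
results = [p.title for p in catalog if matches(p, query)]  # ["Acme Z100"]
```

Because every constraint names a curated metadata field, the query is unambiguous — which is exactly what the open web’s lack of metadata makes hard.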

Many (though not all) closed-domain search engines have an advantage over their open-domain counterparts: they can rely on manually curated metadata. The scale and heterogeneity of the open web defies human curation. Perhaps we’ll reach a point when automatic information extraction offers quality competitive with curation, but we’re not there yet. Indeed, the lack of good, automatically generated metadata has been cited as the top challenge facing those who would implement faceted search for the open web.

What can we do in the meantime? Here is a simple idea: use a closed-domain search engine to guide users to precise queries, and then apply the resulting queries to the open web. In other words, mash up the closed and open collections.

Of course, this is easier said than done. It is not at all clear if or how we can apply a query like “black digital cameras for under $250” to a collection that is not annotated with the necessary metadata. But we can certainly try. And our ability to perform information retrieval from structured queries will improve over time–in fact, it may even improve more quickly if we can start to assume that users are being guided to precise, unambiguous queries.

Even though result quality would be variable, such an approach would at least eliminate a source of uncertainty in the information seeking process: the user would be certain of having a query that accurately represented his or her information need. That is no small victory!

I fear, however, that users might not respond positively to such an interface. Given the certainty that a query accurately represents his or her information need, a user is likely to have higher expectations of result quality than without that certainty. Retrieval errors are harder to forgive when the query elaboration process eliminates almost any chance of misunderstanding. Even if the results were more accurate, they might not be accurate enough to satisfy user expectations.

As an HCIR evangelist, I am saddened by this prospect. Reducing uncertainty in any part of the information seeking process seems like it should always be a good thing for the user. I’m curious to hear what folks here think of this idea.

Pluralistic Ignorance and Bayesian Truth Serum

Last week, I had the pleasure of talking with CMU professor George Loewenstein, one of the top researchers in the area of behavioral economics. I mentioned my idea of using prediction markets to address the weaknesses of online review systems and reputation systems, and he offered two insightful pointers.

The first pointer was to the notion of pluralistic ignorance. As summarized on Wikipedia:

In social psychology, pluralistic ignorance, a term coined by Daniel Katz and Floyd H. Allport in 1931, describes “a situation where a majority of group members privately reject a norm, but assume (incorrectly) that most others accept it…It is, in Krech and Crutchfield’s (1948, pp. 388–89) words, the situation where ‘no one believes, but everyone thinks that everyone believes'”. This, in turn, provides support for a norm that may be, in fact, disliked by most people.

It had not occurred to me that pluralistic ignorance could wreak havoc on the prediction market approach I proposed. Specifically, there is a risk that, even though the majority of participants in the market hold a particular opinion, they suppress their individual opinions and instead vote based on mistaken assumptions about the collective opinion of others. Ironically, these participants are pursuing an optimal strategy, given their pluralistic ignorance. Yet the results of such a market would not necessarily reflect the true collective opinion of participants. Clearly there is a need to incorporate people’s true opinions into the equation, and not just their beliefs about others’ opinions.

Which leads me to the second resource to which Loewenstein pointed me: a paper by fellow behavioral economist and MIT professor Drazen Prelec entitled “A Bayesian Truth Serum for Subjective Data“. As per the abstract:

Subjective judgments, an essential information source for science and policy, are problematic because there are no public criteria for assessing judgmental truthfulness. I present a scoring method for eliciting truthful subjective data in situations where objective truth is unknowable. The method assigns high scores not to the most common answers but to the answers that are more common than collectively predicted, with predictions drawn from the same population. This simple adjustment in the scoring criterion removes all bias in favor of consensus: Truthful answers maximize expected score even for respondents who believe that their answer represents a minority view.

Most of the paper is devoted to proving, subject to a few assumptions, that the optimal strategy for players in this game is to tell what they believe to be the truth–that is, the truth-telling strategy is the optimal Bayesian Nash equilibrium for all players.
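As I understand the paper, the score combines an “information” term (rewarding answers that are more common than collectively predicted, using a geometric mean of the predictions) with an alpha-weighted “prediction” term (rewarding accurate predictions of the population’s answers). Here is a rough sketch of that scoring rule — my own reading of the formula, not code from Prelec:

```python
import math

def bts_scores(answers, predictions, alpha=1.0):
    """Sketch of Bayesian Truth Serum scoring.

    answers[r]     -- index of the answer chosen by respondent r
    predictions[r] -- r's predicted population frequencies (sums to 1)
    Returns one score per respondent: information score plus
    alpha-weighted prediction score.
    """
    n = len(answers)
    k = len(predictions[0])
    eps = 1e-9  # guard against log(0)
    # Empirical answer frequencies
    x = [sum(1 for a in answers if a == j) / n for j in range(k)]
    # Log of the geometric mean of predicted frequencies
    log_p = [sum(math.log(max(predictions[r][j], eps)) for r in range(n)) / n
             for j in range(k)]
    scores = []
    for r in range(n):
        # Reward answers that are "surprisingly common"
        info = math.log(max(x[answers[r]], eps)) - log_p[answers[r]]
        # Reward predictions close to the empirical frequencies
        pred = sum(x[j] * math.log(max(predictions[r][j], eps) / max(x[j], eps))
                   for j in range(k))
        scores.append(info + alpha * pred)
    return scores
```

For example, if two of three respondents answer 0 while everyone predicts a 50/50 split, the answer 0 is more common than predicted, so the respondents who gave it score higher than the dissenter.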

The assumptions are as follows:

  1. The sample of respondents is sufficiently large that a single answer cannot appreciably affect the overall results.
  2. Respondents believe that others sharing their opinion will draw the same inferences about population frequencies.
  3. All players assume that other players are responding truthfully–which follows if they are rational players.

Prelec sums up his results as follows:

In the absence of reality checks, it is tempting to grant special status to the prevailing consensus. The benefit of explicit scoring is precisely to counteract informal pressures to agree (or perhaps to “stand out” and disagree). Indeed, the mere existence of a truth-inducing scoring system provides methodological reassurance for social science, showing that subjective data can, if needed, be elicited by means of a process that is neither faith-based (“all answers are equally good”) nor biased against the exceptional view.

Unfortunately, I don’t think that Prelec’s assumptions hold for most online review systems and reputation systems. In typical applications (e.g., product and service reviews on sites like Amazon and Yelp), the input is too sparse to even approximate the first assumption, and the other two assumptions probably ascribe too much rationality to the participants.

Still, Bayesian truth serum is a step in the right direction, and perhaps the approach (or some simple variant of it) applies to a useful subset of real-world prediction scenarios. Certainly it gives me hope that we’ll succeed in the quest to mine “subjective truth” from crowds.

LinkedIn Signal = Exploratory Search for Twitter

I like Twitter. Yes, I know that a lot of its content is noise. But I’ve found Twitter to be a useful professional tool for both publishing and consuming information. Publishing to Twitter is the easy part: I publish links to my blog posts and occasionally engage in public conversations.

Consuming information from Twitter is more of a challenge. I follow 100 people, which is about the limit of my attention budget. I use saved searches to track long-term interests (much as I use web and news alerts), and I perform ad hoc searches when I am interested in finding out what people are saying about a particular topic.

But Twitter search is not a great fit for analysis or exploration–unless you count trending topics as analysis. Originally, the search results were simply the tweets matching the query, in order of recency. The current system sometimes promotes a few “top tweets” to the top of the results. Still, if you’d like to get a summary view, slice and dice the results, or perform any other sort of HCIR task, you’re out of luck.

Until now.

The LinkedIn Search, Network, and Analytics team–the same folks that built LinkedIn’s faceted search system and developed open-source search tools Zoie and Bobo–just introduced a service called Signal that is squarely aimed at folks like me who use Twitter as a professional tool. It is still in its infancy (in private beta, in fact), but I think it has the potential to dramatically change how people like me use Twitter. You can learn more about its architecture and implementation details here.

Signal joins the often cacophonous Twitter stream to the high-quality structured data that LinkedIn knows about its own users. For example, when I post a tweet, LinkedIn knows that I am in the software industry, work at Google, and live in New York. LinkedIn can only make this connection for people who include Twitter ids in their LinkedIn profiles, but that’s a substantial and growing population.

Signal then lets you use this structured information to satisfy analytic and exploratory information needs. For example, I can see which companies’ employees are tweeting about software patents (top two are Google and Red Hat).

Or compare what Microsoft employees are saying about Android…

…to what Google employees are saying about Android.

As you can see on the right-hand side, Signal also mines shared links to identify popular ones relative to a given search–and allows you to see who has shared a particular link. This functionality is similar to Topsy, but with the advantage of allowing structured searches. Like Topsy, it wrangles the mass of retweeted links into a useful and user-friendly summary.

Signal is still very much in beta. An amusing bug that I encountered earlier today was that, due to some legacy issues in how LinkedIn standardized institution names, the system decided that I was an alumnus of the Longy School of Music rather than of MIT. Fortunately, that’s fixed now (thanks, John!)–I love karaoke, but I’m not ready to quit my day job!

Also, Signal only exposes a handful of LinkedIn’s facets, which limits the breadth of analysis and exploration. I’d love to see it add a past company facet, making it possible to drill down into what a company’s ex-employees are saying about a particular topic (e.g., their ex-employer).

Finally, while Signal offers Twitter hashtags as a facet, these are hardly a substitute for a topic facet. In order to provide one, LinkedIn needs to implement some kind of concept extraction (something I’d also love to see applied to their regular people search). This is a challenging information extraction problem, especially for the open web, but I also know from experience that it is tractable within a domain. Given LinkedIn’s professional focus, I believe this is a problem they can and should tackle.

Of course, LinkedIn also needs to convince more of its users to connect their Twitter accounts to their LinkedIn accounts–since that is Signal’s input source. But I suspect it’s mostly a matter of time and education–and hopefully the buzz around Signal will help raise awareness.

All in all, I see LinkedIn Signal as a great innovation and a big step forward for exploratory search and for Twitter. Congratulations to John Wang, Igor Perisic, and the rest of the LinkedIn search team on the launch!