Data By The People, For The People

I was fortunate this year not only to be able to attend the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) in Maui, but also to be invited as part of this year’s industry event, representing LinkedIn.

Above are the slides I presented on “Data By The People, For The People”. Enjoy!


Content, Connections, and Context

This is the keynote presentation I delivered at the Workshop on Recommender Systems and the Social Web, held as part of the 6th ACM International Conference on Recommender Systems (RecSys 2012):

Content, Connections, and Context 

Recommender systems for the social web combine three kinds of signals to relate the subject and object of recommendations: content, connections, and context.

Content comes first – we need to understand what we are recommending and to whom we are recommending it in order to decide whether the recommendation is relevant. Connections supply a social dimension, both as inputs to improve relevance and as social proof to explain the recommendations. Finally, context determines where and when a recommendation is appropriate.

I’ll talk about how we use these three kinds of signals in LinkedIn’s recommender systems, as well as the challenges we see in delivering social recommendations and measuring their relevance.

When I’m back from Dublin, I promise to blog about my impressions and reflections from the conference. In the meantime, I hope you enjoy the slides!


Strata 2012: Big Data is Bigger than Ever!

I spent the last three days at the O’Reilly Strata Conference, an assembly of over 2500 people focused on data science and its applications. While I’m wary of industry conferences from attending vendor-fests in my past life in the enterprise software world, Strata is an exceptionally good conference. The speakers were a who’s who of data science, including Lucene and Hadoop creator Doug Cutting, search user interface pioneer Marti Hearst, and Google chief economist Hal Varian. You can find the tweet stream for the conference at hash tag #strataconf.


I spent Tuesday in the Deep Data session, billed as a no-holds-barred program for data scientists. My two favorite talks:

  • Claudia Perlich, winner of three KDD cups, talked about using information to pick the right action and to influence people such that they behave in a way that is better for them, better for us, and possibly better for society in general.
  • Monica Rogati, my colleague at LinkedIn and the epitome of a data scientist, delivered a fantastic talk about machine learning models and training data in the real world, extending Peter Norvig’s point about the “unreasonable effectiveness of data” to observe that more data beats clever algorithms but better data beats more data.

But the most fun that day was the Oxford-style debate featuring Drew Conway, Pete Skomoroch, Mike Driscoll, DJ Patil, Amy Heineike, Pete Warden, and Toby Segaran. The question proposed was absurdly Manichean: if you had to hire your first data scientist and could only hire one, would you pick a domain expert or a machine learning expert? After the moderator suppressed some initial attempts to hedge (“both”, “it depends”, etc.), the debaters ripped into the question by taking extreme positions and defending them with gusto. It was a lot of fun, with enthusiastic audience participation and the debaters exploiting their inside knowledge of their opponents’ work histories. In the end, the machine learning side won by a small margin.

I then had the good fortune to grab dinner with Marti Hearst and Hal Varian at Xanh — a wonderful mix of great food and conversation.


The Wednesday morning keynote session offered some gems:

  • Cloudera CEO Mike Olson urged big data practitioners to focus on guns, drugs, and oil.
  • Doctor and data geek Ben Goldacre delivered a mesmerizing and disturbing talk about the suppression of inconvenient medical trial results and analytical tools to discover it.

But the person who stole the show was Google’s Avinash Kaushik, who talked about making love with data to find orgasm-inducing actions to change the world and make more money. Unfortunately this was the one talk that was not recorded, but you can read the summary on Avinash’s Google+ page.

As a speaker, I held “office hours” on Wednesday. It was supposed to be a 40-minute slot for conference attendees to come and ask me questions. But somehow those 40 minutes extended into three hours of conversation about everything from normalized KL divergence to interview problems, and then segued into a reception with specialty big-data cocktails. By the time I got back to my apartment, my voice, brain, and liver were spent.


I spent most of Thursday morning in the speaker lounge, recovering from the previous evening and putting the finishing touches on my presentation. But I couldn’t resist attending a two-part session on privacy. Indeed, this session was distinctive enough to merit its own hash tag: #strataprivacy.

The first part featured O’Reilly’s Alex Howard moderating Intelius Chief Privacy Officer Jim Adler and NYU PhD student Solon Barocas on a panel provocatively titled “If Data Wants to Be Free, is Privacy a Prison?” It was a great discussion, and I enjoyed the opportunity to offer my own provocative question through Twitter. Since the panelists were arguing that it was unethical to infer private facts from public data, I asked if they were trying to establish a new form of thoughtcrime.

The second panel, entitled “Pretty Simple Data Privacy”, featured Kaitlin Thaney from Digital Science, Betsy Masiello from Google, and John Wilbanks from the Kauffman Foundation for Entrepreneurship. Given that today was the first day of Google’s new privacy policy, there was no avoiding focus on the associated controversy. I did try to get Betsy to address my charge that Google doesn’t think users own their search history (cf. “Google vs. Bing: A Tweetle Beetle Battle Muddle”), but she said she was unfamiliar with the details of that event. I do wish that someone at Google with more familiarity would respond publicly.

Back to the speaker room after lunch, until my own talk with Samasource’s Claire Hunsaker on “Humans, Machines, and the Dimensions of Microwork”. I’ll post the slides (and there will be a video on the conference site), but the sound bite is that you need to keep crowdsourcing tasks simple, manage the trade-off between task value and difficulty, and watch out for systematic bias.

I wrapped up the conference by hearing William Gunn talk about how Mendeley is disrupting bibliometrics and perhaps the entire academic publishing and reputation ecosystem. I laud his ambition and wish him and Mendeley luck in this quest.

In summary, three days of great talks, conversations, and general enjoyment. My thanks to Strata organizers Edd Dumbill and Alistair Croll for putting together such an outstanding event and for giving me the opportunity to participate.


Identifying Influencers on Twitter

One of the perks of working at LinkedIn is being surrounded by intellectually curious colleagues. I recently joined a reading group and signed up to lead our discussion of a WSDM 2011 paper on “Identifying ‘Influencers’ on Twitter” by Eytan Bakshy, Jake Hofman, Winter Mason, and Duncan Watts. It’s great to see the folks at Yahoo! Research doing cutting-edge work in this space.

I thought I’d prepare for the discussion by sharing my thoughts here. Perhaps some of you will even be kind enough to add your own ideas, which I promise to share with the reading group.

I encourage you to read the paper, but here’s a summary of its results:

  • A user’s influence on Twitter is the extent to which that user can cause diffusion of a posted URL, as measured by reposts propagated through follower edges in Twitter’s directed social graph.
  • The best predictors of future total influence are follower count and past local influence, where local influence refers to the average number of reposts by that user’s immediate followers, and total influence refers to average total cascade size.
  • The content features of individual posts do not have identifiable predictive value.
  • Barring a high per-influencer acquisition cost, the most cost-effective strategy for buying influence is to target users of average influence.
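To make the two influence measures in the summary above concrete, here is a minimal sketch of how they could be computed from repost cascades. The data layout is an assumption made for illustration, not the paper's: each cascade is a seed user plus a list of hypothetical `(parent, child)` pairs, each reposter is attributed to exactly one parent, and cascade size counts reposts but not the original post.

```python
from collections import defaultdict
from statistics import mean


def cascade_counts(seed, repost_edges):
    """Return (local, total) repost counts for a single cascade.

    repost_edges is a list of (parent, child) pairs meaning that `child`
    reposted the URL after seeing it from `parent`. Assumes the cascade
    is a tree rooted at `seed` (one parent per reposter).
    """
    children = defaultdict(list)
    for parent, child in repost_edges:
        children[parent].append(child)

    # Local count: reposts by the seed's immediate followers.
    local = len(children[seed])

    # Total count: every repost reachable from the seed.
    total = 0
    stack = list(children[seed])
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(children[node])
    return local, total


def influence(cascades):
    """Average (local, total) counts over one user's cascades.

    cascades is a non-empty list of (seed, repost_edges) pairs.
    """
    counts = [cascade_counts(seed, edges) for seed, edges in cascades]
    local_influence = mean([local for local, _ in counts])
    total_influence = mean([total for _, total in counts])
    return local_influence, total_influence


# Illustrative data only: one post that spread two hops, one that went nowhere.
posts = [
    ("alice", [("alice", "bob"), ("alice", "carol"), ("bob", "dave")]),
    ("alice", []),
]
local_influence, total_influence = influence(posts)
```

In this toy data the two measures differ only because of the single depth-2 repost, which is the kind of gap the discussion below is about.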

Let’s dive in a bit deeper.

The definitions of influence and influencers are, by the authors’ own admission, narrow and arbitrary. There are many ways one could define influence, even within the context of Twitter use. But I agree with the authors that these definitions have enough verisimilitude to be useful, and their simplicity facilitates quantitative analysis.

It’s hardly surprising that past influence is a strong predictor of future influence. But it might seem counterintuitive that, for predicting future total influence, past local influence is more informative than past total influence. The authors suggest the explanation that most non-trivial cascades are of depth 1 — i.e., total influence is mostly local influence. But at most that would make the two features equally informative, and total influence should still be a mildly better predictor.

I suspect that another factor is in play — namely, that the difference between local influence and total influence reflects the unpredictable and rare virality of the content (e.g., a random Facebook Question generated 4M votes). If this hypothesis is correct, then past local influence factors out this unpredictable factor and is thus a better predictor of both future local influence and future total influence.

I’m a bit surprised that follower count supplies additional informative value beyond the past local influence; after all, local influence should already reflect the extent to which the followers are being influenced. It’s possible that past influence lags the follower count, since it does not sufficiently weigh the potential contributions of more recent followers. But another possibility is one analogous to the predictive value of past local vs. global influence: past local influence may include an unpredictable content factor which follower count factors out.

Of course, I can’t help suggesting that TunkRank might be a more useful indicator than follower count. Unfortunately the authors don’t seem to be aware of the TunkRank work — or perhaps they preferred to restrict their attention to basic features.

I’m not surprised by the inability to exploit content features to predict influence. If it were easy to generate viral content, everyone would do it. Granted, a deeper analysis might squeeze out a few features (like those suggested in the Buddy Media report), but I don’t think there are any silver bullets here.

Finally, the authors consider the question of designing a cost-effective strategy to buy influence. The authors assume that the cost of buying influence can be modeled in terms of two parameters: a per-influencer acquisition cost (which is the same for each influencer) and a per-follower cost for each influencer. They conclude that, until the acquisition cost is extremely high (i.e., over 10,000 times the per-follower cost), the most cost-efficient influencers are those of average influence. In other words, there’s no reason to target the small number of highly influential users.
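The two-parameter cost model described above can be sketched in a few lines. Everything numeric here is made up for illustration: the candidate names, follower counts, and expected cascade sizes are hypothetical and are not taken from the paper, so the point where the ranking flips in this toy example says nothing about the paper's threshold.

```python
def cost(followers, acquisition_cost, per_follower_cost):
    """Cost of buying one influencer: a fixed fee plus a per-follower fee."""
    return acquisition_cost + per_follower_cost * followers


def influence_per_dollar(expected_influence, followers,
                         acquisition_cost, per_follower_cost):
    """Expected influence bought per unit of spend."""
    return expected_influence / cost(followers, acquisition_cost,
                                     per_follower_cost)


def most_cost_effective(candidates, acquisition_cost, per_follower_cost):
    """Return the name of the candidate with the best influence per dollar.

    candidates is a list of (name, followers, expected_influence) tuples.
    """
    best = max(
        candidates,
        key=lambda c: influence_per_dollar(c[2], c[1], acquisition_cost,
                                           per_follower_cost),
    )
    return best[0]


# Hypothetical candidates: (name, followers, expected cascade size).
candidates = [
    ("ordinary", 200, 0.3),
    ("celebrity", 1_000_000, 300.0),
]

# With no fixed fee, the per-follower charge dominates.
cheap_entry = most_cost_effective(candidates, 0.0, 0.01)
# With a large fixed fee, the fee is amortized better over a big audience.
costly_entry = most_cost_effective(candidates, 1000.0, 0.01)
```

With these invented numbers, `cheap_entry` is the ordinary user and `costly_entry` is the celebrity, which mirrors the qualitative shape of the authors' argument: the fixed acquisition cost is what can make highly followed users worth targeting.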

The authors may be arriving at the right conclusion (Watts’s earlier work with Peter Dodds, which the paper cites, questions the “influentials” hypothesis), but I’m not convinced by their economic model of an influence market. It may be the case that professional influencers are trying to peddle their followers’ attention on a per-follower basis — there are sites that offer this model.

But why should anyone believe that an influencer’s value is proportional to his or her number of followers? The authors’ own work suggests that past local influence is a more valuable predictor than follower count, and again they might want to look at TunkRank.

Regardless, I’m not surprised that a fixed per-follower cost makes users with high follower counts less cost-effective, as I subscribe to its corollary: as a user’s follower count goes up, the per-follower value diminishes. I haven’t done the analysis, but I believe that the ratio of a user’s TunkRank to the user’s follower count tends to go down as a user’s follower count goes up. A more interesting research (and practical) question would be to establish a correctly calibrated model of influencer value and then explore portfolio strategies.

In any case, it’s an interesting paper, and I look forward to discussing it with my colleagues next week. Of course, I’m happy to discuss it here in the meantime. If you’re in my reading group, feel free to chime in. And if you’re not in my reading group, consider joining. We do have openings. 🙂


Social Networking: Theory and Practice

I’ve been a student of social network theory for years, enjoying the work of Duncan Watts, Albert-László Barabási, Jon Kleinberg, and a number of other researchers investigating this field. It should be no surprise that a topic that is so core to our humanity has attracted attention from some of our best and brightest.

And I’ve dabbled a bit on the theoretical side myself. The TunkRank measure (I’m indebted to Jason Adams for implementing it on a live site!) attempts to take the most basic assumption about our social behavior–the constraint that we have a finite attention budget–and explore its implications for influence over social networks. I have a few unexplored hypotheses queued up for when I can find the spare time to try to validate them empirically!
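For readers who want to see the attention-budget idea in code, here is a minimal sketch based on the recurrence usually given for TunkRank: a user's influence is the sum, over each follower, of (1 + p × that follower's influence) divided by the number of accounts the follower follows. The fixed-point iteration, the function name, the default value of `p`, and the iteration count are all assumptions chosen for illustration, not a description of the live implementation.

```python
def tunkrank(following, p=0.05, iterations=50):
    """Approximate TunkRank scores by repeated application of the recurrence.

    following maps each user to the set of users they follow.
    p is the assumed probability that a follower passes a post along.
    """
    users = set(following)
    for followed in following.values():
        users |= set(followed)

    influence = {user: 0.0 for user in users}
    for _ in range(iterations):
        updated = {user: 0.0 for user in users}
        for follower, followed in following.items():
            if not followed:
                continue
            # A follower's attention is split evenly across everyone
            # they follow: the finite attention budget.
            share = (1.0 + p * influence[follower]) / len(followed)
            for target in followed:
                updated[target] += share
        influence = updated
    return influence


# Illustrative graph: a follows c; b follows both a and c.
scores = tunkrank({"a": {"c"}, "b": {"a", "c"}, "c": set()})
```

In this small graph, b has no followers and so no influence, a receives half of b's attention, and c receives all of a's attention plus the other half of b's.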

But why settle for theory? We live in an age where social networks compete with web search (and perhaps complement search) as the hottest online technologies. If we’re not reading about Google vs. Bing, we’re reading about Facebook vs. Twitter, with LinkedIn offering a third way that seems to co-exist with its more storied peers. In this post, I’d like to focus on LinkedIn.

LinkedIn, despite its feature creep, is still fairly old-school: its raison d’être is for users to build, maintain, and exploit their professional networks. In theory, connections on LinkedIn represent present or past working relationships that become the basis for referrals–whether the goal is employment, sales, or partnership. LinkedIn is not the only professionally oriented social network, but at this point it’s certainly the dominant one.

But I’ve found at least two additional ways to use LinkedIn that I’d like to share:

Intelligence gathering. For reasons I don’t yet claim to understand, people share far more information about themselves–and in a much cleaner, structured form–on LinkedIn than in perhaps any other online medium. Most people’s resumes are not available online, but their LinkedIn profiles are tantamount to resumes. Moreover, their structured format makes it possible for LinkedIn to assemble aggregate profiles of companies, revealing composite pictures that must drive some of those companies’ legal and HR departments batty! At a higher level, LinkedIn also works well as a discovery tool–much more so now they’ve enabled faceted search. It’s still a bit tricky to explore people and companies by topic, but far more effective using LinkedIn than using any other tool I’m aware of.

Meeting new people. Cold-calling, spamming–pick your poison. In short, LinkedIn doesn’t have to only be about connecting with people you already know. But there’s an art to sending unsolicited messages: you have to pass the moral equivalent of a CAPTCHA by proving that your communication strategy isn’t indiscriminate. Let me use a personal example (that Maisha Walker was nice enough to write up in her Inc. magazine column). I decided that I wanted to find everyone on LinkedIn who might be interested in HCIR ’09. So I searched for everyone whose profiles indicated interests in both IR and HCI and sent out a targeted message (in fact, an invite with a personalized message–a feature I recently feared they’d killed). The results were overwhelmingly positive. I’m not sure how many of the people I contacted will attend, but I raised awareness without inflicting annoyance. Better yet, one of the people I contacted then discovered I was looking for volunteers to review the draft of my book–and I thus obtained hours of help from someone who, just a day before, had never heard of me!

What intrigues me about LinkedIn (and other social networks) is the extent to which I am exploiting attention market inefficiencies (as LinkedIn may be doing as well). For example, LinkedIn makes it easy to send unsolicited invitations to anyone. Granted, you can lose this privilege by even having a couple of people respond to invitations with “I don’t know this person”. There’s also the question of why people’s social norms around disclosure are so different on LinkedIn than anywhere else–people not only post the content of their resumes, but go through the effort of providing it to LinkedIn in a structured form! Meanwhile, LinkedIn keeps tightfisted control over the information it aggregates–understandably, they recognize that this content is their most valuable asset.

People are still getting used to the idea of social networks. It will be interesting to see how their use evolves, particularly in terms of information and attention market efficiency.


JCDL 2009

For the benefit of those of us not lucky enough to be attending this year’s Joint Conference on Digital Libraries (JCDL 2009), a number of attendees are live-tweeting the conference using the hashtag #jcdl2009. I’m sure there will be blog posts (like these), and I’ll try to round up what I can when the conference wraps up. I also understand that papers will eventually be available in the ACM Digital Library, and that authors are being encouraged to post their own papers on their web sites–if / when that happens, I’ll try to assemble a list here, at least of the ones that particularly catch my attention.


Guy Kawasaki, I’ll Say It

I just saw this post from a week ago by Andrew Goodman on Traffick asking “Is Guy Kawasaki Singlehandedly Ruining Twitter?”. Some context: Guy Kawasaki gave a keynote at the New York Search Engine Strategies conference last week in which he discussed the tactics he uses to “use Twitter as a twool”.

Of course, what galls me, at least if Goodman is reporting his speech accurately, is this:

he castigates people who don’t follow everyone back because they’re arrogant. By not “reciprocating,” non-followers are showing they “don’t care about their followers.”

Well, Kawasaki follows over 100,000 users, so he practices what he preaches. But, as Goodman points out:

The thing about Kawasaki’s follow-back habit is: it’s fake reciprocity. He isn’t actually following. Following everyone back is like the old idea of exchanging links with everyone and anyone, in the hopes of gaming Google. You don’t actually have any hope of really following 100,000 people, so instead, you hide behind TweetDeck and other apps. As Kawasaki points out, he does read all @replies and Direct Messages. But don’t believe that the “purpose of following everyone back is so people can direct message me.” The purpose is to get people used to the idea that a follow should be reciprocated with a follow. That way, folks who go out and follow 200,000 people have a greater chance of being followed by, say, 160,000.

Can you say “attention Ponzi scheme”? I sure can. I may have criticized A-list blogger Loic Le Meur in the past for suggesting that follower count implies authority, but at least he doesn’t play this fake reciprocity game–the 500 people he follows may be a bit more than Dunbar recommends, but are at least within the bounds of plausibility.

According to Goodman, Kawasaki kept trying to ingratiate himself by saying “well someone out there is going to say I’m a dick for saying this, but…”. Well, Guy, I’ll be the blowhard and say it, you’re being a dick. Every Ponzi scheme has its winners, and you’ve clearly cashed in on this one. I don’t begrudge you the attention you’ve accumulated. But please have the decency not to give advice that, as Goodman puts it, would turn Twitter into a “digital trailer park”.


It’s OK To Tweet

The other day, Owen Thomas at Valleywag smirked about the audience at Times Open that “sat and Twittered instead of listening to the speaker”. To which I say, take a look at our tweets and you’ll see that people were listening intently.

I’m glad that Congress isn’t reading Valleywag: CNN reports that members of Congress twittered through Obama’s big speech:

Members of Congress twittered their way through President Obama’s nationally televised speech Tuesday night, providing a first-of-its-kind running commentary that took users of the social networking site inside the packed House chamber.

I hope this mainstream use of Twitter inspires audiences to play a more active role not only as listeners but also contributors to the conversations that good speeches are designed to inspire.

Of course, there remains the question of establishing social norms for live audiences who are torn between looking at the speaker and typing. Ironically, I remember being yelled at in class for *not* taking notes! Perhaps the people who most need coaching are the speakers who have to face live-tweeting audiences. Here’s some advice on the subject from speaking expert Olivia Mitchell.


The Noisy Community

One of the great things about blogging is that it’s given me the chance to assemble a community of people whose interests overlap with my own. But so far that community only becomes interactive through the comment threads and, to a lesser extent, Twitter.

I’ve wondered if it would be worthwhile to invest in cultivating more of a sense of community at The Noisy Channel. I have no desire to replicate social network functionality available elsewhere, and I trust that most readers use some combination of LinkedIn, Facebook, and Twitter.

But I could do some things here that might be helpful:

  • Create an opt-in directory page that lists readers with a short tagline and a contact URL. For example, mine would be:

    Daniel Tunkelang
    Chief Scientist, Endeca

  • Post job descriptions targeted at readers, along with contact information.
  • Institute a regular appearance of guest blog posts.

Do any of the above appeal to people? The directory strikes me as the simplest first step. I’d love to find out more about who my readers are, and hopefully I can make it worth your while by directing some traffic in your direction. Conversely, a list of Noisy Channel readers with short descriptions sounds like just what the SEO doctor ordered. Of course, it would only take off if enough people are interested in being on this page.

I’m intrigued by the possibility of posting targeted job descriptions, but that’s only worthwhile if there are enough of them. I have no desire to compete with the large job sites!

Finally, guest posts are something I’ve talked about before, but somehow they have never taken off. But a directory might make it easier for me to know whom to ask.

If any of the above interest you, please give me feedback in the comments. I’ll take silence as a lack of interest.


Google Tech Talk: Reconsidering Relevance

Google has now posted a video of the talk to YouTube (the wait is over!), and I’ve posted the slides to Scribd and SlideShare. I’ve included speaker notes designed to make the talk completely self-contained.

I’d like to add that my hosts at Google NYC were very gracious, particularly considering that my material was more than a little critical of their approach to search and information retrieval.

Here is the abstract again as a reminder:

We’ve become complacent about relevance. The overwhelming success of web search engines has lulled even information retrieval (IR) researchers to expect only incremental improvements in relevance in the near future. And beyond web search, there are broad search problems where relevance still feels hopelessly like the pre-Google web.

But even some of the most basic IR questions about relevance are unresolved. We take for granted the very idea that a computer can determine which documents are relevant to a person’s needs. And we still rely on two-word queries (on average) to communicate a user’s information need. But this approach is a contrivance; in reality, we need to think of information-seeking as a problem of optimizing the communication between people and machines.

We can do better. In fact, there are a variety of ongoing efforts to do so, often under the banners of “interactive information retrieval”, “exploratory search”, and “human computer information retrieval”. In this talk, I’ll discuss these initiatives and how they are helping to move “relevance” beyond today’s outdated assumptions.