The Noisy Channel


Semantic Link and Internet Evolution

April 19th, 2012 by Daniel Tunkelang


Recently I had a couple of great opportunities to share my thoughts publicly, and I wanted to make sure readers here were aware of them.

The first was a special guest appearance on The Semantic Link, a program hosted by Paul Miller with regular panelists Peter Brown, Christine Connors, Eric Franzon, Eric Hoffer, Bernadette Hyland, and Andraz Tori. It was a lot of fun, and a great warm-up for the keynote I’ll be delivering on “Scale, Structure, and Semantics” at the upcoming Semantic Tech & Business Conference (SemTechBiz), which will take place in San Francisco in June.

The second was a live interview on Internet Evolution, hosted by Mary Jander and Nicole Ferraro. They clearly did their homework, scouring my blog posts and web commentary for everything controversial I’d ever said — and then some! If that’s enough to pique your interest, then I encourage you to listen to the recorded interview and read the chat transcript at Internet Evolution.

Happy to answer questions based on either of these sessions — comment away!


Noah Iliinsky: Tech Talk on Designing Data Visualizations

April 18th, 2012 by Daniel Tunkelang

Note: This post was written by Yael Garten, a Senior Data Scientist at LinkedIn. Yael joined LinkedIn in 2011, where she leads our mobile analytics team. She previously worked at Stanford on text mining, personalized medicine, and biomedical informatics.

We live in an era of Big Data. But how do we use all of that data to answer questions and communicate those answers effectively?

My colleagues and I at LinkedIn were fortunate enough to hear answers from Noah Iliinsky, who literally wrote the book on designing data visualization.

Earlier this month, we hosted Noah at LinkedIn to give a tech talk on “Designing Effective Data Visualizations”. We are proud to make these tech talks open to the public, and enjoyed a great mix of attendees from local companies and universities. If you couldn’t attend the talk in person or remotely, I encourage you to watch the recording, embedded above.

Why do we visualize data? As Noah tells us, visualization makes data accessible. It gives us faster access to actionable insights and allows access to huge amounts of data. Visualization enables both data exploration (when you are still trying to discover the story) and data explanation (when you have a story to tell). Noah reviewed some great examples (watch the talk!), with an emphasis on the dos and don’ts of data visualization.

In particular, he provided a step-by-step framework for traversing the path from question to answer:

Phase 1: Decide what to visualize.

  • Understand the question your audience wants to answer.
  • Understand the actions they are hoping the answer will drive.
  • Consider who is consuming this data — their needs, biases, etc.
  • Decide what data to use — and what data not to use — and what relationships you are interested in.
  • Explore the data and construct a storyline.

Phase 2: Decide how to visualize it.

  • Use appropriate visual encodings for data and relationships.
  • Limit the data you include.
  • Use position for your most important relationship.
  • Try different axes.
  • Show your visualization to different people, without explanations. Show an expert, show a layman.
  • Iterate, iterate, iterate!

Noah also shared his thoughts on how to visualize social networks. He recommended useful tools for data visualization, including Tableau, Spotfire, D3, Processing, ggplot2, OmniGraffle, and OmniGraphSketcher.

Finally, he left us with key lessons to take home:
  • You are not your audience. This is a huge lesson that all of us must internalize to be great at what we do. Consider what you need to communicate to marketers, investors, members of the general public, etc.
  • Do user research! Understand your users’ hopes, dreams, and favorite flavors! Understand their identity, their jargon, culture, etc.
  • Remember that your success is defined by your customer’s success. If you can’t satisfy your customer’s needs, you have failed — no matter how insightful your analysis.

You can enjoy the talk by watching the embedded video above. And you can find more LinkedIn tech talks on our YouTube channel.


Data, Algorithms, and People

April 14th, 2012 by Daniel Tunkelang

One of the highlights of the recent Data 2.0 Summit was a panel featuring:

The panel was supposed to focus on “Data Science and Predicting the Future”, but the most contentious topic was whether data, algorithms, or people (that is, the data scientists themselves) were the most important factor in the practice and success of data science.

Yes, we one-upped the debate that my colleague Monica Rogati instigated at this year’s Strata conference. In fact, Josh cited the “better data beats more data beats clever algorithms” argument that Monica made in her own Strata presentation. And, just like at Strata, there was a healthy dose of audience participation.

Of course, I came down on the side of data — which I believe won the debate hands down.

I’m a fan of clever algorithms, which Alexander had to defend given that Skytree’s core value proposition is better machine learning algorithms delivered at scale. But I’m with Peter Norvig et al. on the dominance of data over algorithms.

Favoring data over people was a harder choice. Anthony naturally made the case for people (Kaggle’s claim to fame is assembling many of the world’s best data scientists by organizing competitions). Hopefully my team won’t quit en masse when they read this blog post! But I think they’ll agree with me that, without the incredible data we work with at LinkedIn, they’d be unable to deliver the awesomeness that I’ve come to expect from them.

There’s a saying that we all cook from the same cookbooks, so that it’s the ingredients that make all the difference. To take the metaphor further, you can also try to poach your rival’s chefs. But data is the biggest entry barrier — and the most sustainable competitive advantage.

Of course, we should have the best people apply the best algorithms to work with the best data. But data comes first. The best meal starts with the best ingredients.


Video of Strata 2012 Talk on Humans, Machines, and the Dimensions of Microwork

March 31st, 2012 by Daniel Tunkelang


The video of the presentation that Claire Hunsaker and I delivered on “Humans, Machines, and the Dimensions of Microwork” at Strata 2012 is now available as part of the complete video compilation. I’ve taken the liberty of uploading it to YouTube; feel free to watch the embedded video above.


Data 2.0 Summit

March 30th, 2012 by Daniel Tunkelang

I’ll be participating in the Data 2.0 Summit on Tuesday, April 3rd, and I hope to see some of you there. Last year, my colleague (and fellow LinkedIn data scientist) Scott Nicholson attended and wrote this guest post about it. This year, I’m not only attending but participating on a panel about social data, moderated by AllThingsD Senior Editor Liz Gannes.

There’s a great line-up of speakers for the day, including:

  • Bram Cohen, the founder, chief scientist, and inventor of BitTorrent, the leading peer-to-peer file sharing protocol for sharing large files on the Internet.
  • Michael Driscoll, CTO and co-founder of the Metamarkets Group. He moderated a fantastic debate at the recent Strata conference about the relative importance of domain expertise or machine learning for data scientists.
  • Gil Elbaz, the founder and CEO of Factual, an information marketplace. He is also the co-founder of Applied Semantics, which Google acquired in 2003 for $102M and turned into the foundation for AdSense (now a $10B business).
  • Anthony Goldbloom, co-founder and CEO of Kaggle, a platform for data science competitions that generated a lot of discussion at Strata.
  • Stefan Weitz, director of search at Bing. He’ll be on my panel. Also see the discussion I had with him in the comment thread for a post on “Why Are People So Clueless About Search?”.

And lots more, but you get the idea. I’m thrilled to be part of such a talent-heavy program and looking forward to insightful discussions with fellow panelists and attendees. Also a great excuse to spend a day in the city (note for my former townspeople: that’s what they call San Francisco around here).


Claudia Perlich: Tech Talk on Real-Time Bidding Optimization

March 22nd, 2012 by Daniel Tunkelang

Conventional wisdom holds that physical compliments are counter-productive as pick-up lines. Indeed, a dating site did some analysis showing a negative correlation between such compliments and the probability of a positive response.

But, as m6d Chief Scientist and 3-time KDD Cup winner Claudia Perlich explained in her recent talk at LinkedIn, we have to watch out for confounding variables. In the dating scenario above, beauty is a confounding variable: it determines both the probability of getting a positive response and the probability of a suitor offering physical compliments. Hence, we need to control for the actual beauty, or it can appear that making compliments is a bad idea.
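To make the confounding argument concrete, here is a minimal sketch in Python. The counts are invented purely for illustration and do not come from the talk or from any dating site; the group labels and the function names (response_rate, pooled_rates, stratified_rates) are likewise hypothetical. It compares response rates with and without a compliment, first pooled across everyone and then separately within each recipient group.

```python
# Hypothetical counts, invented for illustration only (not data from the talk).
# Each row: (recipient group, message included a compliment?,
#            messages sent, positive responses)
COUNTS = [
    ("rated more attractive", True, 800, 80),
    ("rated more attractive", False, 200, 20),
    ("rated less attractive", True, 200, 60),
    ("rated less attractive", False, 800, 240),
]


def response_rate(rows):
    """Positive responses divided by messages sent, over the given rows."""
    sent = sum(row[2] for row in rows)
    positive = sum(row[3] for row in rows)
    return positive / sent


def pooled_rates(counts):
    """Response rate with and without a compliment, ignoring the confounder."""
    return {
        flag: response_rate([row for row in counts if row[1] == flag])
        for flag in (True, False)
    }


def stratified_rates(counts):
    """The same comparison, made separately within each recipient group."""
    groups = sorted({row[0] for row in counts})
    return {
        group: {
            flag: response_rate(
                [row for row in counts if row[0] == group and row[1] == flag]
            )
            for flag in (True, False)
        }
        for group in groups
    }


if __name__ == "__main__":
    print("pooled:", pooled_rates(COUNTS))
    print("stratified:", stratified_rates(COUNTS))
```

With these made-up numbers, the pooled comparison shows a 14% response rate with a compliment versus 26% without, which looks like compliments hurt. Within each group, though, the rates are identical (10% versus 10%, and 30% versus 30%). The apparent effect comes entirely from who receives the compliments, which is what controlling for the confounder reveals.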

Perlich does not work on online dating, but rather in the data-driven world of online advertising. Specifically, she and her team work on real-time bidding optimization.

Perlich described a variety of design choices that have general applicability to data science problems. For example, her team used hashed tokens of previously visited URLs, rather than the URLs themselves, as features for their machine learning models. They avoided the use of personally identifying information (PII) or even demographic information about their users. These decisions were counterintuitive — typically, more data leads to better results. But Perlich found that these restrictions did not sacrifice accuracy, and had the further benefit of keeping their approach general rather than application- or customer-specific.
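For readers unfamiliar with hashed features, here is a rough sketch of the general idea in Python. It is not a description of m6d's actual pipeline: the tokenization rule, the choice of SHA-256, the bucket count, and the function names (tokenize, bucket, hashed_features) are all assumptions made for illustration. Each URL is split into tokens, and each token is mapped to a numbered bucket, so the model only ever sees bucket counts rather than the URLs themselves.

```python
import hashlib
import re

NUM_BUCKETS = 1024  # assumed size of the hashed feature space


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed scheme)."""
    return [token for token in re.split(r"[^a-z0-9]+", url.lower()) if token]


def bucket(token, num_buckets=NUM_BUCKETS):
    """Map a token to a stable bucket index using SHA-256."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_features(urls, num_buckets=NUM_BUCKETS):
    """Return a sparse {bucket index: count} vector for a browsing history."""
    features = {}
    for url in urls:
        for token in tokenize(url):
            index = bucket(token, num_buckets)
            features[index] = features.get(index, 0) + 1
    return features


if __name__ == "__main__":
    # Hypothetical browsing history, for illustration only.
    history = ["http://example.com/shoes", "http://example.com/boots"]
    print(hashed_features(history))
```

A cryptographic hash from hashlib is used here instead of Python's built-in hash() because the built-in is randomized per process for strings, which would make the features unstable between training and serving. Distinct tokens can collide in the same bucket; that is the usual trade-off of this technique in exchange for a fixed-size feature space.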

Perlich also described several technical challenges that her team had to overcome. For example, they found they could not sample users, so they instead sampled events — that is, visits, impressions, and conversions. They also found that their linear models tended to suffer from overfitting in their top predictions — a problem they resolved by introducing a spline model.

The talk was deeply technical and yet very relevant and accessible to a broad audience of data scientists and engineers. There’s much more content than fits in this small summary, so I encourage you to watch the video! And you can watch more LinkedIn tech talks here.


Facing Prosopagnosia

March 18th, 2012 by Daniel Tunkelang

In the past few years, prosopagnosia, also known as “face blindness”, has received a fair amount of attention from researchers, as well as from the popular press.

My first exposure to the topic was Joshua Davis’s article entitled “Face Blind”, which appeared in Wired in November 2006. I was intrigued, especially since I’d long recognized that I had difficulty recognizing people by face. Perhaps the person who has done most to raise awareness of prosopagnosia is neurologist Oliver Sacks, who has prosopagnosia himself.

The Wired article inspired me to explore the subject. I found quizzes that tested for prosopagnosia. On one of these, where random guessing would have earned a score of 50%, I scored in the low 60s. My initial reaction was that my score wasn’t so bad; it was a hard test! Then my wife took the test and scored in the high 90s. That’s when I realized that I didn’t just have difficulty recognizing faces; I was almost incapable of it.

Faced with this realization, I had to decide whether to share it with my friends and family, let alone with my broader set of social and professional acquaintances. It was tempting not to — after all, why tell the world that I wasn’t “normal”?

But eventually I realized that it would be better for people around me to know than not know. The biggest downside to prosopagnosia isn’t the momentary embarrassment of not recognizing someone; it’s the constant fear of offending people who may think you don’t value them enough to recognize or acknowledge them.

Hence, I spread the word through my colleagues, ensuring that most of the people with whom I interacted regularly would find out without any big announcements. Some of my co-workers were surprised, since I do a pretty good job of recognizing people using non-facial clues — height, hair, clothing, where I run into them, etc. I have a great memory, and I have no problems with voice recognition. In other words, I have lots of work-arounds.

Fortunately, I work with a lot of people who understand machine learning — which is a great framework for understanding how I recognize people. I simply work with a different set of features than most people, but fortunately I achieve sufficient precision and recall to pass as “normal” most of the time.

Anyway, if you didn’t already know that I had prosopagnosia, welcome to the inner circle! And if you ever felt that I walked by you without recognizing or acknowledging you, please accept my belated apology.

Finally, if you’re curious to learn more about prosopagnosia, I encourage you to watch this 60 Minutes show about it.


Making Love with Data: Avinash Kaushik’s Strata 2012 Keynote

March 7th, 2012 by Daniel Tunkelang

Just watch the presentation, which stole the show at Strata 2012. The written word cannot do justice to Avinash’s passion and his extraordinary ability to communicate it.


Humans, Machines & the Dimensions of Microwork

March 4th, 2012 by Daniel Tunkelang

As per my previous post, I had a great time at the O’Reilly Strata Conference. It was a delight to participate in such a fantastic gathering of folks who work with big data. For those who missed my session, I’ve attached the slides that Claire and I presented. Some of the slides don’t make sense without the voice-over, but hopefully there is enough self-contained content in them to be useful.

The presentation was recorded and will be available as part of the Strata 2012 Video Compilation.


Strata 2012: Big Data is Bigger than Ever!

March 2nd, 2012 by Daniel Tunkelang

I spent the last three days at the O’Reilly Strata Conference, an assembly of over 2500 people focused on data science and its applications. While I’m wary of industry conferences from attending vendor-fests in my past life in the enterprise software world, Strata is an exceptionally good conference. The speakers were a who’s who of data science, including Lucene and Hadoop creator Doug Cutting, search user interface pioneer Marti Hearst, and Google chief economist Hal Varian. You can find the tweet stream for the conference at hash tag #strataconf.


I spent Tuesday in the Deep Data session, billed as a no-holds-barred program for data scientists. My two favorite talks:

  • Claudia Perlich, winner of three KDD cups, talked about using information to pick the right action and to influence people such that they behave in a way that is better for them, better for us, and possibly better for society in general.
  • Monica Rogati, my colleague at LinkedIn and the epitome of a data scientist, delivered a fantastic talk about machine learning models and training data in the real world, extending Peter Norvig’s point about the “unreasonable effectiveness of data” to observe that more data beats clever algorithms but better data beats more data.

But the most fun that day was the Oxford-style debate featuring Drew Conway, Pete Skomoroch, Mike Driscoll, DJ Patil, Amy Heineike, Pete Warden, and Toby Segaran. The question proposed was absurdly Manichean: if you had to hire your first data scientist and could only hire one, would you pick a domain expert or a machine learning expert? After the moderator suppressed some initial attempts to hedge (“both”, “it depends”, etc.), the debaters ripped into the question by taking extreme positions and defending them with gusto. It was a lot of fun, with enthusiastic audience participation and the debaters exploiting their inside knowledge of their opponents’ work histories. In the end, the machine learning side won by a small margin.

I then had the good fortune to grab dinner with Marti Hearst and Hal Varian at Xanh — a wonderful mix of great food and conversation.


The Wednesday morning keynote session offered some gems:

  • Cloudera CEO Mike Olson urged big data practitioners to focus on guns, drugs, and oil.
  • Doctor and data geek Ben Goldacre delivered a mesmerizing and disturbing talk about the suppression of inconvenient medical trial results and analytical tools to discover it.

But the person who stole the show was Google’s Avinash Kaushik, who talked about making love with data to find orgasm-inducing actions to change the world and make more money. Unfortunately this was the one talk that was not recorded, but you can read the summary on Avinash’s Google+ page.

As a speaker, I held “office hours” on Wednesday. It was supposed to be a 40-minute slot for conference attendees to come and ask me questions. But somehow those 40 minutes extended into three hours of conversation about everything from normalized KL divergence to interview problems, and then segued into a reception with specialty big-data cocktails. By the time I got back to my apartment, my voice, brain, and liver were spent.


I spent most of Thursday morning in the speaker lounge, recovering from the previous evening and putting the finishing touches on my presentation. But I couldn’t resist attending a two-part session on privacy. Indeed, this session was distinctive enough to merit its own hash tag: #strataprivacy.

The first part featured O’Reilly’s Alex Howard moderating Intelius Chief Privacy Officer Jim Adler and NYU PhD student Solon Barocas on a panel provocatively titled “If Data Wants to Be Free, is Privacy a Prison?” It was a great discussion, and I enjoyed the opportunity to offer my own provocative question through Twitter. Since the panelists were arguing that it was unethical to infer private facts from public data, I asked if they were trying to establish a new form of thoughtcrime.

The second panel, entitled “Pretty Simple Data Privacy”, featured Kaitlin Thaney from Digital Science, Betsy Masiello from Google, and John Wilbanks from the Kauffman Foundation for Entrepreneurship. Given that today was the first day of Google’s new privacy policy, there was no avoiding focus on the associated controversy. I did try to get Betsy to address my charge that Google doesn’t think users own their search history (cf. “Google vs. Bing: A Tweetle Beetle Battle Muddle”), but she said she was unfamiliar with the details of that event. I do wish that someone at Google with more familiarity would respond publicly.

Back to the speaker room after lunch, until my own talk with Samasource’s Claire Hunsaker on “Humans, Machines, and the Dimensions of Microwork”. I’ll post the slides (and there will be a video on the conference site), but the sound bite is that you need to keep crowdsourcing tasks simple, manage the trade-off between task value and difficulty, and watch out for systematic bias.

I wrapped up the conference by hearing William Gunn talk about how Mendeley is disrupting bibliometrics and perhaps the entire academic publishing and reputation ecosystem. I laud his ambition and wish him and Mendeley luck in this quest.


In summary, three days of great talks, conversations, and general enjoyment. My thanks to Strata organizers Edd Dumbill and Alistair Croll for putting together such an outstanding event and for giving me the opportunity to participate.


