Claudia Perlich: Tech Talk on Real-Time Bidding Optimization

Conventional wisdom holds that physical compliments are counter-productive as pick-up lines. Indeed, a dating site did some analysis showing a negative correlation between such compliments and the probability of a positive response.

But, as m6d Chief Scientist and 3-time KDD Cup winner Claudia Perlich explained in her recent talk at LinkedIn, we have to watch out for confounding variables. In the dating scenario above, beauty is a confounding variable: it determines both the probability of getting a positive response and the probability of a suitor offering physical compliments. Hence, we need to control for the recipient’s actual beauty, or it can appear that making compliments is a bad idea.
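
To make the confounding concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they come from neither the talk nor the dating site's analysis), and the two recipient groups, `high` and `low`, are a hypothetical stand-in for "actual beauty". The sketch compares the pooled effect of a compliment with the effect within each group.

```python
# Synthetic counts, invented for illustration only:
# (recipient_group, complimented) -> (messages_sent, positive_replies)
COUNTS = {
    ("high", True): (800, 80),
    ("high", False): (200, 16),
    ("low", True): (200, 80),
    ("low", False): (800, 304),
}


def response_rate(complimented, group=None):
    """Positive-reply rate for messages with or without a compliment.

    If group is None, pool all recipients; otherwise restrict to one group.
    """
    messages = replies = 0
    for (g, c), (m, r) in COUNTS.items():
        if c == complimented and (group is None or g == group):
            messages += m
            replies += r
    return replies / messages


def compliment_effect(group=None):
    """Difference in reply rate: complimented minus not complimented."""
    return response_rate(True, group) - response_rate(False, group)


if __name__ == "__main__":
    print(f"pooled: {compliment_effect():+.2f}")
    for group in ("high", "low"):
        print(f"{group}: {compliment_effect(group):+.2f}")
```

With these made-up numbers the pooled difference is negative (16% versus 32%), yet within each group the complimented messages do slightly better (10% versus 8%, and 40% versus 38%). The pooled result is driven by who receives compliments, not by the compliments themselves.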

Perlich does not work on online dating, but rather in the data-driven world of online advertising. Specifically, she and her team work on real-time bidding optimization.

Perlich described a variety of design choices that have general applicability to data science problems. For example, her team used hashed tokens of previously visited URLs, rather than the URLs themselves, as features for their machine learning models. They avoided the use of personally identifying information (PII) or even demographic information about their users. These decisions were counterintuitive — typically, more data leads to better results. But Perlich found that these restrictions did not sacrifice accuracy, and had the further benefit of keeping their approach general rather than application- or customer-specific.
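
The hashed-token idea can be sketched as follows. This is an illustration under stated assumptions, not a description of m6d's system: the summary above does not say which hash function or how many buckets were used, so SHA-256, the bucket count `NUM_BUCKETS`, and the function names are all hypothetical choices.

```python
import hashlib

# Assumed bucket count; the source does not specify one.
NUM_BUCKETS = 2 ** 20


def url_token(url, num_buckets=NUM_BUCKETS):
    """Map a URL to an anonymous integer token via a stable hash."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then fold into buckets.
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_features(visited_urls, num_buckets=NUM_BUCKETS):
    """Sparse binary features: the sorted set of tokens for the visited URLs."""
    return sorted({url_token(u, num_buckets) for u in visited_urls})


if __name__ == "__main__":
    history = ["http://example.com/shoes", "http://example.org/news"]
    print(hashed_features(history))
```

The model only ever sees bucket indices, so the feature pipeline stays the same regardless of which sites or customers are involved, which fits the generality benefit described above.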

Perlich also described several technical challenges that her team had to overcome. For example, they found they could not sample users, so they instead sampled events — that is, visits, impressions, and conversions. They also found that their linear models tended to suffer from overfitting in their top predictions — a problem they resolved by introducing a spline model.
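
Sampling events rather than users can be sketched as below. The event fields, the sampling rate, and the seed are assumptions made for illustration; the summary above does not describe how the team actually drew its samples.

```python
import random


def sample_events(events, rate, seed=0):
    """Keep each event independently with probability `rate`.

    Sampling happens per event (visit, impression, conversion),
    so a given user may be partially represented in the sample.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [event for event in events if rng.random() < rate]


if __name__ == "__main__":
    # Hypothetical event log: user ids and event types are made up.
    events = [
        {"user": "u1", "type": "visit"},
        {"user": "u1", "type": "impression"},
        {"user": "u2", "type": "impression"},
        {"user": "u2", "type": "conversion"},
    ]
    print(sample_events(events, rate=0.5))
```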

The talk was deeply technical and yet very relevant and accessible to a broad audience of data scientists and engineers. There’s much more content than fits in this small summary, so I encourage you to watch the video! And you can watch more LinkedIn tech talks here.


Facing Prosopagnosia

In the past few years, prosopagnosia, also known as “face blindness”, has received a fair amount of attention from researchers, as well as from the popular press.

My first exposure to the topic was Joshua Davis’s article entitled “Face Blind”, which appeared in Wired in November 2006. I was intrigued, especially since I’d long recognized that I had difficulty recognizing people by face. Perhaps the person who has done most to raise awareness of prosopagnosia is neurologist Oliver Sacks, who has prosopagnosia himself.

The Wired article inspired me to explore the subject. I searched for and found quizzes that tested for prosopagnosia. On one of these, where random guessing would have earned a score of 50%, I scored in the low 60s. My initial reaction was that my score wasn’t so bad, since it was a hard test! Then my wife took the test and scored in the high 90s. That’s when I realized that I didn’t just have difficulty recognizing faces: I was almost incapable of it.

Faced with this realization, I had to decide whether to share it with my friends and family, let alone with my broader set of social and professional acquaintances. It was tempting not to — after all, why tell the world that I wasn’t “normal”?

But eventually I realized that it would be better for people around me to know than not know. The biggest downside to prosopagnosia isn’t the momentary embarrassment of not recognizing someone; it’s the constant fear of offending people who may think you don’t value them enough to recognize or acknowledge them.

Hence, I spread the word through my colleagues, ensuring that most of the people with whom I interacted regularly would find out without any big announcements. Some of my co-workers were surprised, since I do a pretty good job of recognizing people using non-facial clues — height, hair, clothing, where I run into them, etc. I have a great memory, and I have no problems with voice recognition. In other words, I have lots of work-arounds.

Fortunately, I work with a lot of people who understand machine learning — which is a great framework for understanding how I recognize people. I simply work with a different set of features than most people, but fortunately I achieve sufficient precision and recall to pass as “normal” most of the time.

Anyway, if you didn’t already know that I had prosopagnosia, welcome to the inner circle! And if you ever felt that I walked by you without recognizing or acknowledging you, please accept my belated apology.

Finally, if you’re curious to learn more about prosopagnosia, I encourage you to watch this 60 Minutes show about it.


Making Love with Data: Avinash Kaushik’s Strata 2012 Keynote

Just watch the presentation, which stole the show at Strata 2012. The written word cannot do justice to Avinash’s passion and his extraordinary ability to communicate it.


Humans, Machines & the Dimensions of Microwork

As per my previous post, I had a great time at the O’Reilly Strata Conference. It was a delight to participate in such a fantastic gathering of folks who work with big data. For those who missed my session, I’ve attached the slides that Claire and I presented. Some of the slides don’t make sense without the voice-over, but hopefully there is enough self-contained content in them to be useful.

The presentation was recorded and will be available as part of the Strata 2012 Video Compilation.


Strata 2012: Big Data is Bigger than Ever!

I spent the last three days at the O’Reilly Strata Conference, an assembly of over 2500 people focused on data science and its applications. While I’m wary of industry conferences after attending vendor-fests in my past life in the enterprise software world, Strata is an exceptionally good conference. The speakers were a who’s who of data science, including Lucene and Hadoop creator Doug Cutting, search user interface pioneer Marti Hearst, and Google chief economist Hal Varian. You can find the tweet stream for the conference at hash tag #strataconf.


I spent Tuesday in the Deep Data session, billed as a no-holds-barred program for data scientists. My two favorite talks:

  • Claudia Perlich, winner of three KDD cups, talked about using information to pick the right action and to influence people such that they behave in a way that is better for them, better for us, and possibly better for society in general.
  • Monica Rogati, my colleague at LinkedIn and the epitome of a data scientist, delivered a fantastic talk about machine learning models and training data in the real world, extending Peter Norvig’s point about the “unreasonable effectiveness of data” to observe that more data beats clever algorithms but better data beats more data.

But the most fun that day was the Oxford-style debate featuring Drew Conway, Pete Skomoroch, Mike Driscoll, DJ Patil, Amy Heineike, Pete Warden, and Toby Segaran. The question proposed was absurdly Manichean: if you had to hire your first data scientist and could only hire one, would you pick a domain expert or a machine learning expert? After the moderator suppressed some initial attempts to hedge (“both”, “it depends”, etc.), the debaters ripped into the question by taking extreme positions and defending them with gusto. It was a lot of fun, with enthusiastic audience participation and the debaters exploiting their inside knowledge of their opponents’ work histories. In the end, the machine learning side won by a small margin.

I then had the good fortune to grab dinner with Marti Hearst and Hal Varian at Xanh — a wonderful mix of great food and conversation.


The Wednesday morning keynote session offered some gems:

  • Cloudera CEO Mike Olson urged big data practitioners to focus on guns, drugs, and oil.
  • Doctor and data geek Ben Goldacre delivered a mesmerizing and disturbing talk about the suppression of inconvenient medical trial results and analytical tools to discover it.

But the person who stole the show was Google’s Avinash Kaushik, who talked about making love with data to find orgasm-inducing actions to change the world and make more money. Unfortunately this was the one talk that was not recorded, but you can read the summary on Avinash’s Google+ page.

As a speaker, I held “office hours” on Wednesday. It was supposed to be a 40-minute slot for conference attendees to come and ask me questions. But somehow those 40 minutes extended into three hours of conversation about everything from normalized KL divergence to interview problems, and segued into a reception with specialty big-data cocktails. By the time I got back to my apartment, my voice, brain, and liver were spent.


I spent most of Thursday morning in the speaker lounge, recovering from the previous evening and putting the finishing touches on my presentation. But I couldn’t resist attending a two-part session on privacy. Indeed, this session was distinctive enough to merit its own hash tag: #strataprivacy.

The first part featured O’Reilly’s Alex Howard moderating Intelius Chief Privacy Officer Jim Adler and NYU PhD student Solon Barocas on a panel provocatively titled “If Data Wants to Be Free, is Privacy a Prison?” It was a great discussion, and I enjoyed the opportunity to offer my own provocative question through Twitter. Since the panelists were arguing that it was unethical to infer private facts from public data, I asked if they were trying to establish a new form of thoughtcrime.

The second panel, entitled “Pretty Simple Data Privacy”, featured Kaitlin Thaney from Digital Science, Betsy Masiello from Google, and John Wilbanks from the Kauffman Foundation for Entrepreneurship. Given that today was the first day of Google’s new privacy policy, there was no avoiding focus on the associated controversy. I did try to get Betsy to address my charge that Google doesn’t think users own their search history (cf. “Google vs. Bing: A Tweetle Beetle Battle Muddle”), but she said she was unfamiliar with the details of that event. I do wish that someone at Google with more familiarity would respond publicly.

Back to the speaker room after lunch, until my own talk with Samasource’s Claire Hunsaker on “Humans, Machines, and the Dimensions of Microwork”. I’ll post the slides (and there will be a video on the conference site), but the sound bite is that you need to keep crowdsourcing tasks simple, manage the trade-off between task value and difficulty, and watch out for systematic bias.

I wrapped up the conference by hearing William Gunn talk about how Mendeley is disrupting bibliometrics and perhaps the entire academic publishing and reputation ecosystem. I laud his ambition and wish him and Mendeley luck in this quest.

In summary, three days of great talks, conversations, and general enjoyment. My thanks to Strata organizers Edd Dumbill and Alistair Croll for putting together such an outstanding event and for giving me the opportunity to participate.


Enjoying Seattle’s Best: UW, WSDM, and SSS

My excursion to Seattle was delightful, and I thought I’d share some details with readers.

I spent most of Friday at the University of Washington, meeting with graduating PhD students. I’ve always known that UW is a top school, but I was particularly impressed with this batch. I was pleasantly surprised to see folks like Nodira Khoussainova and Kayur Patel working to bring together the often disparate worlds of databases, machine learning, and HCI in order to make people more effective at solving “big data” problems. I realize that I’m aiding and abetting other employers with whom I compete for top talent, but it would be wrong not to encourage everyone to find worthy challenges for these budding scientists.

I then went to the Space Needle to meet up with the WSDM 2012 crowd. Jaime Teevan and Eytan Adar outdid themselves, providing a great setting for folks to mingle, imbibe, and enjoy a spectacular view of Seattle.

Saturday I attended the “social” day of the WSDM conference.

Andrew Tomkins chaired the first morning session, which included Hila Becker’s latest work on identifying event content in social media and Georgios Zervas presenting the work on analyzing the reputational effects of Groupon that triggered quite a controversy last September. After the break came the spotlight section: a great sequence of 5-minute presentations in which researchers both summarized their contributions and lured attendees to visit their posters. I hope that more conferences adopt this format, which optimizes for communicating ideas and discourages long-winded expositions.

I then had the pleasure of having lunch with Jan Pedersen and friends at Blueacre Seafood, with great food and even better conversation. We both noted the irony that, even though we are practically neighbors, we only seem to meet up at events like these.

I made it back to the conference in time to hear the two best-paper awardees: Adam Sadilek on “Finding Your Friends and Following Them to Where You Are” and Yaron Singer on “How to Win Friends and Influence People, Truthfully: Influence Maximization Mechanisms for Social Networks”. I highly recommend both papers, especially if you are interested in either social network prediction or the underlying economics of influence.

Another coffee break, and then the keynote: Hilary Mason on “The Secret Life of Social Links”. Hilary is a great speaker — I first met her when I invited her to the Workshop on Search and Social Media (SSM 2010) at WSDM 2010. She didn’t disappoint, and it’s great to see practitioners like her crossing the aisle to engage the academic community. Not to mention infusing their slides with lolcats.

The conference wrapped up at 5pm, but then we bussed over to Microsoft Research for the Social Search Social. That was a fun event designed to cross-pollinate the WSDM and CSCW communities. Meredith Ringel Morris, Gene Golovchinksy, Jeremy Pickens, Madhu Reddy, Chirag Shah, and Michael Twidale put together a great program of 45-second madness presentations and “speed-dating” to pair up WSDM and CSCW attendees. It was far too short, but a lot of fun. And some of us kept up the social spirit by grabbing dinner afterward at Blue C Sushi.

To everyone I met in the last couple of days: thanks for the great company and conversation! Keep sharing ideas and making data and science social.


Social Wisdom in Seattle


First, I wanted to give readers a heads up that I’ll be in Seattle this Friday and Saturday. I’ll spend Friday afternoon at the University of Washington, meeting with some of their outstanding computer science doctoral students. My schedule filled up with unexpected haste! But if you’re on campus and urgently want to meet, let me know and I’ll see what I can do.

Saturday I’ll be attending the social track of WSDM 2012, the premier international ACM conference covering research in the areas of search and data mining on the Web. I’m excited about the program, as well as the opportunity to catch up with friends and make new ones. Back in 2010, I had the pleasure of co-organizing the Workshop on Search and Social Media (SSM 2010) and being the official ACM blogger for WSDM 2010. You can read my posts here.

Then, on Saturday evening, I’ll be heading to Microsoft Research to attend the Social Search Social (SSS 2012). Hats off to organizers Meredith Ringel Morris, Gene Golovchinksy, Jeremy Pickens, Madhu Reddy, Chirag Shah, and Michael Twidale for creating what looks to be a fun (and very social!) event. I’m especially looking forward to the 45-second “madness” presentations (in which I’m participating) and the “speed dating” to help cross-pollinate the WSDM and CSCW communities.

Hope to see some of you there, and of course will share what I learn here at The Noisy Channel. I also encourage you to follow the tweet streams for #wsdm2012 and #sss2012.


LinkedIn @ CMU

As regular readers know, I have a deep affection for Carnegie Mellon University, where I did my graduate work. I’m happy to announce that two of my colleagues (both fellow CMU PhDs) will be giving talks at CMU in a couple of weeks, and I hope that some of you will have the opportunity to attend.

On Tuesday, February 7th, Abhimanyu Lad will be hosting an information session at 6pm in Scaife Hall, Room 214. Abhi is a rock star on our data science team, and he’s been working on the next generation of LinkedIn search. You can get a taste of his work from his recent HCIR 2011 presentation, “Is it Time to Abandon Abandonment?”. Abhi will talk about a variety of technical challenges that data scientists and engineers are working on at LinkedIn.

On Thursday, February 9th, Paul Ogilvie will talk about “Where Big Data Meets Real-Time: Efficiently Indexing and Ranking News using Activity” at 3:30pm in GHC 6115. Paul is responsible for article relevance infrastructure and algorithms on LinkedIn Today, a great example of social navigation — not to mention a great success for users. Paul will talk about the technical details that make LinkedIn Today possible, including a novel use of inverted lists to efficiently index and support real-time updates to document representations.
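
For readers unfamiliar with inverted lists, here is a generic sketch of an inverted index that supports in-place updates to a document's representation. It is not a description of the LinkedIn Today implementation, whose details are the subject of Paul's talk; the class name, method names, and sample terms are hypothetical.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of document ids, with in-place updates."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of docs containing it
        self._doc_terms = {}               # doc id -> terms currently indexed

    def update(self, doc_id, terms):
        """Insert a document or replace its representation.

        Only the postings lists for terms that were added or removed are touched.
        """
        new_terms = set(terms)
        old_terms = self._doc_terms.get(doc_id, set())
        for term in old_terms - new_terms:
            self._postings[term].discard(doc_id)
            if not self._postings[term]:
                del self._postings[term]  # drop empty postings lists
        for term in new_terms - old_terms:
            self._postings[term].add(doc_id)
        self._doc_terms[doc_id] = new_terms

    def search(self, term):
        """Return the sorted ids of documents currently indexed under `term`."""
        return sorted(self._postings.get(term, ()))


if __name__ == "__main__":
    index = InvertedIndex()
    index.update("article-1", ["hadoop", "search"])
    index.update("article-2", ["search"])
    # New activity changes how article-1 is represented.
    index.update("article-1", ["hadoop", "strata"])
    print(index.search("search"))
```

Because each document remembers which terms it is indexed under, an update costs work proportional to the number of changed terms rather than to the size of the index.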

And, even if you can’t make it to the talks, I encourage you to visit the LinkedIn booth at the EOC fair on Wednesday, February 8th. We’re looking for great software engineers and data scientists, and we’re especially interested in interns.

I hope that CMU students and faculty will take the time to meet Abhi, Paul, and their colleagues when they visit in a couple of weeks.

Thoughts about Job Performance

This is the season of annual reviews, at least at LinkedIn. Performance reviews can be daunting for both employees and managers — at least everywhere that I’ve worked. Not only are we as human beings terrible at delivering feedback, but we also receive bad advice as managers.

For example, many of us have learned the “feedback sandwich” method, a technique that doesn’t hold up to scientific validation. Watch the video below to see what Stanford professor Clifford Nass has learned from his experiments (see my review of his book here).

Here is what I suggest as a format for performance feedback, whether for writing your own self-assessment or delivering feedback to reports or peers on their performance:

1) What is your day job?

Everyone needs a day job — a mission with a crisp set of responsibilities and deliverables. If you don’t know what you’re responsible for delivering, you can’t assess how well you are delivering it. You should know and articulate your top priorities — at most three, with a clear #1. For further reading, I suggest the Quora discussion on OKRs (Objectives and Key Results), an idea pioneered by Intel and now used at top technology companies (including LinkedIn and Google).

2) How are you performing in your day job?

Hopefully you make more contributions than you can count. But make sure that your day job comes first. If you find that a disproportionate fraction of your contribution is outside your day job, then consider changing your day job. Your top priority is to meet (hopefully exceed!) the expectations for your day job — expectations you should set early and revisit regularly. Performance reviews are a great opportunity to brag.

3) What do you do beyond your day job?

Your day job should be strongly aligned with your team and company’s top priorities. But great employees contribute beyond their day job towards other team and company priorities. For example, talent is our top priority at LinkedIn, so we particularly value contributions to hiring and growing our talent. And, at least in every environment I’ve experienced, the best employees are those who help make others successful.

4) How do you want to grow?

This is really a two-part question. First, what do you want to do next? That could mean getting better at your day job, evolving your current responsibilities, or taking on a different role. Second, what are you doing to get there? You are ultimately responsible for your own professional development. But one of your manager’s top responsibilities is to help you identify and advance along the path that is best for you. And performance reviews are a great opportunity to make you think about the future.

Regardless of how your company manages performance, these are the key questions you should think about. Performance feedback is a great opportunity to focus on professional development, both your own and that of the people you work with every day. Make the most of it!


Are You Hitched?

Let me preface this post by saying that this is my personal blog, and that my opinions here are not necessarily those of my employer.

With that out of the way, I love the premise of a dating site for professionals based on LinkedIn. I won’t confirm or deny the number of my colleagues who have thought about building a dating site based on our data, but it’s great to see someone using our APIs to do so. And the marketing video, while not exactly politically correct, is brilliant.

Yet another reason to work as a data scientist at LinkedIn!