Don’t say I didn’t tell you so. Declan McCullagh reports at CNET that privacy groups are expressing concern about Google Flu Trends:
The Electronic Privacy Information Center and Patient Privacy Rights sent a letter this week to Google CEO Eric Schmidt saying if the records are “disclosed and linked to a particular user, there could be adverse consequences for education, employment, insurance, and even travel.” It asks for more disclosure about how Google Flu Trends protects privacy.
I agree with Declan that
If you think that knowing that Alaska’s “influenza-like illness” number for the week of November 9 is 2.035 and California’s number is 1.384 is somehow worrisome and can identify you personally, it’s time to break out your tinfoil hat.
But, as the article gets to, the deeper concern is the one expressed by the Electronic Privacy Information Center:
There are no clear legal or technological privacy safeguards that prevent the disclosure of individual search histories. Without such privacy safeguards Google Flu Trends could be used to reidentify users who search for medical information. Such user-specific investigations could be compelled, even over Google’s objection, by court order or presidential authority.
I’m not paranoid, and I actually think that both privacy advocates and web search companies have often exaggerated privacy issues, especially since the AOL fiasco a couple of years ago. But EPIC is raising a legitimate concern, and I don’t think Google is providing very reassuring answers.
Specifically, web search companies are very protective of their log data in the name of privacy, much to the chagrin of researchers. And yet those same companies feel that privacy advocates exaggerate their concerns about the data being collected in the first place. Google / Yahoo / Microsoft: you can’t have it both ways!
A final point: Declan comments that “If users don’t like that, nobody’s forcing them to use Google.” I chalk that attitude up to his libertarianism rather than to any partiality he may have towards his wife’s employer. I have a libertarian streak myself, so I’m sympathetic. But, just as a practical matter, this is the sort of behavior that historically attracts regulation. Google and its rivals would do well to acknowledge the legitimacy of their critics’ concerns and regulate themselves first.
3 replies on “Google Flu Trends: The Privacy Backlash Begins”
There are ways to both meet consumers’ need for privacy and conduct research.
For background: over 87% of the population can be identified from just three pieces of information: zip code, date of birth, and gender. But there are ways to apply privacy-protecting algorithms to large population sets for research while still protecting individuals’ information.
Latanya Sweeney’s classic study, in which she was able to identify Governor William Weld by matching masked medical data from the Group Insurance Commission of Massachusetts with publicly available voter lists, is a foreshadowing of what can happen.
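To make the linkage idea concrete, here is a minimal sketch of matching "masked" records against a public list on zip code, date of birth, and gender. It is not Sweeney's actual procedure or data: the function name `link_records`, the field names, and every record below are made up purely for illustration.

```python
from collections import defaultdict

# Assumed quasi-identifier fields; the names are hypothetical.
QUASI_IDENTIFIERS = ("zip", "dob", "gender")

def link_records(masked_records, public_records, keys=QUASI_IDENTIFIERS):
    """Pair each masked record with a public record when exactly one
    public record shares all of its quasi-identifier values."""
    index = defaultdict(list)
    for person in public_records:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for record in masked_records:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((record, candidates[0]))
    return matches

# Toy, invented data: names were removed from the medical records,
# but the quasi-identifiers were left in place.
medical = [
    {"zip": "00001", "dob": "1950-01-01", "gender": "M", "diagnosis": "flu"},
    {"zip": "00002", "dob": "1960-02-02", "gender": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "Person A", "zip": "00001", "dob": "1950-01-01", "gender": "M"},
    {"name": "Person B", "zip": "00002", "dob": "1960-02-02", "gender": "F"},
    {"name": "Person C", "zip": "00002", "dob": "1960-02-02", "gender": "F"},
]

reidentified = link_records(medical, voters)
```

In this toy example only the first medical record is re-identified, because the second one matches two voters and so stays ambiguous. The more unusual a combination of zip code, birth date, and gender is, the less protection removing the name provides.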
There has been some really interesting research on “Privacy Preserving Data Analysis,” based on the work of Microsoft Research (MSR) researchers Cynthia Dwork and Frank McSherry (as well as S. Chawla, K. Talwar, A. Blum, K. Nissim, and A. Smith).
The basic premise of their research is that adding noise to the data (e.g., shifting the zip code by one, or the age by one) can protect the individuals underneath the aggregate data while still producing large-scale research results.
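As a rough illustration of the noise idea, here is a minimal sketch of one widely used variant, the Laplace mechanism applied to a count query. Note the swap: it adds noise to the aggregate answer rather than shifting each record's zip code or age, and it is not necessarily the exact algorithm described in the research above. The names `laplace_noise` and `noisy_count`, the `epsilon` value, and the toy search log are all assumptions for illustration.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from a Laplace(0, scale) distribution.
    The difference of two independent exponentials with rate 1/scale
    is Laplace-distributed with that scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng=random):
    """Return the number of records satisfying predicate, plus noise.
    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon (smaller epsilon = more noise)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Toy, invented search log: 100 of 1000 queries mention flu.
searches = ["flu symptoms"] * 100 + ["weather"] * 900
rng = random.Random(0)  # seeded only to make the sketch repeatable
released = noisy_count(searches, lambda q: "flu" in q, epsilon=0.5, rng=rng)
```

Any single released number is off by a small random amount, which masks whether one particular person's query is in the log, while the noise averages out over large populations and leaves the aggregate trend usable.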
For a detailed overview of the actual algorithm, check out Denny Lee’s blog at Microsoft.
Sherry, thanks for the comments. I’m familiar with some of the work on data obfuscation, as well as with attacks on it (e.g., the work on de-anonymizing the Netflix Prize data set that a friend of mine co-authored).
I think it would be a good idea for organizations that collect sensitive data but only need to leverage its aggregate properties to at least try to protect individuals through random perturbation. It may still be an imperfect solution, but it would at least be a good-faith step in the right direction.