An introduction to information entropy, open data, and the possible end of crowdsourcing.
Tim O’Reilly and ZIP Codes
From his Strata Conference on Data Science, Tim O’Reilly tweeted his dismay at the recent California court decision that the ZIP code is now to be classified as “personally identifiable information”. “No more demographics”, he lamented. A little later he retweeted a response that “apparently 87% of US residents can be uniquely identified by zip+DOB+gender: bit.ly/qysMqs” and later followed up with “Here’s a reference for the claim that zip code, gender and DOB uniquely identify 87% of individuals: http://www.citeulike.org/user/burd/article/5822736 via @crdant”.
These tweets are odd and disturbing. The zip/DOB/gender finding is a basic one in studies of privacy, published years ago by Latanya Sweeney of Carnegie Mellon University. I gave a talk at work on privacy a year ago, and this was one of the first references I came across. Tim O’Reilly has been pushing an agenda of Open Data, particularly Open Government Data, for the last couple of years, and yet it looks as if he isn’t aware of the basic privacy issues around such data. Can that really be the case?
If it is, then here, to help Tim along, are some notes from my talk as a kind of introduction to data privacy, or at least to data anonymization and re-identification. A great resource on some of these issues from a legal perspective is Paul Ohm’s 2009 paper “Broken Promises of Privacy: Responding to the Surprising Failures of Anonymization” (PDF), University of Colorado Law Legal Studies Research Paper No. 09-12. It’s long, but it’s so well written that it’s an easy read. Many of these notes originated with this paper, in one form or another.
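To make the zip/DOB/gender finding concrete, here is a minimal Python sketch of how one might measure what fraction of records are unique on a given combination of fields. The records are made up for illustration, and the function name unique_fraction is my own; this is not Sweeney's method or data, just the counting idea behind it.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Made-up toy records, purely illustrative (not real people or real statistics).
people = [
    {"zip": "94110", "dob": "1975-03-02", "gender": "F"},
    {"zip": "94110", "dob": "1975-03-02", "gender": "M"},
    {"zip": "94110", "dob": "1980-07-19", "gender": "F"},
    {"zip": "94117", "dob": "1980-07-19", "gender": "F"},
    {"zip": "94117", "dob": "1962-11-30", "gender": "M"},
    {"zip": "94117", "dob": "1962-11-30", "gender": "M"},
]

# No one is unique by ZIP code alone, but most are once DOB and gender are added.
zip_only = unique_fraction(people, ["zip"])
all_three = unique_fraction(people, ["zip", "dob", "gender"])
print(zip_only, all_three)
```

The point of the toy example is that each field on its own looks harmless; it is the combination that singles people out.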
How Privacy Broke Crowdsourcing
A few years ago Netflix ran its highly successful and widely publicised crowdsourced prize competition, in which it released a data set of users and their movie ratings and let competitors download it and search for patterns. The data consisted of a customer ID (faked), a movie, the customer’s rating of the movie, and the date of the rating.
In the FAQ for the competition, Netflix said this:
Q. Is there any customer information in the dataset that should be kept private?
A. No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy… Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation.
This certainly looked reasonable enough, but Arvind Narayanan and Vitaly Shmatikov of the University of Texas had other ideas.1 To test the claim that the data was perturbed, they first asked acquaintances for their ratings. They found that only a small number of the ratings were perturbed at all, which makes sense because perturbing the data gets in the way of its usefulness.
In the Netflix data set, different users have distinct sets of movies that they have watched. The data set is sparse (most people have not seen most movies), and there are many different movies available, so individual tastes and viewing histories leave a clear fingerprint. That is, if you knew what movies someone watched, you could pick them out of the data set because no one else would have seen the same combination.
A closer look showed that knowing 8 of a person’s ratings (of which 2 may be completely wrong), with dates that may be off by up to 14 days, is enough to uniquely identify 99% of the records in the Netflix data set. For 68% of records, two ratings and dates are sufficient. Other combinations of information are also sufficient to identify users, e.g. 84% can be identified by 6 of 8 movies outside the top 500.
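The kind of error-tolerant matching described here can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the researchers' actual algorithm: the toy data set, the function names matches and candidates, and the rule "rating must be equal, date within 14 days" are all hypothetical.

```python
from datetime import date

def matches(aux, record, max_days=14):
    """Count auxiliary (movie, rating, day) observations consistent with a record."""
    hits = 0
    for movie, rating, day in aux:
        if movie in record:
            rec_rating, rec_day = record[movie]
            if rec_rating == rating and abs((rec_day - day).days) <= max_days:
                hits += 1
    return hits

def candidates(aux, dataset, allowed_wrong=2):
    """Return user IDs that agree with all but `allowed_wrong` observations."""
    needed = len(aux) - allowed_wrong
    return [uid for uid, rec in dataset.items() if matches(aux, rec) >= needed]

# Toy "anonymized" data set: user ID -> {movie: (rating, date of rating)}.
dataset = {
    "u1": {"A": (5, date(2005, 1, 10)), "B": (3, date(2005, 2, 1)),
           "C": (4, date(2005, 3, 15)), "D": (2, date(2005, 4, 2))},
    "u2": {"A": (5, date(2005, 1, 12)), "B": (1, date(2005, 6, 1)),
           "E": (4, date(2005, 3, 15))},
    "u3": {"C": (4, date(2005, 3, 20)), "D": (2, date(2005, 9, 9))},
}

# What an acquaintance might remember: approximate dates, one rating wrong.
aux = [("A", 5, date(2005, 1, 15)), ("B", 3, date(2005, 2, 10)),
       ("C", 4, date(2005, 3, 10)), ("D", 1, date(2005, 4, 1))]

found = candidates(aux, dataset, allowed_wrong=1)
print(found)
```

Even with fuzzy dates and one wrong rating, only one record in the toy data survives the filter, which is the fingerprint effect in miniature.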
But of course there is no personally identifiable information in the data set. So is this a privacy issue? It is when you have another data set to look at. The researchers took a sample of 50 IMDB users. The IMDB data is noisy – there is no ranking, for example. Still, they identified two users whose matching Netflix records were 28 and 15 standard deviations away from the next-best candidates: one from ratings, the other from dates.
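One way to read “standard deviations away from the next best” is as the gap between the best and second-best match scores, divided by the standard deviation of all the scores. The Python sketch below shows that calculation on made-up scores; the function name eccentricity and the use of the population standard deviation are my assumptions, not details taken from the study.

```python
from statistics import pstdev

def eccentricity(scores):
    """Gap between the top two scores, in units of the scores' standard deviation.

    `scores` maps candidate IDs to match scores and needs at least two entries.
    """
    ranked = sorted(scores.values(), reverse=True)
    spread = pstdev(ranked)
    if spread == 0:
        return 0.0  # all candidates score the same, so nobody stands out
    return (ranked[0] - ranked[1]) / spread

# Made-up match scores for three candidate records.
scores = {"u1": 3, "u2": 1, "u3": 1}
gap = eccentricity(scores)
print(round(gap, 2))
```

A large value means the best candidate stands far apart from the rest; a value near zero means the match is no better than chance, and no identification should be claimed.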
So despite Netflix’s best efforts, the data set included enough information to identify some individuals. Partly because of this, a planned follow-up competition was scrapped, and the whole enterprise of crowdsourcing recommender algorithms was dealt a possibly terminal blow.
What’s this all about?
Just to be clear, this set of notes is not about the following things:
- Encryption
- Restricting access to data
- Lost USB keys and CDs
It is about these:
- Deliberately released data that turns out to infringe on privacy
- HIPAA, EU Data Directive, corporate rules for handling customer data
- Advertising and ISPs
- Gov 2.0, data.gov, and “openness”
It’s about claims such as: “Attorneys on Monday accused Google of intentionally divulging millions of users’ search queries to third parties in violation of federal law and its own terms of service” (October 26, 2010)
“MySpace and some popular applications on the social-networking site have been transmitting data to outside advertising companies that could be used to identify users, a Wall Street Journal investigation has found” (October 23, 2010)
“Facebook users may inadvertently reveal their sexual preference to advertisers in an apparent wrinkle in the social-networking site’s advertising system, researchers have found” (October 22, 2010)
(These claims are a year old, found in the week before I gave the talk. I’m sure there are many more.)
The Facebook case was one in which advertisers (for a nursing program, I believe) asked to target their ads specifically at females and at men interested in other men. But unlike, for example, an ad about a gay bar where the target demographic is blatantly obvious, the user reading the ad text would have no idea that it had been targeted solely at a very specific demographic, and that by clicking it he would reveal to the advertiser both his sexual preference and a unique identifier (cookie, IP address, or e-mail address if he signs up on the advertiser’s site). “Furthermore (the researchers wrote) such deceptive ads are not uncommon; indeed exactly half of the 66 ads shown exclusively to gay men (more than 50 times) during our experiment did not mention ‘gay’ anywhere in the ad text.”
