I recently had the privilege of attending a talk where Paul Ohm presented the main ideas behind his latest research paper.
I found his reporting on re-identifying users from supposedly non-personally identifiable information fascinating:
- 87.1% of Americans can be uniquely identified by their 5-digit ZIP code combined with their sex and full date of birth (day, month, and year).
- 80% of anonymized Netflix users could be uniquely identified by 3 movie reviews (movie, date, review value).
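
To get a feel for how statistics like these are measured, here is a minimal Python sketch of my own (not taken from the paper or the talk). It counts how many records in a dataset carry a combination of attributes that nobody else shares. The function name `unique_fraction`, the field names, and the five toy records are all made up for illustration.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose combination of the given
    attributes appears exactly once in the dataset."""
    if not records:
        return 0.0
    # Build one key per record from the chosen attributes.
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical toy records, invented for this example (not real data).
people = [
    {"zip": "80302", "birth_date": "1975-03-14"},
    {"zip": "80302", "birth_date": "1982-11-02"},
    {"zip": "80302", "birth_date": "1982-11-02"},
    {"zip": "80303", "birth_date": "1982-11-02"},
    {"zip": "80303", "birth_date": "1990-07-21"},
]

print(unique_fraction(people, ["zip"]))                # 0.0
print(unique_fraction(people, ["birth_date"]))         # 0.4
print(unique_fraction(people, ["zip", "birth_date"]))  # 0.6
```

In this toy data neither attribute singles out many people on its own, but the combination does, and that is the pattern behind the numbers above.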
His take-home message?
Data can be either useful or perfectly anonymous, but never both.
The majority of laws and contracts dealing with personal information draw a line between “personally identifiable information” and “non-personally identifiable information” (aka aggregate, anonymous data).
But if you can use non-personally identifiable information to derive personally identifiable information, then the two categories collapse into one.
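
A rough sketch of what that derivation can look like in practice, assuming an attacker has some public list that contains names alongside the same "harmless" attributes. Everything here is hypothetical: the function `reidentify`, the field names, and the people in the toy data are invented for illustration and do not come from the talk.

```python
def reidentify(anonymous_rows, public_rows, keys):
    """Map each 'anonymous' record_id to a name when its key combination
    matches exactly one row in the public list."""
    # Index the public list by the shared attributes.
    index = {}
    for row in public_rows:
        key = tuple(row[k] for k in keys)
        index.setdefault(key, []).append(row["name"])

    matches = {}
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            matches[row["record_id"]] = names[0]
    return matches


# Hypothetical "anonymized" dataset: no names, only quasi-identifiers.
anonymized = [
    {"record_id": 1, "zip": "80302", "birth_date": "1975-03-14", "diagnosis": "asthma"},
    {"record_id": 2, "zip": "80303", "birth_date": "1982-11-02", "diagnosis": "diabetes"},
]

# Hypothetical public list that includes names.
public = [
    {"name": "Alice Example", "zip": "80302", "birth_date": "1975-03-14"},
    {"name": "Bob Example", "zip": "80303", "birth_date": "1982-11-02"},
    {"name": "Carol Example", "zip": "80303", "birth_date": "1982-11-02"},
]

print(reidentify(anonymized, public, ["zip", "birth_date"]))
# {1: 'Alice Example'}
```

Record 1 gets a name attached even though the dataset never contained one. Record 2 stays ambiguous only because two people in the public list happen to share its attributes.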
It will be interesting to see how advertisers, social networks, governments, and end users respond to the reality that the separate categories we’ve built into our laws and contracts may not actually exist.