In 2014, the New York City Taxi and Limousine Commission (TLC) released a large "de-identified" dataset containing 173 million taxi rides taken in 2013. Soon after, computer scientist Anthony Tockar managed to reverse the hashed taxi registration numbers. Tockar went on to combine public photos of celebrities getting in or out of cabs with the trip data, recreating their journeys, including, it was alleged, trips that had started at strip clubs. See Anna Johnston's analysis here.
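How was such a reversal possible? The registration numbers had reportedly been hashed with plain, unsalted MD5, and the space of valid numbers is small enough to enumerate exhaustively, so the "de-identified" values can be inverted with a simple lookup table. Here is a minimal sketch of that style of attack; the ID format used is a simplified, hypothetical stand-in for the real medallion scheme.

```python
import hashlib
import string
from itertools import product

# Sketch of the style of attack reportedly used on the TLC release: the taxi
# numbers were hashed with plain MD5 and no salt, and the set of valid numbers
# is tiny, so every possible hash can be precomputed and inverted by lookup.
# The candidate format below (digit, letter, digit, digit) is a simplified,
# hypothetical stand-in for the real medallion formats.

def candidate_ids():
    for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                      string.digits, string.digits):
        yield f"{d1}{letter}{d2}{d3}"

# Build a lookup table: hash -> original identifier (only ~26,000 entries here).
lookup = {hashlib.md5(cid.encode()).hexdigest(): cid for cid in candidate_ids()}

def deanonymise(hashed_value):
    """Recover the original taxi number from its 'anonymised' hash, if possible."""
    return lookup.get(hashed_value)

# Example: the 'de-identified' field in a trip record is just a lookup away.
print(deanonymise(hashlib.md5(b"5A71").hexdigest()))  # -> "5A71"
```

The lesson is that hashing a small, structured identifier space provides no meaningful anonymity at all.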
This re-identification demonstration has been used by some to bolster a general claim that anonymity is increasingly impossible.
On the other hand, medical research advocates like Columbia University epidemiologist Daniel Barth-Jones argue that the practice of de-identification can be robust and should not be so easily dismissed as impractical. The identifiability of celebrities in these sorts of datasets is a statistical anomaly, reasons Barth-Jones, and should not be used to frighten regular people out of participating in medical research on anonymised data. He wrote, in a law journal article, that:
"[Examining] a minuscule proportion of cases from a population of 173 million rides couldn't possibly form any meaningful basis of evidence for broad assertions about the risks that taxi-riders might face from such a data release." (emphasis added by me).
In his position, Barth-Jones is understandably worried that the re-identification of small proportions of special cases is being used to exaggerate the risks to ordinary people. But Barth-Jones belittles the risk of re-identification with exaggerations of his own. The assertion that the demonstration "couldn't possibly form any meaningful basis" overstates his case quite dramatically. The fact that any people at all were re-identified plainly does create a basis for concern for everyone.
Barth-Jones objects to any conclusion that "it's virtually impossible to anonymise large data sets", but in an absolute sense that claim is surely true. If any proportion of people in a dataset may be identified, then that dataset is plainly not "anonymous". Moreover, as statistical and mathematical techniques (like facial recognition) improve, and as more ancillary datasets (like social media photos) become accessible, the proportion of individuals who may be re-identified will keep going up.
[Readers who wish to pursue these matters further should look at the recent Harvard Law School online symposium on "Re-identification Demonstrations", hosted by Michelle Meyer, in which Daniel Barth-Jones and I participated, among many others.]
Both sides of this vexed debate need more nuance. Privacy advocates have no wish to quell medical research per se, nor do they call for absolute privacy guarantees, but we do seek full disclosure of the risks, so that the cost-benefit equation is understood by all. One of the obvious lessons in all this is that "anonymous" or "de-identified" are useless descriptions. We need tools that meaningfully describe the probability of re-identification.
And we need policy and regulatory mechanisms to curb inappropriate re-identification.
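By way of illustration, the sketch below shows one simple way such a measurement tool might quantify risk: computing k-anonymity and the share of records that are unique on a chosen set of quasi-identifiers. The records and field names are invented for the example.

```python
from collections import Counter

# A minimal sketch of the kind of measurement tool argued for above: instead of
# declaring a dataset "anonymous", report how identifiable its records actually are.
# The records and quasi-identifier fields below are hypothetical examples.

records = [
    {"postcode": "2010", "birth_year": 1978, "sex": "F", "diagnosis": "..."},
    {"postcode": "2010", "birth_year": 1978, "sex": "F", "diagnosis": "..."},
    {"postcode": "2031", "birth_year": 1952, "sex": "M", "diagnosis": "..."},
]
quasi_identifiers = ("postcode", "birth_year", "sex")

def reidentification_report(records, quasi_identifiers):
    """Report k-anonymity and the share of records unique on the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    sizes = [groups[tuple(r[q] for q in quasi_identifiers)] for r in records]
    return {
        "k_anonymity": min(sizes),                       # worst-case group size
        "share_unique": sum(s == 1 for s in sizes) / len(sizes),
        "share_below_k5": sum(s < 5 for s in sizes) / len(sizes),
    }

print(reidentification_report(records, quasi_identifiers))
# -> {'k_anonymity': 1, 'share_unique': 0.333..., 'share_below_k5': 1.0}
```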
I argue that the act of re-identification ought to be treated as an act of Algorithmic Collection of PII, and regulated as just another type of collection, albeit an indirect one. If a statistical process results in a person's name being added to a hitherto anonymous record in a database, it is as if the data custodian went to a third party and asked them "do you know the name of the person this record is about?". The fact that the data custodian was clever enough to avoid having to ask anyone about the identity of people in the re-identified dataset does not alter the privacy responsibilities arising. If the effect of an action is to convert anonymous data into personally identifiable information (PII), then that action collects PII. And in most places around the world, any collection of PII automatically falls under privacy regulations.
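To make the point concrete, the following hypothetical sketch shows a linkage attack: an ostensibly anonymous trip record is joined to auxiliary, named observations on shared attributes, and a name is attached to the record; the effect is a collection of PII, whether or not anyone was ever asked. All fields and values here are invented.

```python
# A minimal sketch of "algorithmic collection": re-identification by linkage has
# the same effect as asking a third party for someone's name. All names, fields
# and values are invented for illustration.

deidentified_trips = [
    {"pickup": "W 44th St", "dropoff": "JFK", "time": "2013-07-04T08:15"},
]
auxiliary_sightings = [  # e.g. gleaned from timestamped public photos or posts
    {"name": "A. Celebrity", "place": "W 44th St", "time": "2013-07-04T08:15"},
]

def link(trips, sightings):
    """Attach names to 'anonymous' trip records by matching place and time."""
    reidentified = []
    for trip in trips:
        for s in sightings:
            if trip["pickup"] == s["place"] and trip["time"] == s["time"]:
                reidentified.append({**trip, "name": s["name"]})  # PII now collected
    return reidentified

print(link(deidentified_trips, auxiliary_sightings))
```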
It looks like we will never guarantee anonymity, but the good news is that for privacy, we don't need to. Privacy is the protection you need when your affairs are not anonymous, for privacy is a regulated state where organisations that have knowledge about you are restrained in what they do with it. Equally, the ability to de-anonymise should be restricted in accordance with orthodox privacy regulations. If a party chooses to re-identify people in an ostensibly de-identified dataset, without a good reason and without consent, then that party may be in breach of data privacy laws, just as they would be if they had collected the same PII by conventional means like questionnaires or surveillance.
Surely we can all agree that re-identification demonstrations serve to cast light on the claims made by governments, for instance, that certain citizen datasets can be anonymised. In Australia, the government is now implementing telecommunications metadata retention laws in the interests of national security; the data, we are told, is de-identified and "secure". In the UK, the National Health Service plans to make de-identified patient data available to researchers. Whatever the merits of data mining in diverse fields like law enforcement and medical research, my point is that any government's claims of anonymisation must be treated critically (if not sceptically), and subjected to strenuous and ongoing privacy impact assessment.