I recently released my latest Constellation ShortList™ for “Data Protection Infostructure”. In this blog post and the next, I drill into what these sorts of solutions are seeking to do.
“Data protection” in many parts of the world is simply synonymous with data privacy. For instance, the General Data Protection Regulation (GDPR) is a data privacy regime; it is very specifically about limiting the flow of personal information, a special class of data. Further, Europeans tend to refer to privacy regulators as Data Protection Authorities, and the conventional privacy compliance tool is the Data Protection Impact Assessment.
So “Data Protection” in Europe has a narrow, even technical, meaning.
Now, regular readers will know I am a huge fan of regulating the collection, use and disclosure of personal information. A great many problems of the digital era, from Surveillance Capitalism to Deep Fakes, can be tackled by more strenuous and creative application of the regular privacy rules featured in most legal systems.
Nevertheless, there is more to data protection than privacy. Privacy by its nature is restrictive. I’d like to spark a broader discussion about what it is about data that needs protecting. We could begin by asking, What is it that makes data valuable?
First let’s review how security professionals think about data.
Conventional wisdom in data security is that threats to information assets can be viewed in three different dimensions: Confidentiality, Integrity and Availability (or “C-I-A”). Different asset classes need different degrees of protection in each of these dimensions. For instance, patient information needs to be especially confidential, but medical records also need to have high availability if they are to be useful at a point of care, and high integrity (error resistance) to keep patients safe.
On the other hand, historical employee records — often retained for legal reasons for seven years or more — might not need to be highly available, so archiving on magnetic tape or even paper is worthwhile to keep personal data away from hackers.
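As a rough illustration of that way of thinking, here is a minimal Python sketch of a C-I-A requirement profile for the two asset classes just mentioned. The `Level` scale, the `CiaProfile` type, the `offline_storage_acceptable` helper and the specific ratings are all my own assumptions for illustration; they do not come from any standard or product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale for how strongly a dimension must be protected."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class CiaProfile:
    """Hypothetical C-I-A requirement profile for one class of information asset."""
    confidentiality: Level
    integrity: Level
    availability: Level


# Illustrative ratings only, loosely following the two examples in the text.
patient_records = CiaProfile(
    confidentiality=Level.HIGH,
    integrity=Level.HIGH,
    availability=Level.HIGH,
)
archived_employee_records = CiaProfile(
    confidentiality=Level.HIGH,
    integrity=Level.MEDIUM,  # assumption: the text does not rate integrity here
    availability=Level.LOW,
)


def offline_storage_acceptable(profile: CiaProfile) -> bool:
    """Offline media (tape, paper) suit assets that need not be readily available."""
    return profile.availability == Level.LOW
```

Under these assumed ratings, `offline_storage_acceptable(archived_employee_records)` is true and `offline_storage_acceptable(patient_records)` is false, which matches the reasoning above.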
But the “C-I-A” perspective is missing so many of the richer dimensions that make data valuable.
Consider three current hot topics:
- Identity Theft is generally perpetrated by data thieves who acquire personal data and use it to impersonate their victims. The problem is that automated identification systems can’t tell if personal data is being presented by the individual concerned or by an imposter (see also my analysis of data breaches).
- Deep Fakes are images or audio that look or sound like real people but have actually been synthesised artificially (typically by Generative AI) instead of recording the real thing.
- And speaking of AI, there is increasing interest in the history of how models are trained. What sort of training data was used? Was it broad and deep enough to be free of bias? Were people in that data aware that it would be used to train AIs?
Availability, Integrity and Confidentiality are not useful ways to think about safeguarding data in any of these cases. Think about how most LLMs today are trained on "public domain" data. No matter where you stand on the question of creators' intellectual property rights, we would all agree it's too late to make the artworks in question confidential.
Instead of "C", "I" or "A", stakeholders across these and similar examples may want assurances that:
- personal data submitted by a purported individual opening an account or applying for a job was really presented by that person
- creative works used to train an AI model have been licensed for use
- medical data used to train a diagnostic tool has been audited for bias and came from patients who gave informed consent
- the science behind a diagnostic tool has been properly evaluated, and
- software used to generate a particular result was version controlled and can be wound back to an earlier release if bugs are found.
From one digital use case to another, there will be different aspects or qualities of the data concerned that make the data fit for purpose — or in other words, valuable.
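To make the idea of "fit for purpose" a little more concrete, here is a minimal Python sketch that treats some of the assurances listed above as flags carried with a dataset, and checks them against what a given use case requires. Everything here is hypothetical: the `DataAssurances` fields, the use case names in `REQUIRED` and the `missing_assurances` helper are illustrative assumptions, not a description of any real product or standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAssurances:
    """Hypothetical assurance flags attached to a dataset; names are illustrative."""
    presented_by_subject: bool = False
    licensed_for_training: bool = False
    audited_for_bias: bool = False
    informed_consent: bool = False


# Hypothetical mapping from a use case to the assurances it needs.
REQUIRED = {
    "account_opening": {"presented_by_subject"},
    "train_art_model": {"licensed_for_training"},
    "train_diagnostic_tool": {"audited_for_bias", "informed_consent"},
}


def missing_assurances(data: DataAssurances, use_case: str) -> set[str]:
    """Return the required assurances that the data does not carry."""
    return {name for name in REQUIRED[use_case] if not getattr(data, name)}
```

For example, a set of medical scans built as `DataAssurances(audited_for_bias=True)` would still be missing `informed_consent` for the `train_diagnostic_tool` use case. None of these flags maps onto "C", "I" or "A".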
In my next blog post, I will focus on one such dimension that’s missing from the traditional C-I-A picture: the origins of data.