What exactly makes any data valuable?
In my previous blog, The Future of Data Protection, I started to look at what it is that makes data valuable. I think this is the best way to frame the challenge: in each application, we must know where the value in a piece of data lies if we are to protect it.
There are so many different things that might matter about a piece of data and thus make it valuable:
- Authorship, including the authority or reputation of the author(s).
- Evidence, references, peer review, repeatability and so on.
- In the case of identifiable (personal) data, the individual’s consent to have the data processed.
- Details of the data collection process, ethics approval, or instrumentation as applicable.
- The algorithms (including software version numbers) used in analytics or automated decisions.
- Data processing system audits.
- Sometimes the locality or jurisdiction where data has been held is important.
- As data is added to over time, who were the contributors, and what were their affiliations?
- The release of data to the public or specific users may need specific approvals.
- What rights or conditions attach to released data regarding further use or distribution?
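Taken together, these attributes amount to a structured record that would have to travel with the data itself. As a very rough sketch (the field names below are illustrative, not drawn from any standard or schema), such a record might look like this:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProvenanceRecord:
    """Illustrative only: the kinds of attributes listed above, as a record
    that would have to accompany the data. Field names are hypothetical."""
    authors: list[str]                                      # authorship, authority, reputation
    affiliations: list[str] = field(default_factory=list)   # contributors' affiliations
    evidence: list[str] = field(default_factory=list)       # references, peer review, repeatability
    consent_reference: Optional[str] = None                 # consent to process personal data
    collection_details: Optional[str] = None                # method, ethics approval, instrumentation
    algorithm_version: Optional[str] = None                 # analytics / decision software versions
    audit_reference: Optional[str] = None                   # data processing system audits
    jurisdiction: Optional[str] = None                      # where the data has been held
    release_approvals: list[str] = field(default_factory=list)
    licence_terms: Optional[str] = None                     # conditions on further use or distribution
```

The point is less the particular fields than the fact that none of this travels automatically with raw bits.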
A lot of this boils down to origin. Where did a given piece of data come from?
This simple question is inherently difficult to answer for most data, because raw data, of course, is just ones and zeros, which can be copied ad infinitum at near zero cost.
But several interesting approaches are emerging for telling the story behind a piece of data; that is, conveying its origins. These are some of the first examples of the category of solutions I call Data Protection Infostructure.
Proof of personhood
How can we tell human authors and artists from robots? Or new bank account applicants from bots? The rise of Generative AI and synthetic identities has driven the need to know if we are dealing with a person or an automaton.
Identity crime is frequently perpetrated using stolen personal data. To fight this, we need to know not just the original source of identification data but also the source of each presentation. In other words, what path did a piece of important data take to get to where it needs to be used?
A sub-category of Data Protection Infostructure is emerging around proof of personhood.
Delivering this sort of assurance in a commercially sustainable way is proving harder than it looks. Only recently, an especially promising start-up, IDPartner Systems, led by digital identity veteran Rod Boothby, was unexpectedly wound up.
Content Provenance
A conceptually elegant approach, with plenty of technical precedent, is to digitally sign important content at the source to convey its provenance. That's how code signing works.
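To make the mechanics concrete, here is a minimal sketch of that generic signing pattern, using the Python cryptography library's Ed25519 signatures. It illustrates digital signing in general, not any particular content provenance standard:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The creator (or their authoring tool) holds a private signing key;
# the matching public key is what relying parties use to verify.
signing_key = Ed25519PrivateKey.generate()
verify_key = signing_key.public_key()

content = b"Important content, authored or captured at the source."

# Sign the content at the point of origin.
signature = signing_key.sign(content)

# Later, anyone holding the (certified) public key can check that the
# content is exactly what was signed.
try:
    verify_key.verify(signature, content)
    print("Content verified against the signer's key.")
except InvalidSignature:
    print("Content has been altered, or was signed by a different key.")
```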
The Coalition for Content Provenance and Authenticity (C2PA) is developing a set of PKI-based standards with which content creators can be endorsed and certified with individual signing keys. C2PA will be implemented within existing authority and reputation structures such as broadcast media licensing, journalist credentialing, academic publishing and peer review.
Similar proposals are in varying stages of development for watermarking generative AI outputs and for digitally signing photographic images immediately after capture, within camera sensors.
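Conceptually, these schemes attach signed assertions about a piece of content (who made it, with what device or tool, what edits were applied) alongside the content itself. The sketch below shows only the shape of that idea; it does not follow the actual C2PA manifest format or its serialization:

```python
import hashlib
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

creator_key = Ed25519PrivateKey.generate()  # in practice, certified under a PKI

image_bytes = b"...raw image data..."

# Hypothetical manifest: assertions about the content, bound to a hash of it.
manifest = {
    "content_sha256": hashlib.sha256(image_bytes).hexdigest(),
    "assertions": {
        "creator": "Example Newsroom",
        "capture_device": "ExampleCam sensor, firmware 3.1",
        "edits": ["crop", "exposure adjustment"],
    },
}

# Sign the manifest so the assertions cannot be altered after the fact.
manifest_signature = creator_key.sign(json.dumps(manifest, sort_keys=True).encode())
```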
Confidential Computing
The path taken by any important data today can be complicated.
For example, powerful AI-based image processing is now built into many smartphone cameras; the automatic manipulation of regular photographs can be controversial, even if it’s not intended to mislead.
And the importance of Big Data and AI in all sorts of customer management and decision support systems has led to strengthened consumer protections (most notably in the European Union’s AI Act) to provide algorithmic accountability and explainability.
So, data now flows through complex and increasingly automated supply chains. Signing important data “at the source” isn’t enough when it goes through so many perfectly legitimate processing stages before reaching a consumer or a decision maker. Data may be transformed by AI systems that have been shaped by vastly greater volumes of training data. Moreover, those AI models may be evolving in real time, so the state of an algorithm or software program might be just as important to a computation as the input data was.
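One way to picture what accounting for every change could involve is a chain of signed processing records, where each stage commits to the hash of its input, the hash of its output, and a description of the transformation, including software or model versions. The sketch below assumes invented conventions (the record fields and helper functions are hypothetical) and is only meant to show the shape of such a chain:

```python
import hashlib
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def record_step(stage_key: Ed25519PrivateKey, prior_record_hash: str,
                input_data: bytes, output_data: bytes, transformation: str) -> dict:
    """Hypothetical signed record of one stage in a data value chain."""
    record = {
        "prior_record": prior_record_hash,        # links back to the previous stage
        "input_sha256": sha256_hex(input_data),
        "output_sha256": sha256_hex(output_data),
        "transformation": transformation,         # e.g. software or model version
    }
    # Sign the record (without the signature field) with this stage's key.
    record["signature"] = stage_key.sign(json.dumps(record, sort_keys=True).encode()).hex()
    return record


# Example chain: a camera captures raw data, then an AI pipeline enhances it.
camera_key = Ed25519PrivateKey.generate()
pipeline_key = Ed25519PrivateKey.generate()

raw = b"raw sensor data"
enhanced = b"AI-enhanced image"

step1 = record_step(camera_key, prior_record_hash="",
                    input_data=raw, output_data=raw,
                    transformation="capture: sensor firmware 1.2")
step2 = record_step(pipeline_key,
                    prior_record_hash=sha256_hex(json.dumps(step1, sort_keys=True).encode()),
                    input_data=raw, output_data=enhanced,
                    transformation="enhance: image model v4.7")
```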
And we haven’t even touched on all the cryptographic key management needed for reliable signing and scalable verification.
For these reasons and more, there is an urgent need to safeguard data value chains in their entirety — from the rawest of raw data, before it leaves the silicon, through all processing and transformations. We are approaching a point in the growth of computing where every change to every piece of data needs to be accounted for.
Such a degree of control might seem fanciful, but the Confidential Computing movement has the vision and, moreover, the key technology players needed to fundamentally harden every single link in the data supply ecosystem.
See also
- Blog post Confidential but in the limelight
- Blog post Confidence in computing
- Constellation ShortList™ for Data Protection Infostructure, and
- my recent interview with Larry Dignan on the Ins and Outs of Confidential Computing.