The release of OpenAI’s GPT-4 on 14 March 2023 was met with unparalleled hype. The company claims the model is more accurate, better at problem-solving, and more powerful than OpenAI’s previous model, GPT-3.
But with GPT-4, has OpenAI solved the privacy and data protection risks associated with GPT-3? Indeed, can large language models (LLMs) co-exist with strong data protection laws, such as the EU General Data Protection Regulation (GDPR)?
This article will consider the privacy and data protection risks raised by LLMs, with specific reference to GPT-4 and OpenAI’s policies.
AI Systems and ‘Privacy Attacks’
Privacy risks do not always result in material harm or distress to individuals; intrusions on privacy can also cause broader societal harms. But AI systems like GPT-4 do raise some privacy risks for individuals.
Let’s look at two “privacy attacks” that can occur with AI systems.
Membership Inference Attack
One privacy risk is that a bad actor could probe an AI system’s outputs to work out whether a particular person’s data was included in its training set. This is sometimes called a “membership inference attack”.
For public LLMs, such as ChatGPT, the consequences of an attack like this are likely to be minor. The model is trained on publicly available data, and so knowing that an individual’s data is included in the training set does not reveal much about that person.
But the risk is more serious in other contexts, such as where GPT-4 is integrated into a private AI system used for particularly sensitive purposes.
For example, if an AI system is used in a clinical setting, revealing whose data was used to train it could disclose sensitive facts about those people, such as the fact that they received treatment.
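To make the idea concrete, here is a minimal Python sketch of one simple form of membership inference: a likelihood-threshold test. It assumes a toy character-bigram model standing in for a real language model. The function names, the invented records, the vocabulary size of 128, and the threshold of -4.25 are all hypothetical choices for illustration, not anything drawn from GPT-4 or OpenAI's systems.

```python
import math
from collections import defaultdict


def train_bigram_model(texts):
    """Count character bigrams: a tiny stand-in for a language model."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in texts:
        for first, second in zip(text, text[1:]):
            counts[first][second] += 1
    return counts


def avg_log_likelihood(counts, text, vocab_size=128):
    """Average per-bigram log-probability with add-one smoothing.

    vocab_size is an assumed constant used only for smoothing.
    """
    total = 0.0
    for first, second in zip(text, text[1:]):
        following = counts.get(first, {})
        numerator = following.get(second, 0) + 1
        denominator = sum(following.values()) + vocab_size
        total += math.log(numerator / denominator)
    return total / max(len(text) - 1, 1)


def likely_member(counts, text, threshold):
    """Guess 'was in the training set' when the model finds the text unusually likely."""
    return avg_log_likelihood(counts, text) > threshold


# Invented records; no real people or data.
training_texts = ["alice smith has diabetes", "bob jones has asthma"]
member_text = "alice smith has diabetes"
non_member_text = "carol white has migraines"

THRESHOLD = -4.25  # hand-picked for this toy example only

model = train_bigram_model(training_texts)
print(likely_member(model, member_text, THRESHOLD))
print(likely_member(model, non_member_text, THRESHOLD))
```

In this toy setup the record the model was trained on scores noticeably higher than the unseen one, which is the signal an attacker looks for. A real attack would calibrate the threshold against texts known to be outside the training set rather than picking it by hand.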
Model Inversion Attack
GPT-4 can be used to enhance the information that a user already has about a person. This is sometimes known as a “model inversion attack”.
OpenAI’s GPT-4 Technical Report states that “GPT-4 has the potential to be used to attempt to identify individuals when augmented with outside data”. However, the company says that it has taken several steps to mitigate this risk, including:
- Fine-tuning models to reject privacy attacks.
- Removing personal data from training data “where feasible”.
- Monitoring users’ attempts to identify individuals.
- Prohibiting such attempts in the software’s terms of service.
Unlike with previous AI models, OpenAI is releasing very limited information about GPT-4. The company cites the “competitive landscape” and safety concerns to justify this change in policy. Therefore, it’s hard to say how effective these mitigations will be.
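OpenAI has not published how it removes personal data from training sets, so the following is only a rough Python sketch of what a basic scrubbing step could look like, and of why “where feasible” is an important caveat. The regular expressions, placeholder tokens, and the `redact_personal_data` function are assumptions made for illustration.

```python
import re

# Deliberately simple, hypothetical patterns. Real pipelines need far more
# than two regular expressions to find personal data reliably.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_personal_data(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or +44 20 7946 0958."
print(redact_personal_data(sample))

# Names and other free-text identifiers pass straight through.
print(redact_personal_data("Jane Doe lives in Leeds."))
```

Pattern matching catches structured identifiers such as email addresses, but names, locations, and contextual details are left untouched. That gap is one reason it is hard to judge how much personal data remains after this kind of filtering.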
Large Language Models and the GDPR: A Match Made in Hell?
As well as privacy risks, AI systems present data protection compliance issues.
The GDPR’s “principles of data processing” require organizations to collect and use personal data in a responsible way. But these principles are arguably at odds with the massive data collection necessary to train an LLM.
All the GDPR principles are relevant to AI. Here are some examples of how the GDPR’s principles might conflict with the development and use of LLMs such as GPT-4.
Lawfulness, Fairness, and Transparency
Under the “lawfulness, fairness, and transparency” principle, organizations must use personal data in a way that:
- Complies with the law.
- Does not unnecessarily contradict people’s reasonable expectations.
- Is as clear and transparent as possible.
The lawfulness, fairness, and transparency principle relates to the GDPR’s notice requirements.
- People have a right to know what’s happening with their data and how to exercise their data protection rights.
- Under Article 14 of the GDPR, organizations should normally notify people when they’ve obtained their personal data from a third-party source.
Because of how personal data is collected to train LLMs such as GPT-4, it’s arguably almost impossible to fulfil the GDPR’s transparency obligations.
There are exceptions to the GDPR’s notification requirements—but it’s hard to say whether OpenAI’s activities would fall under any of these exceptions.
The “lawfulness, fairness, and transparency” principle also links to the GDPR’s “legal basis” provisions.
Organizations must have one of six legal bases for processing personal data. These include where a person has provided consent, where there is a legal obligation, or where the processing is in the organization’s “legitimate interests”.
Given how little information OpenAI has released, we can only speculate about which legal basis might apply to its AI-training activities.
Purpose Limitation
The “purpose limitation” principle requires organizations to only collect personal data for “specified, explicit, and legitimate purposes”. The principle also restricts how organizations process personal data for unrelated further purposes.
Purpose limitation is a cornerstone of data protection. People provide personal data in a specific context (for example, posting on Reddit), with a reasonable expectation that this personal data won’t be used for unrelated further purposes (such as training an AI system).
There are exceptions to the general “purpose limitation” rule. For example, organizations can reuse personal data for research or statistical purposes, or for archiving in the public interest, even when the data was originally provided for a different purpose.
However, even when an exception applies, organizations must put appropriate privacy and security safeguards in place.
OpenAI might argue that one of the GDPR’s “purpose limitation” exceptions applies to the training of GPT-4. Others, including data protection regulators, might disagree.
Data Minimization
Under the “data minimization” principle, organizations must not use more personal data than is necessary for a given purpose.
Training an LLM requires vast amounts of data, including personal data. We don’t know how much data was collected to train GPT-4. However, we know much more about GPT-4’s predecessor, GPT-3.
GPT-3’s training data consisted of books, Wikipedia articles, and, crucially, data scraped from the open web, filtered down from around 45 terabytes of raw text. We can assume that training GPT-4 required even more data than GPT-3.
It’s hard to say how much of GPT-4’s training data is “personal data”, but a lot of it almost certainly is. It’s possible that this data was not collected in accordance with the “data minimization” principle.
Accuracy
The “accuracy” principle requires personal data to be accurate and up-to-date. Inaccurate personal data should be corrected as soon as possible. Accuracy is a major issue for text generation engines such as GPT-4.
LLMs tend to “hallucinate” (generate inaccurate outputs). Some people estimate that up to 20% of GPT-3’s outputs were hallucinations. GPT-4 may improve on this figure but will not eliminate hallucinations.
When an AI researcher asked GPT-3 about herself last year, the model gave numerous false answers, including that she was a model and a hockey player.
GPT-4’s outputs will likely include similarly inaccurate personal data. There is no clear way for individuals to rectify inaccurate personal data created in this way.
Storage Limitation
Under the “storage limitation” principle, organizations must only keep personal data as long as it is needed for a specified purpose.
Storage limitation can be an issue for AI systems, where personal data included in training sets may persist indefinitely.
A 2020 European Parliament report noted that there is “undoubtable tension between the AI-based processing of large sets of personal data and the principle of storage limitation”.
Personal data can be stored for longer periods where necessary for research or statistical purposes. However, it is not clear that OpenAI’s processing falls under this exception.
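For contrast, storage limitation in a conventional data store is often enforced with a simple retention rule. The Python sketch below shows one hypothetical version: the one-year retention period, the record layout, and the `purge_expired` function are illustrative assumptions, not a description of any organization's actual practice.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention period, chosen only for this example.
RETENTION_PERIOD = timedelta(days=365)


def purge_expired(records, now, retention=RETENTION_PERIOD):
    """Return only the records collected within the retention period."""
    cutoff = now - retention
    return [record for record in records if record["collected_at"] >= cutoff]


now = datetime(2023, 4, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "collected_at": datetime(2021, 1, 15, tzinfo=timezone.utc)},
    {"id": 2, "collected_at": datetime(2023, 2, 1, tzinfo=timezone.utc)},
]

kept = purge_expired(records, now)
print([record["id"] for record in kept])
```

A rule like this works because each record is a discrete row with a timestamp. Once personal data has shaped a trained model, there is no equivalent row to delete, which is the tension the European Parliament report describes.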
The principle of “storage limitation” is linked to the GDPR’s “right to erasure”. Later in the article, we’ll consider the issues that can arise for AI companies when fulfilling requests under the right to erasure.
Integrity and Confidentiality
The GDPR’s “integrity and confidentiality” principle requires organizations to take reasonable technical and organizational measures to ensure personal data is secure.
There are security risks when using an AI model such as GPT-4. OpenAI does not currently train GPT-4 on inputs provided by users. Therefore, personal data entered into (for example) ChatGPT today will not appear in ChatGPT’s outputs tomorrow.
However, OpenAI can access personal data inputted into some of its AI programs. The company uses these inputs to help develop its systems and may use them for training sets in future.
As such, the UK’s National Cyber Security Centre (NCSC) recommends not entering sensitive data (personal or otherwise) into public LLMs such as ChatGPT.
There is also a risk that GPT-4 can be used in cyberattacks. OpenAI’s report on GPT-4 notes that the model can be “useful for some subtasks of social engineering”, such as drafting phishing emails.
OpenAI’s Privacy Practices
OpenAI has been criticized for an alleged failure to comply with its obligations under the GDPR. Let’s look at some of the data protection issues that might be relevant to OpenAI as an organization.
Data Subject Rights
AI researcher Miguel Luengo-Oroz suggests that a neural network like GPT-4 cannot “forget” data that was present in its training sets—it can only adjust its algorithm to deprioritize data it deems less useful or relevant.
This presents a problem as it may be impossible for a company like OpenAI to fulfil requests under the GDPR’s “right to erasure”.
While neural networks like GPT-4 do not “contain” training data, their outputs can still contain personal data. If someone wished to eliminate the possibility of their personal data appearing in the model’s outputs, it is not clear how OpenAI could facilitate this request.
The right to erasure is not absolute. But other rights, such as the “right of access”, are much broader. It might also be infeasible for OpenAI to locate and provide a specific person’s data from a large and unstructured training set.
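A short Python sketch illustrates why locating one person's data in unstructured text is harder than it sounds. The documents, the name, and the `find_mentions` function are invented for illustration and do not reflect how OpenAI stores or searches its training data.

```python
def find_mentions(documents, name):
    """Return the indexes of documents containing the name (case-insensitive)."""
    needle = name.casefold()
    return [
        index
        for index, document in enumerate(documents)
        if needle in document.casefold()
    ]


documents = [
    "Jane Doe posted a review of the clinic.",
    "J. Doe replied to the thread from Leeds.",
    "jane doe's account was mentioned here.",
    "An unrelated post about the weather.",
]

matches = find_mentions(documents, "Jane Doe")
print(matches)
```

The exact-match search finds the first and third documents but misses the second, where the same person appears as “J. Doe”. It would also return posts about any other person who shares the name. Scaled up to terabytes of scraped text, both problems make a complete and accurate response to an access request difficult.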
Data Processing Agreement
The GDPR requires “data controllers” to enter a contract (called a “data processing agreement”) with “data processors” who process personal data on their behalf.
If a company inputs personal data into a product running GPT-4 (or another OpenAI model), OpenAI would normally be a data processor and require a data processing agreement with that company.
Despite offering AI tools to European companies for several years, OpenAI only created a data processing agreement in March 2023. Furthermore, the agreement appears only to cover OpenAI’s API products, and not ChatGPT.
Therefore, anyone inputting personal data into ChatGPT risks violating the GDPR’s rules on controllers and processors.
International Data Transfers
The GDPR places strict rules on how organizations transfer personal data out of the EU to “third countries” such as the US.
Before transferring personal data to a third-country company, EU-based organizations must implement one of the GDPR’s “international transfer safeguards” and ensure that the personal data cannot be accessed by foreign intelligence services.
There is currently a lot of uncertainty around data transfers, as the EU’s top court has ruled that the GDPR’s international transfer safeguards are not valid when transferring personal data to US-based organizations in certain contexts.
OpenAI states that it will only transfer personal data “pursuant to a legally valid transfer mechanism”. However, the company does not specify which of the GDPR’s data transfer mechanisms it is relying on.
Recent GDPR regulatory action suggests that there may be no feasible way to transfer personal data to some US companies without breaching the EU’s data transfer rules.
These data transfer issues have led some EU regulators to declare that the use of platforms like Google Analytics and Facebook Login is illegal under the GDPR. Similar issues might arise for OpenAI.