#HelloWorld. We originally thought this edition would focus on OpenAI’s attempts to self-regulate GPT usage, but the European Union had other plans for us. This past Thursday, news broke of an agreement to add generative AI tools to the AI Act, the EU’s centerpiece AI legislation. So today’s issue starts there, before discussing OpenAI’s and others’ recent announcements regarding training data access and usage. Let’s stay smart together. (Subscribe to the mailing list to receive future issues).
The EU’s Artificial Intelligence Act: The EU has been debating a proposed AI Act since 2018. In 2021, the European Commission published a proposed legislative framework that would classify AI products into one of four categories: unacceptable risk (and therefore forbidden); high risk (and therefore subject to regular risk assessments, independent testing, transparency disclosures, and strict data governance requirements); limited risk; and minimal risk. But this approach was developed before so-called “foundation models”—LLMs like ChatGPT and image generators like DALL-E and Midjourney—exploded into the public consciousness. So questions remained about whether the AI Act would be adjusted to accommodate this new reality.
The answer is yes. Although the text of the amendments has not yet been released, Euractiv and other news sources monitoring the EU legislative process reported at the end of last week that European Parliament members agreed to bring foundation models into the AI Act’s framework. Many of these models are likely to be regulated as “high risk” systems under the AI Act’s proposed taxonomy. And the amended AI Act contemplates at least two special requirements specific to foundation models:
- According to The Verge, model deployers would have to publicly disclose any “copyrighted material” used to train their models. The exact level of specificity required is unknown, but this disclosure mandate (if ultimately adopted) may prove exceedingly difficult. It’s no secret that OpenAI’s, Google’s, and other deployers’ foundation models are trained on massive scrapes of the entire web, which presumably include countless instances of copyrighted works.
- Downstream app developers fine-tuning or building on top of foundation models could also be subject to the AI Act’s heightened requirements, as part of a liability-shifting scheme. If these developers “substantially modify an AI system”—for instance, by fine-tuning it on new datasets—those actions could bring the downstream developers within the regulatory regime.
Much remains to be done before the AI Act is finally enacted. A full vote of the European Parliament is expected in mid-June. Then comes a so-called “trilogue” process among the Parliament and two of the EU’s other main arms (the Council of the European Union and the European Commission) to reconcile differences and arrive at final language for adoption. Whatever its final version, the AI Act is not expected to take effect before 2025.
OpenAI’s Self-Regulation: While the EU debates wide-ranging regulations, model developers are not passively awaiting their fate. OpenAI has reportedly made changes to ChatGPT in Italy to address the personal privacy and age-gating concerns that caused Italian regulators to temporarily ban the product in early April. According to statements sent to the AP and The Verge, OpenAI deployed a new web form that permits EU users to object to having their data used for model training; published a new help-center article describing how ChatGPT collects and uses personal data; and added an age-verification step to its sign-up process.
More generally, even outside the EU, OpenAI is adopting measures to give users more control over their data. Last week, OpenAI announced that users can now turn off their conversation history, effectively opting out of having those conversations used to “train and improve OpenAI models.” By default, OpenAI uses conversations to update its models; with history disabled, it will instead retain new conversations for only 30 days and “review them only when needed to monitor abuse, before permanently deleting.”
What we’re reading and following: With attention now increasingly focused on model training data, the next big question is valuation—how much will data holders charge and how much will model developers be willing to pay for access to that data? In the past two weeks, Reddit and Stack Overflow announced they will not be providing AI training access to their data for free, at least for use in commercial applications. Unsurprisingly, these data owners have not publicly shared what such access would cost. According to Wired, accessing 200 million tweets on Twitter today would run you $210,000 monthly—perhaps a useful benchmark for future negotiations in this space, and surely a sign that model training, already expensive, is about to get more so.
What should we be following? Have suggestions for legal topics to cover in future editions? Please send them to AI-Update@duanemorris.com. We’d love to hear from you and continue the conversation.
Editor-in-Chief: Alex Goranin
If you were forwarded this newsletter, subscribe to the mailing list to receive future issues.