The legal perils of training generative AI

Associate Andrew Wilson-Bushell discusses the legal implications of training generative AI, and takes a look at the publishing world's recent reaction to the technology.

“Write me a short mystery story in the style of Margaret Atwood, about a machine stealing creativity”.

A fairly innocuous prompt, with big ramifications. When plugged into ChatGPT (I used version 3.5) the results are surprisingly emotive. A rather erudite tale (considering the lack of subsequent prompts) is generated in which our heroine, Eleanor, “a writer whose words had once danced like fireflies on the pages” is locked in a discordant struggle against the ‘Artificium’, a sinister machine built “in a world that had lost its capacity to dream… to salvage brilliance from the abyss of mediocrity”. The tale builds to the climactic statement: “you’ll never possess the true source of inspiration—the human heart and mind”.

But what was ChatGPT’s true source of inspiration when creating this (rather technologically remarkable) story?

On 30 June 2023, the Authors Guild published its open letter to the CEOs of OpenAI, Alphabet, Meta, Stability AI and IBM calling for fair compensation for authors whose works are being used to train AI systems. On 18 July, the letter was reported to have gained over 10,000 signatures, including high-profile authors Dan Brown, Suzanne Collins, and Margaret Atwood herself, amongst others. The letter focusses on how AI systems are trained on authors’ works, their style, their choice of language and their ideas. In turn, these AI systems generate revenue for the technology companies (a subscription to ChatGPT Plus currently costs $20/month).

The Authors Guild’s open letter is the latest step, both in the US and across the world, in the developing battle between Big Tech and rights holders regarding the training of generative AI systems. In the UK, other industries have also moved to challenge Big Tech’s use of creatives’ work as training data. It is important to note, however, that the Authors Guild, in its letter, has chosen to focus on compensation, rather than prevention.

Even before the recent surge of AI development, analogous issues have been dealt with by Big Tech and creative industries. For example, Meta has recently been challenged by news organisations across the world for Facebook’s use of news on its site in the US. How creatives, whether that be writers, musicians, visual artists or otherwise, will be compensated if or when their works are used for training AI systems will continue to be a source of contention as these systems develop. The result of this process will ultimately come down to a variety of factors, including whether it is practically possible to prove that copyright works have been used as training data, collective bargaining strategies, regulatory input, industry codes of practice, and what the courts decide in the coming months and years in response to the various claims being brought at the moment.

Using copyright works as training data

Fundamentally, the debate around the use of works as training data for AI systems revolves around copyright, which most jurisdictions protect in some way, shape or form. In England, technology companies are generally not permitted (unless one of a limited number of exceptions applies) to scrape copyright works without permission, even if the works are available online. The US has a broader “fair-use” regime for use of copyright works, which the Authors Guild addresses head-on in its letter.

The Authors Guild points out that, where a company is scraping the internet for training data, at least a portion of the content ingested won’t have been made available legally in the first place, with books being made available on pirate websites and similar sources. A website’s terms and conditions may also seek to restrict the use of content for certain purposes.

There is a litany of litigation, predominantly in the US, around the free use of copyrighted works as training data. Here in England, various cases brought by Getty Images against Stability AI are proceeding through the courts. Getty claims that Stability AI used its library of images (which it would otherwise charge a licence fee for) to train the AI system Stable Diffusion. In that case, proving copying was made somewhat easier by the fact that Stable Diffusion generated images that appeared to include the Getty Images watermark.

Creatives are now gaining more avenues to check whether their works have been made available as training data. Services like Have I Been Trained search the LAION-5B and LAION-400M image datasets and provide an API to enable an opt-out service (although AI developers need to implement this API themselves, reducing its usefulness). This service has been a gateway for individual creatives to demonstrate infringement and bring legal actions against services like DeviantArt (in respect of DreamUp) and, again, Stability AI.

Industry reactions

This issue is not isolated to the book publishing world. Every creative industry is taking note of the use by AI systems of their copyright work.

In some cases, these concerns reveal themselves in lobbying efforts. For example, the music industry recently spoke out against the UK Government’s plans, now under review, to extend an exception that would permit the use of copyright works for text and data mining purposes.

In other areas, we might see the development of industry codes of practice. Typically, legislators are slow to keep up with the onward march of technological progress. The EU legislature is doing its best to keep pace, with the proposed EU AI Act currently working its way through the law-making process. That legislation, should it come into effect in the form currently proposed, provides for mechanisms to create industry-wide codes of practice for the use of AI systems (suggesting that stakeholders develop codes of practice that “may cover one or more AI systems taking into account the similarity of the intended purpose of the relevant systems”) and requires AI systems to comply with certain transparency requirements (such as requiring generative AI models to provide a summary of training data used).

Not just copyright?

Infringement of copyright has been the headline-grabbing consequence of training AI systems on information available online, but it is also worth briefly touching on other issues that creatives and technology providers should be aware of.

If training data includes personal data (such as authors’ biographies or likenesses), then (depending on the relevant jurisdiction of the data subject) data protection laws like the EU or UK GDPR may be engaged. Equally, in England, the Computer Misuse Act 1990 prohibits unauthorised access to computer material which (in theory at least) might capture unpermitted scraping of online content.

It is also worth considering any existing contractual terms that are in place. Website owners are increasingly taking the opportunity to update or emphasise existing terms that seek to prohibit scraping and using their information for commercial purposes such as training an AI system.

Finally, other rights aside from just copyright might exist in content that has been used as training data. In the UK and EU, separate legal rights known as “database rights” exist in structured content where a substantial investment has been made in obtaining, verifying or presenting the contents of a database.

Final thoughts

The creative industries being challenged by emerging technology is nothing new. The early 2000s saw the rise of piracy and peer-to-peer sharing threaten the existence of the music and publishing industries. Once the dust settles, the world will continue ever onward and incorporate AI into creative workflows. The task in the meantime is staking a claim for fair remuneration for creators.

With generative AI already adopted to such a degree that the digital cat is very much out of the proverbial bag, the availability and use of the technology are only going to increase. Whilst this article discusses the inevitable challenges with this (and does not even touch on the complications around the output of the generative AI systems), it is worth bearing in mind that there are also exciting opportunities. For example, even now, new skilled art forms are developing as people perfect ‘the art of the prompt’ to generate AI-powered art.

Andrew's article was published in World Intellectual Property Review, 1 September 2023, and can be found here.
