The following MBW Views op/ed comes from Ed Newton-Rex (pictured inset), CEO of the ethical generative AI non-profit, Fairly Trained.
A veteran expert in the world of gen-AI, Newton-Rex is also the former VP Audio at Stability AI, and the founder of JukeDeck (acquired by TikTok/ByteDance in 2019).
In this op/ed, Newton-Rex argues that “opt-out schemes in generative AI are not only egregiously unfair to rights holders – they simply do not work”. He adds: “They give rights holders the illusion of control over how their works are used; nothing more”.
Over to Ed…
If you’re a musician, try this exercise: go and opt out of all generative AI models training on your music.
Personally, over the years, I’ve published recordings on MySpace, Bebo, SoundCloud, Facebook, X, Spotify and other DSPs (via two or three different distributors), and my own website; I’ve published sheet music on MuseScore, Medium, on my website, and via two music publishing companies; I’ve had my music recorded by various groups, and shared online in audio and video formats; I have co-writing credits on a number of songs, and I’ve played in recordings for various bands and groups; and I’ve licensed music to various companies, productions and events, many of which have been recorded.
To be clear: opting all of this music out of generative AI training is all but impossible.
Despite this, there are legislators in various countries who are being convinced, thanks to lobbying from the most powerful and valuable AI companies, that providing a right to opt out of training would represent some sort of sensible compromise between AI companies (who scrape and train models on the world’s content) and creators and rights holders (who mostly view this as illegal and immoral). The UK government is the latest to reportedly favor this solution to the growing disagreement between the two sides – a disagreement that is existential for the creative industries.
But opt-out is no compromise. As I outline in this essay, opt-out schemes in generative AI are not only egregiously unfair to rights holders – they simply do not work. They give rights holders the illusion of control over how their works are used; nothing more.
The main problem with opt-out schemes – and there are many – is that, to put it plainly, they do not let you successfully opt your work out of training. Take the opt-out scheme that’s by far the most commonly used today, the Robots Exclusion Protocol. As a website owner, you can include directives in your site’s robots.txt file that forbid or permit access by named web crawlers.
The protocol was originally created in the 1990s to control crawler access for search engines, but it’s now also used to block crawlers gathering data for AI training.
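For concreteness, a minimal robots.txt using this mechanism might look like the sketch below. GPTBot and CCBot are real crawler user-agent tokens (OpenAI’s training crawler and Common Crawl’s crawler, respectively); nothing about the file is binding, as we’ll see in a moment.

```
# robots.txt, served at the root of your domain; it covers this domain only

User-agent: GPTBot       # OpenAI's training crawler
Disallow: /              # request that it fetch nothing

User-agent: CCBot        # Common Crawl, whose corpus is widely used for AI training
Disallow: /

User-agent: *            # everyone else, e.g. search engine crawlers
Allow: /
```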
Now, there are many problems with robots.txt as an opt-out scheme for generative AI training – it is observed only voluntarily, and companies that claim to adhere to it have been shown to ignore it. But even in a hypothetical world in which observing robots.txt exclusions were a legal requirement, using it to opt out does not successfully opt your works out of training.
It opts that web domain out of AI training – but your works will also be available in all sorts of other places. A composer’s music will be used in a YouTube video; a photographer’s photo will feature in an ad hosted on another site; a journalist’s article will be screenshotted and shared on social media. Opt-out schemes tied to URLs – which the most common opt-out schemes are – totally fail to opt out downstream copies of your work, of which there will be many. Your works will still be used to train the models, even though you opted out.
But the alternative – attaching metadata to the content itself – is no better. Metadata is routinely removed, sometimes intentionally and maliciously, but more commonly automatically. For instance, metadata is removed from an image when you share it to X. Attaching metadata to media to indicate that it is opted out of AI training is essentially meaningless, like leaving your bike unlocked but taping on a hastily scribbled note saying ‘do not steal’.
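How fragile this is can be seen in a few lines of code. The sketch below assumes Pillow is installed and that ‘photo.jpg’ stands in for any image whose EXIF metadata carries an opt-out notice; a routine platform-style re-encode discards that notice unless the pipeline deliberately copies it across.

```python
# Minimal sketch of metadata loss on re-encode (assumes Pillow; "photo.jpg" is a stand-in)
from PIL import Image

original = Image.open("photo.jpg")
print(dict(original.getexif()))   # the EXIF tags, where an opt-out notice would live

# A typical platform pipeline: resize for display, then re-save.
# Pillow, like most image libraries, drops EXIF unless it is explicitly passed through.
resized = original.resize((original.width // 2, original.height // 2))
resized.save("reencoded.jpg", quality=85)

print(dict(Image.open("reencoded.jpg").getexif()))   # -> {}: the notice is gone
```

Nothing in that pipeline is malicious; discarding metadata is simply the default behavior of most image-processing code, which is why metadata-based opt-outs evaporate the moment a work starts circulating.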
The fundamental issue is that there is no opt-out solution that successfully opts content itself out of training. Why not maintain a central repository of opted-out works, and scan AI training datasets for these works ahead of training, you may ask? Because automatic content recognition is simply not up to it. Taking this approach would allow modified versions of opted-out media to be trained on, as well as individual copyrighted elements that make up part of a larger copyrighted work – think a composition as part of a sound recording, the lyrics in a song, the dialogue in a film.
It would also permit training on bootlegged recordings of live performances, and works transposed from one medium to another (those screenshots of articles again).
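To make the limitation concrete, here is a deliberately simplified sketch, in pure Python, of a central opt-out registry built on exact content hashes (a byte string stands in for an opted-out recording). Real content-recognition systems use perceptual fingerprints that tolerate some transformations, but the same fundamental gap applies to excerpts, cross-medium copies and embedded elements.

```python
# Simplified sketch: why scanning training datasets misses derivative copies
import hashlib

def fingerprint(data: bytes) -> str:
    """Exact content hash: it matches only byte-identical copies."""
    return hashlib.sha256(data).hexdigest()

track = b"...bytes of an opted-out recording..."   # stand-in for the original work
registry = {fingerprint(track)}                    # hypothetical central opt-out repository

# Any derivative (a re-encode, a trim, a bootleg, a screenshot of lyrics)
# produces different bytes, and therefore a different fingerprint.
bootleg = track + b"\x00"                          # one changed byte is enough
print(fingerprint(bootleg) in registry)            # -> False: it sails into the training set
```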
The fact that opt-out schemes don’t actually give rights holders the power to opt their works out of training should be enough to disqualify them from being discussed as a solution to the argument between creators and AI companies. But there are many other reasons to consign opt-outs to the dustbin of terrible regulatory proposals.
The first and most obvious is the injustice of placing the huge administrative burden of opting out on rights holders, when the benefits of training accrue entirely to the AI companies, and training evidently harms rights holders (since generative AI competes with the work it’s trained on, and with the people behind that work). No opt-out solution yet conceived would opt out all of a given individual’s works in one step, so the work required to opt out everything you have created is gargantuan. On top of this, most people eligible for opt-outs miss the chance to take them up. To quote my essay:
A Cloudflare study from July 2024 found that only 16% of the 100 top-visited sites hosted on Cloudflare blocked AI crawlers (and only 8.8% of the top 1,000). Even among the most popular sites, opt-out take-up is relatively low: more than 40% of the top 100 English language news sites blocked no AI crawlers as of February 2024, a full 15 months after the release of ChatGPT – and these are the companies you would expect to be the most likely to be aware of opt-out availability.
When companies run opt-outs, they generally don’t share opt-out numbers, but, when they do, we see similarly low take-up: for instance, when AudioSparx ran an opt-out of a training deal with Stability AI for its members, only 10% opted out.
These incredibly low take-up numbers for opt-out schemes are in stark contrast to the high percentage of creators who, when asked their views in polls, insist that rights holders be paid for AI training: 93% of musicians in Australia and New Zealand, 94% of visual artists in the UK, and similar numbers in other groups. This huge discrepancy is easily explained: most people don’t know they can opt out, and if they do they don’t know how to opt out. This was made clear when Udemy ran an opt-out scheme, and users of the platform were angry to discover they had missed the opt-out window.
Of course, this should come as no surprise: we know that shifting from opt-in to opt-out schemes dramatically changes the behavior of the people eligible for the scheme. You run an opt-out scheme when you want most people to fail to opt out, whether through choice or through inattention. Governments that allow AI companies to train on copyrighted work without a license, only requiring them to adhere to opt-outs, are implicitly saying they want the majority of their country’s IP to be available to AI companies for free.
What’s more, while opt-out schemes are inevitably missed by all types of rights holders, they are particularly unfair on small rights holders and individual creators. The fewer visits your website gets, the less likely you are to take advantage of opt-outs. This is self-evident: individual creators, lacking the resources to monitor opt-out schemes, are clearly less likely to exercise their right to opt out. Opt-out schemes disproportionately penalize the very creators and small rights holders who most need our support.
Opt-out schemes also present an impossible choice to many rights holders. Large language models are becoming the norm in internet search, be it via Google’s AI Overviews or new products like Perplexity. But rights holders cannot use opt-out schemes to distinguish between allowing their content to be referenced in search results (in order to direct users to their websites) and allowing their content to be trained on (which enables the resulting trained model to output material based on, and competing with, their content). That is, if you opt out of AI training, you won’t appear in these LLM-powered search results – so you opt out of being findable on the internet. There is no real choice here.
And the timescales on which AI companies act on opt-out requests are totally unacceptable. No opt-out scheme in existence has required AI companies to retrain their models from scratch, using no trace of the opted-out works, and to remove previous models from circulation. Models are trained well in advance of going live (for instance, a year passed between the end of GPT-4’s training and its release to all developers), and they are usually live for a long time. When you opt out of training, it may be months or even years before models using your works are no longer in use. And if your works are used in open-source models – which they almost certainly are – those models will survive indefinitely, since they cannot simply be turned off at source.
Then there is the problem of synthetic data. Synthetic data is content created by a generative AI model, which is then used to train a new model. The models used to create synthetic data are often trained on copyrighted work. No opt-out scheme has ever required the removal from training datasets of synthetic data created using models trained on the opted-out works, and there is no hint that any future opt-out scheme will do so. So, even after you’ve opted out your works, and even after you’ve waited the months or years for this to take effect, content created using your works will still be used to train AI models. Opt-out schemes do nothing about this copyright laundering.
Outside of AI, opt-out schemes can make sense when failing to opt out brings no harm to the individual, or, sometimes, when the people eligible to opt out are given a clear, meaningful chance to do so. Some countries have adopted opt-out schemes for organ donation, pension contributions and email marketing. But generative AI training harms the person whose work is trained on, since it improves a model that will compete with them; and there is no clear point of engagement with the rights holder at which you can prominently present the option to opt out, like an unsubscribe link in a marketing email. There are situations in which opt-out schemes are appropriate. Sourcing data for training generative AI models is clearly not one of them.
Proponents of opt-out schemes for generative AI training – mostly AI companies – gleefully point to the EU’s AI Act, the first major legislation to codify generative AI opt-outs in law. But being the first to make a decision does not mean you’ve got it right. While there will no doubt be legal battles around its interpretation, AI companies are reading it as carte blanche to train on anything not opted-out. As such, it is a truly terrible piece of legislation from the perspective of rights holders. All of the problems with opt-outs I’ve outlined in this piece will become active issues: the impossibility of opting out copies of your work; the indefinite delay between opting out and models that use your work being switched off; the giant burden of opting out and the inescapable fact that most people will fail to opt out as a result; the huge, entirely unaddressed problem of synthetic data. The EU, in its rush to finalize legislation, has created a framework that entirely fails to protect the rights of creators and rights holders, essentially handing AI companies the majority of the world’s creative work, irrespective of copyright. This should not be taken as a model to be copied in other jurisdictions.
Today’s generative AI opt-out schemes are obviously dysfunctional. For instance, there is no sign that either of the two best-known AI music companies, Suno and Udio, observes opt-outs. But we shouldn’t get hung up on the limitations of existing opt-out schemes. All opt-out schemes for generative AI training – whether real or hypothetical – fail to achieve what they purport to achieve. As a rights holder, you cannot successfully use them to opt out your works, and the burden of work required to complete even an ineffective opt-out is enormous, which detrimentally affects all rights holders but is particularly punitive to individual creators.
We need to move past the facade of opt-out representing a compromise with AI companies that rights holders ought to accept. Opt-out is the desired outcome for AI companies, not a compromise. It gives AI companies access to content against rights holders’ wishes, while exonerating them of what now amounts to years of exploitation of the life’s work of millions of creators around the world; and it gives rights holders only the palest illusion of control, upending copyright law to all but ensure their works can be used to build highly scalable competitors that will supplant them in the market. Opt-in consent is the only reasonable path forward for generative AI training.