5 takeaways from Meta and Anthropic’s wins in US copyright lawsuits


The future of generative AI could be shaped by the complex relationship between AI models and the data used to train them.

AI companies pour billions into acquiring talent and purchasing the GPUs required to build large language models (LLMs). But data, the third critical resource used to train AI models, is often scraped from the internet without payment or permission. Creators have argued that this is unfair and unsustainable, as much of the training data is said to be protected by copyright.

How do copyright principles apply to AI? Have AI companies violated copyright law by not seeking permission to use creative work to train their AI models? If so, at what stage of AI model development did the infringement occur? Are there any legal exceptions?


These are just some of the questions that have largely gone untested in courts until now. But two recent court rulings have started to flesh out some answers.

Last week, a US district court ruled that Anthropic did not violate copyright law by using books to train its Claude AI models. The AI startup’s use of a database comprising scanned, purchased books combined with a specific training method was deemed by US District Judge William Alsup as being transformative enough to meet the standards of fair use.

In the same week, Meta also scored a win in a major copyright case, where US District Court Judge Vince Chhabria ruled that the AI training involved in developing its Llama models held up under the fair use doctrine of US copyright law.

At the outset, the summary judgments might seem like landmark victories for the two AI companies. However, a closer examination reveals that both rulings are extremely narrow. They settle little definitively, and instead lay bare the legal dilemmas that copyright poses for AI.


The precedent they set is still unclear, as the rulings keep the door open for creators, publishers, and other rights-holders to sue AI companies for copyright infringement, while also signalling which legal arguments are likely to succeed or fail in court. Both Meta and Anthropic still face separate allegations that they used pirated digital copies of millions of books to train their AI models.

Here are a few key takeaways from the two AI copyright rulings.

First, what is fair use?

US copyright law rests on a few fundamental questions such as: Did you make a copy of the copyrighted material? Did you have permission to do so? If not, does an exception like fair use apply?

Note that fair use is an affirmative defence. This means that the defendant acknowledges that a copy was made but argues that it was legally justified. The judge then evaluates this claim on a case-by-case basis. Under section 107 of the US Copyright Act, courts can consider four factors when determining fair use:


– The purpose and character of the use (whether it was for non-profit educational purposes or commercial ones).
– The nature of the copyrighted work.
– The amount and substantiality of the portion used in relation to the copyrighted work as a whole.
– The effect of the use upon the potential market for or value of the copyrighted work.

Takeaway 1: The methods used to train AI models

AI companies often use web crawlers and scrapers to find, download, and train their models on as much content as they can gather. Many of them are secretive about their training datasets as they are wary of such disclosures exposing them to copyright lawsuits.

In 2021, Anthropic co-founder Ben Mann downloaded Books3, a database of over 196,640 books, and used it for AI training even though he knew they were pirated copies, as per the ruling. He also downloaded five million pirated books from LibGen and two million pirated books from Pirate Library Mirror. These databases of pirated e-books included works authored by Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, the plaintiffs in the infringement case.

Three years later, Anthropic’s perspective on using pirated books for AI training changed, and the startup chose to bulk-purchase books from distributors and retailers for its research library. The physical copies of these purchased books were stripped of their bindings, their pages cut apart, and scanned into digital form. The court noted that Anthropic retained pirated copies of works even after it decided that it would not use a specific work for training purposes.


Meanwhile, Meta also torrented books to train its Llama models, with CEO Mark Zuckerberg allegedly giving the green light himself, as per court filings. The social media giant downloaded Anna’s Archive, a compilation of ‘shadow libraries’ including LibGen, Z-Library, and others, and torrented more than 80.6 terabytes of data from LibGen.

Takeaway 2: Legally acquiring data for AI training matters

The court found Anthropic’s purchase and scanning of physical books into digital copies for its library transformative enough to be protected by fair use.

“Anthropic purchased its print copies fair and square. With each purchase came entitlement for Anthropic to “dispose[ ]” each copy as it saw fit. So, Anthropic was entitled to keep the copies in its central library for all the ordinary uses,” Judge Alsup said.

“Here, every purchased print copy was copied in order to save storage space and to enable searchability as a digital copy. The print original was destroyed. One replaced the other. And, there is no evidence that the new, digital copy was shown, shared, or sold outside the company,” he added.


“Yes, Authors also might have wished to charge Anthropic more for digital than for print copies. And, this order takes for granted that the Authors could have succeeded if Anthropic had been barred from the format change,” the court held.

Takeaway 3: AI training is the same as human learning

AI companies have argued that training on copyrighted work is fair game, comparing it to how humans learn from the same material. But creators counter that the scale of AI training is vastly different. However, in the Anthropic case, the authors conceded that “using works to train Claude’s underlying LLMs was like using works to train any person to read and write.”

Regarding the authors’ argument that LLM training is intended to memorise the creative elements of their work, Judge Alsup said: “Yes, Claude has outputted grammar, composition, and style that the underlying LLM distilled from thousands of works. But if someone were to read all the modern-day classics because of their exceptional expression, memorise them, and then emulate a blend of their best writing, would that violate the Copyright Act? Of course not.”

Takeaway 4: AI training does not pose significant market harm

A key argument by creators and rights-holders is that AI models compete with their training data. This means that an AI music generator competes with the musicians whose creative works were used to train the model.


In the Anthropic case, the authors argued that AI training will result in an explosion of books that compete with their works. However, Judge Alsup ruled that the copies used to train the LLMs do not displace the demand for the copies of the authors’ work in a way that counts under the Copyright Act.

“[The] Authors’ complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works,” he said. When the authors argued that Anthropic’s training activity had a negative impact on the market for licensing content to AI companies, the court said that such a market is “not one the Copyright Act entitles Authors to exploit.”

In the Meta case, Judge Chhabria held that the authors had failed to present a compelling argument that the company’s use of books to train Llama caused “market harm.”

“On this record Meta has defeated the plaintiffs’ half-hearted argument that its copying causes or threatens significant market harm. That conclusion may be in significant tension with reality,” he said.


But he also noted several flaws in Meta’s defence. “Meta seems to imply that such a ruling would stop the development of LLMs and other generative AI technologies in its tracks. This is nonsense,” Judge Chhabria wrote.

Takeaway 5: Using pirated material for AI training could spell trouble

While the court ruled that Anthropic’s use of purchased books was transformative, it did not agree that creating a central library for LLM training was transformative as well.

This is because Anthropic’s central library of copyrighted material also comprised seven million pirated e-books. “Pirating copies to build a research library without paying for it, and to retain copies should they prove useful for one thing or another, was its own use — and not a transformative one,” Judge Alsup wrote.

“This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use,” the court noted.


It said that a trial will be held later on Anthropic’s use of pirated copies for AI training. Additionally, Judge Chhabria said the parties would meet on July 11 to “discuss how to proceed on the plaintiffs’ separate claim that Meta unlawfully distributed their protected works during the torrenting process.”

What next?

The copyright cases against Meta and Anthropic focused on infringement at the training stage of AI development, as opposed to the inference stage, where the models’ outputs would have to be evaluated.

In the Anthropic case, the court stopped short of evaluating model outputs, as the authors did not argue that any infringing content ever reached users and focused instead on the input end. “If the outputs that the users saw had infringing content, the case would have been different,” Judge Alsup noted.

In the Meta case, the authors argued that its use of copyrighted content was not covered under fair use as its Llama models would output material that “mimics” their work if prompted to do so. But the court found that even using “adversarial” prompts could not get Llama to produce more than 50 words of any of the authors’ books.

These rulings could influence the copyright lawsuits against OpenAI brought by The New York Times in the US and the news agency ANI in India.


