Anthropic is being sued for copying books to train Claude

Link post

OpenAI faces 10 copyright lawsuits, and Anthropic is now being sued as well. Whether or not you agree with copyright law, this is worth looking into: lawsuits could hinder AI companies from scaling further.

The recent filing against Anthropic is notable because the plaintiffs have evidence that Anthropic copied their works. Because they could identify their own works in the training data, the case against Anthropic is much stronger.

Here is the core argument from the filing:
In a December 2021 research paper on large language model training, Anthropic described creating a dataset “most of which we sourced from the Pile” and which included “32% internet books,” a code word in the industry for pirated copies of books available on the internet.
More recently, in July 2024, Anthropic publicly acknowledged that it used The Pile to train its Claude models. As reported by Proof News, company spokesperson Jennifer Martinez “confirm[ed] use of the Pile in Anthropic’s generative AI assistant Claude.” Anthropic confirmed the same to Vox News. Independent researchers have tested Claude to shed light on the composition of its training set, and their work has confirmed a high likelihood that Claude was trained on copyrighted books.
Anthropic thus copied and exploited a trove of copyrighted books—including but not limited to the books contained in Books3—knowing that it was violating copyright laws. Instead of sourcing training material from this modern-day Napster’s pirated troves of copyrighted books, Anthropic could have sought and obtained a license to make copies of them. It instead made the deliberate decision to cut corners and rely on stolen materials to train its models.
…
Anthropic, in taking authors’ works without compensation, has deprived authors of book sales and licensing revenues. There has long been an established market for the sale of books and e-books, yet Anthropic ignored it and chose to scrape a massive corpus of copyrighted books from the internet, without even paying for an initial copy.
Anthropic has also usurped a licensing market for copyright owners. In the last two years, a thriving licensing market for copyrighted training data has developed. A number of AI companies, including OpenAI, Google, and Meta, have paid hundreds of millions of dollars to obtain licenses to reproduce copyrighted material for LLM training. These include deals with Axel Springer, News Corporation, the Associated Press, and others. Furthermore, absent Anthropic’s large-scale copyright infringement, blanket licensing practices would be possible through clearinghouses like the Copyright Clearance Center, which recently launched a collective licensing mechanism that is available on the market today.
Anthropic, however, has chosen to use Plaintiffs’ works and the works owned by the Class free of charge, and in doing so has harmed the market for the copyrighted works by depriving their owners of book sales and licensing revenue.
If you want to enable more lawsuits against large AI companies for data laundering, advocate for transparency. You can make the case for a MILD standard, as my collaborators and I are doing in Europe.