Meta is currently under legal fire for allegedly using millions of pirated books to train its AI systems. The lawsuit, Richard Kadrey et al. v. Meta Platforms, features an impressive roster of authors, including literary heavyweights Andrew Sean Greer and Ta-Nehisi Coates, challenging the company’s controversial data practices. At the heart of the legal battle is a troubling contradiction: Meta is defending its actions by devaluing the very content it relied on to build its multibillion-dollar AI empire.
Meta’s central defense hinges on the claim that its use of over seven million books sourced from LibGen, a pirated online library, falls under “fair use.” According to the company, this wholesale ingestion of copyrighted works isn’t illegal because, it contends, each individual book contributes only negligibly to the overall performance of its language models.

In support of this, Meta brought forward an expert who claimed that a single book altered the AI model’s performance “by less than 0.06 percent on industry standard benchmarks”—a statistical ripple so small, it was deemed “no different from noise.” On this basis, Meta argued that there’s no economic value in the individual works, and thus no market for compensating authors: “For there to be a market, there must be something of value to exchange,” the company reasoned, “but none of [the authors’] works has economic value, individually, as training data.”
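To make the expert’s “noise” claim concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is an illustrative assumption (the benchmark size, the baseline score), not a figure from the filings; only the sub-0.06-percent delta echoes the reported testimony:

```python
# Illustrative sketch (not from the court filings): why a tiny per-book
# performance delta can be statistically indistinguishable from noise.
# All numbers below are assumptions chosen for illustration.

import math

num_questions = 1_000        # assumed size of a benchmark test set
baseline_accuracy = 0.70     # assumed model score without the book
per_book_delta = 0.0006      # the "<0.06 percent" figure, as a fraction

# Standard error of an accuracy score estimated from n binary outcomes:
# sqrt(p * (1 - p) / n). This is the scale of run-to-run sampling noise.
p = baseline_accuracy
sampling_noise = math.sqrt(p * (1 - p) / num_questions)

print(f"per-book delta: {per_book_delta:.4f}")                    # 0.0006
print(f"sampling noise: {sampling_noise:.4f}")                    # ~0.0145
print(f"delta / noise:  {per_book_delta / sampling_noise:.2f}")   # ~0.04

# Under these assumptions, one book's effect is a small fraction of the
# benchmark's own sampling error -- hence "no different from noise."
# The sketch says nothing about the cumulative effect of ~7 million books.
```

Under those assumptions, the per-book effect is a small fraction of the benchmark’s own measurement error, which is the statistical sense in which the expert could call it noise. What the argument leaves out is the cumulative effect of seven million such books, and that gap is exactly what the plaintiffs seize on.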
But this argument raises eyebrows. How can books be simultaneously too insignificant to warrant compensation, yet critical enough to train AI systems that power some of the most profitable tech products in the world?
Vanity Fair’s coverage sheds more light on this doublethink, highlighting internal Meta communications that show not only an awareness of the legal gray area but also a concerted effort to keep the practice quiet. Employees are said to have deliberately removed copyright pages from the pirated books and shared informal justifications like, “I didn’t ask questions, but this is what OpenAI does with GPT-3, what Google does with PALM, and what DeepMind does with Chinchilla, so we will do it too.”

This reveals a broader pattern of tacit agreement among top AI firms: an unspoken industry norm under which copyrighted, often pirated, material is harvested en masse while legal and ethical concerns are swept under the rug. A confidential Meta presentation stated bluntly: “In no case would we disclose publicly that we had trained on LibGen.” It even warned that public exposure could “undermine our negotiating position with regulators.”
Meta’s argument is part of a larger industry narrative that wavers between dismissing the creative value of human-generated content and portraying it as indispensable for AI progress. OpenAI once argued to the UK Parliament that there simply isn’t enough public-domain material to train its models adequately, and that it must therefore mine copyrighted content without paying for it.