ChatGPT Is About To Face Some Copyright Issues After ‘Memorizing’ These Books

Jannat Un Nisa

3 years ago

Researchers at the University of California, Berkeley, have found evidence that OpenAI’s ChatGPT and GPT-4 language models have memorized passages from copyrighted books, which suggests those books were part of their training data.

The study, published in a paper titled “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4,” found that the models have memorized a wide range of copyrighted materials, with the level of memorization linked to how often passages from a book appear online. The books the models have memorized include popular titles like The Lord of the Rings, Harry Potter, and The Hunger Games.

The researchers note that science fiction and fantasy titles dominate the list, which they attribute to the popularity of these titles on the web. However, this has resulted in the models exhibiting less knowledge of works in other genres.

The study also highlights the importance of responsible data curation in machine learning, with the authors advocating for the use of public training data to increase model transparency.

The researchers used a “name cloze” test, in which the model is asked to fill in a single masked name in a passage of 40-60 tokens that contains no other named entities. Passing the test indicates that the model has memorized the associated text. However, the authors explain that “the question of whether [the books] truly exist within the training data of these models is not answerable.”
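To make the idea of a name cloze test concrete, here is a minimal Python sketch. It is not the researchers’ code: the `[MASK]` placeholder, the function names `make_cloze` and `cloze_accuracy`, and the `predict` callback (a stand-in for a call to a language model) are all assumptions made for illustration. It shows only the general shape of the test: hide the one name in a passage, ask for a guess, and count exact matches.

```python
import re

MASK = "[MASK]"  # hypothetical placeholder; the paper's exact prompt format is not given here


def make_cloze(passage: str, name: str) -> str:
    """Hide the single occurrence of `name` in `passage` behind a mask."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, MASK, passage)


def cloze_accuracy(items, predict) -> float:
    """Fraction of (passage, name) items for which `predict` recovers the name.

    `predict` is a hypothetical callback that takes the masked passage
    and returns the model's guess as a string.
    """
    if not items:
        return 0.0
    correct = 0
    for passage, name in items:
        guess = predict(make_cloze(passage, name))
        if guess.strip() == name:
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    # Invented passages for illustration only; they are not from any book.
    items = [
        ("The door opened and Alice stepped into the hall.", "Alice"),
        ("After a long pause, Bob put down the letter.", "Bob"),
    ]
    # A dummy predictor that always answers "Alice" stands in for a real model.
    print(cloze_accuracy(items, lambda masked: "Alice"))
```

In a real experiment the passages would come from the books being tested and `predict` would query the model. A high accuracy on passages from one book would then be read as a sign that the model has memorized text from it.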

As the study’s co-author David Bamman explains, “Takeaways: open models are good; popular texts are probably not good barometers of model performance; with the bias toward sci-fi/fantasy, we should be thinking about whose narrative experiences are encoded in these models, and how that influences other behaviors.”

According to Tyler Ochoa, a law professor at Santa Clara University, there are three primary copyright issues related to AI-generated text. The first issue concerns whether copying a significant amount of text or images for model training falls under fair use. Ochoa believes that it is likely to be considered fair use.

The second issue is whether the AI-generated output is too similar to the input data, which the paper refers to as “memorization.” Ochoa asserts that such similarity could constitute copyright infringement. Finally, the third issue is whether AI-generated text that is not a copy of an existing work can be protected by copyright.

Ochoa notes that in the United States, human creativity is currently required for copyright protection. However, activities such as modifying and arranging AI model output can make copyright protection more plausible.

Ochoa believes that lawsuits against companies such as OpenAI and Google, which develop large language models that generate text, are likely to arise in the future.

“So far, we’ve seen lawsuits over issues one and three,” said Ochoa. “Issue one lawsuit so far has involved AI image-generating models, but lawsuits against AI text-generating models are inevitable. We have not yet seen any lawsuits involving issue two. The paper demonstrates that such similarity is possible, and in my opinion, when that occurs, there will be lawsuits, and it will almost certainly constitute copyright infringement.”

As AI-generated text becomes more prevalent, the copyright questions surrounding it will become increasingly important. The research emphasizes the need for transparency in model behavior and the careful documentation of data to advance responsible data curation in machine learning.
