Machine learning experts predict that AI models trained on AI-generated content will soon degrade into producing junk. A recent paper by British and Canadian researchers examines the consequences of training successive generations of AI models on each other's output. The researchers coined the term “model collapse” to describe the phenomenon, in which generated text and images lose intelligibility over successive generations.
The rapid growth of AI-generated content on the internet poses a significant challenge to the quality of future training data. Large Language Models (LLMs) such as OpenAI's ChatGPT have been trained predominantly on human-generated data from the web. But as AI-generated content becomes more prevalent, future LLMs will increasingly learn from this material, setting up an escalating cycle of errors and nonsensical output.
As generations of AI models are trained on flawed predecessors, the distinction between fact and fiction blurs. The researchers explain that models can start misinterpreting what they perceive as reality, reinforcing their own errors in the process. They illustrate the problem with a Mozart-and-Salieri analogy: a model trained on Mozart's music produces a somewhat less brilliant, Salieri-like imitation; train the next model on that imitation, and each successive generation degrades the quality further until the output is unrecognizable.
Dr Ilia Shumailov, the lead author of the paper, says the issue lies in how a model estimates probabilities once it is trained on earlier models' output. Improbable events become less likely to appear in the model's output, so the rare “tails” of the data distribution fade, narrowing the range of possibilities for the next generation of models trained on that output.
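This tail-loss effect can be seen even without language models. The Python sketch below is not the paper's experiment, only a minimal self-contained illustration: each “model” is just a Gaussian fitted to the previous generation's samples, with the assumption (made here to keep the demo simple and deterministic) that it under-represents rare events by dropping the rarest 10%. The spread of the data collapses toward the mean within a handful of generations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data, a standard Gaussian with genuine tail events.
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

for generation in range(1, 11):
    # Crude stand-in for a generative model: fit the mean and spread, but
    # under-represent improbable events by discarding the rarest 10%.
    # (Real models smooth over tails; this trimming is an assumption.)
    lo, hi = np.percentile(data, [5, 95])
    kept = data[(data >= lo) & (data <= hi)]
    mu, sigma = kept.mean(), kept.std()
    # The next generation is trained only on this model's own samples.
    data = rng.normal(loc=mu, scale=sigma, size=10_000)
    print(f"generation {generation:2d}: std = {sigma:.3f}")
```

Each round, the fitted spread shrinks by a roughly constant factor, so by generation ten almost all the original variety is gone; this is the narrowing of possibilities the paper describes.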
The paper provides an example in which a human-written passage about medieval architecture was fed through successive generations of an AI language model. By the ninth generation, the text had deteriorated into nonsensical babbling about jackrabbits. The researchers compare this decline to environmental pollution: the internet gradually fills with meaningless content that contaminates future training data.
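A toy text version of the same feedback loop can be simulated in a few lines. The sketch below is a drastic simplification of the paper's setup, which fine-tuned real language models; here the “model” is only a bigram table, and the seed text and function names are illustrative. Retraining each generation on the previous generation's output steadily shrinks the vocabulary the model can produce.

```python
import random
from collections import defaultdict

random.seed(0)

# Illustrative seed corpus (stand-in for the human-written passage).
SEED_TEXT = (
    "medieval architecture combined stone towers vaulted halls and "
    "pointed arches while masons carved gargoyles and glaziers set "
    "stained glass into tall narrow windows"
).split()

def train(words):
    """Fit a bigram table: map each word to the words seen after it."""
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, length=30):
    """Sample a word sequence by walking the bigram table."""
    word = random.choice(list(table))
    out = [word]
    for _ in range(length - 1):
        nxt = table.get(word)
        if not nxt:                 # dead end: restart from any known word
            nxt = list(table)
        word = random.choice(nxt)
        out.append(word)
    return out

words = SEED_TEXT
for generation in range(1, 10):
    model = train(words)
    words = generate(model)         # becomes the next generation's training data
    print(f"generation {generation}: {len(set(words))} distinct words")
```

Bigrams that happen not to be sampled in one generation vanish from the next model entirely, so diversity can only decay: a miniature analogue of nine generations of architecture prose collapsing into jackrabbits.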
AI-generated content is already widespread online, with automated news sites and chatbot-driven marketing and PR agencies on the rise. Even so, human writers should not rest easy on the strength of these findings: while human-generated data is valuable precisely for its natural variation and improbable results, the paper does not show it to be an absolute requirement for training AIs.
In conclusion, as AI-generated content proliferates and future AI models are trained on flawed predecessors, the risk of producing junk content increases. The challenges posed by model collapse highlight the need for careful consideration and ongoing research to ensure the integrity and reliability of AI-generated content in the future.