AI companies are facing a looming challenge: they’re running out of training data. As they strive to develop larger and more advanced models, the internet’s vast resources are becoming insufficient to meet their needs.
The Wall Street Journal has highlighted this issue, noting that companies are now exploring alternative data sources such as publicly available video transcripts and even AI-generated “synthetic data.” DatologyAI, led by ex-Meta and Google DeepMind researcher Ari Morcos, is among those working on ways to train models more efficiently with less data and fewer resources. However, many major companies are turning to unconventional and controversial methods of sourcing training data.
For example, OpenAI is reportedly considering training its upcoming GPT-5 model on transcriptions from public YouTube videos. This comes as the company’s chief technology officer, Mira Murati, faces questions about whether its Sora video generator was trained using YouTube data. Synthetic data has also drawn attention, with researchers warning of “model collapse,” also dubbed “Habsburg AI,” in which models trained on AI-generated data degrade in quality.
In response to these concerns, companies like OpenAI and Anthropic, which was founded by former OpenAI employees with a focus on AI safety, are developing supposedly higher-quality synthetic data. Anthropic’s Claude 3 LLM, for instance, was trained on “data we generate internally,” according to the company’s announcement. Despite the potential of synthetic data, the details of how these companies produce it remain closely guarded.
While some researchers predict that AI could run out of usable training data within the next few years, others like Pablo Villalobos of Epoch remain optimistic. Villalobos suggests that breakthroughs in AI could address this challenge, emphasizing that there’s no need for panic.
However, an alternative solution exists: AI companies could reconsider their pursuit of ever-larger models. This approach would not only mitigate the training data shortage but also reduce the environmental impact of AI, which consumes significant electricity and relies on rare-earth minerals for computing chips.