
Elon Musk Agrees That We’ve Exhausted AI Training Data

Elon Musk has joined the growing consensus among AI experts that the data available for training AI models has reached its limit. In a live-streamed conversation on X with Stagwell chairman Mark Penn, Musk said that we have now "basically exhausted the cumulative sum of human knowledge … in AI training," adding, "Last year that happened."

Musk, founder of the AI company xAI, was echoing former OpenAI chief scientist Ilya Sutskever, who described the same phenomenon as "peak data" at the NeurIPS conference in December. With no new, untapped real-world data left, the industry is shifting toward synthetic data, that is, data created by AI itself. Musk said the only way to supplement real-world data is with synthetic data, in which the AI generates its own training material. He also suggested that AI could learn to grade its own synthetic outputs.
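The self-grading idea Musk describes amounts to a generate-then-filter loop: produce candidate training examples, score them, and keep only those that pass. A minimal toy sketch of that loop is below; the generator and grader here are hypothetical stand-ins (simple arithmetic checks), not any real model API:

```python
import random

def generate_candidates(n, seed=0):
    # Toy stand-in for an AI generator: produce candidate "training
    # examples" (arithmetic Q/A pairs) instead of calling a real model.
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        # Deliberately inject some wrong answers to mimic flawed generations.
        answer = a + b if rng.random() > 0.3 else a + b + rng.randint(1, 5)
        candidates.append({"question": f"{a}+{b}", "answer": answer})
    return candidates

def grade(example):
    # Toy "self-grading" step: 1.0 if the answer checks out, else 0.0.
    # A real pipeline would use a model or verifier as the grader.
    a, b = map(int, example["question"].split("+"))
    return 1.0 if example["answer"] == a + b else 0.0

def filter_synthetic(candidates, threshold=0.5):
    # Core filtering loop of a self-graded synthetic-data pipeline:
    # keep only candidates whose grade clears the threshold.
    return [c for c in candidates if grade(c) >= threshold]

kept = filter_synthetic(generate_candidates(100))
print(f"kept {len(kept)} of 100 candidates")
```

The design choice worth noting is that the grader, not the generator, determines data quality; a weak grader lets flawed examples through, which is one route to the "model collapse" risk discussed below.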

Several tech giants, including Microsoft, Meta, OpenAI, and Anthropic, already use synthetic data to train advanced AI models. Microsoft's Phi-4 and Google's Gemma models, for example, were trained partly on synthetic data, and AI-generated data also helped train Anthropic's Claude 3.5 Sonnet and Meta's Llama series.

One advantage of synthetic data is cost efficiency. The AI startup Writer built its Palmyra X 004 model almost entirely on synthetic data for $700,000, compared to $4.6 million for a similarly scaled OpenAI model. The approach has drawbacks, however. Research suggests synthetic data can cause "model collapse," where a model becomes less creative and more biased over time. Flawed synthetic data can lead models trained on it to perpetuate and amplify those flaws, rendering the models unreliable and ineffective. Navigating these trade-offs will be a key challenge as AI evolves, if the industry is to build balanced, innovative, and trustworthy systems.
