AI companies typically build their models on publicly available content, ranging from YouTube videos to newspaper articles. But many of the hosts of that content are now imposing restrictions, a shift that could make AI models less effective, according to a new study by the Massachusetts Institute of Technology’s Data Provenance Initiative.
The researchers audited 14,000 web domains whose content appears in prominent AI training datasets. They found that approximately 28 percent “of the most actively maintained, critical sources” on the internet are now “fully restricted from use,” as website administrators impose increasingly stringent limits on how web crawler bots can scrape their content.
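Many of these limits are expressed through a site’s robots.txt file, which tells crawlers which user agents may fetch which pages. As a rough illustration only (not part of the study’s methodology), the Python sketch below checks whether a given crawler is permitted; “GPTBot” is OpenAI’s published crawler name, and the example URL is hypothetical.

```python
# Minimal sketch: check whether a site's robots.txt allows a given AI crawler.
# Assumptions: "GPTBot" is used as an example crawler user agent; the URL is
# a placeholder and not drawn from the study.
from urllib import robotparser


def crawler_allowed(site: str, user_agent: str, path: str = "/") -> bool:
    """Return True if robots.txt at `site` permits `user_agent` to fetch `path`."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()  # fetch and parse the live robots.txt
    return parser.can_fetch(user_agent, f"{site.rstrip('/')}{path}")


if __name__ == "__main__":
    # Many news sites now disallow AI crawlers outright in robots.txt.
    print(crawler_allowed("https://example.com", "GPTBot"))
```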
“If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems,” the researchers write. It’s understandable that content hosts would put restrictions on their now-valuable data caches. AI companies have taken publicly available material, much of it copyrighted, and used it to generate revenue without permission, upsetting many stakeholders, from The New York Times to celebrities like Sarah Silverman.
What’s particularly troubling is that figures like OpenAI CTO Mira Murati have suggested some creative jobs should disappear, even though those jobs produce the very content powering models like OpenAI’s ChatGPT. That perceived arrogance and the resulting backlash have fueled what the study’s researchers term a “consent in crisis.” The once-open internet is becoming increasingly restricted, leaving AI models more biased, less diverse, and less fresh.
In response, some companies are attempting to use synthetic data, generated by AI, as a workaround, but it has so far proven a poor substitute for original human-produced content. Others, like OpenAI, have struck deals with media companies, though these agreements have raised concerns of their own, because the goals of tech companies and media outlets often conflict.
As the landscape evolves, stockpiles of training data are becoming scarcer, and more valuable, than ever. Time will tell how these changes will affect the development and effectiveness of AI models. One thing is clear: the era of freely available training data may be coming to an end.