Data has emerged as a vital resource as AI transforms the digital landscape with advanced models and algorithms. Large language models like OpenAI’s ChatGPT rely on vast volumes of data, intensifying demand in the data analytics industry, which was valued at $41.05 billion in 2022 and is expected to reach $279.21 billion by 2030, an annual growth rate of 27.3%. Yet ethical concerns about data collection are gaining attention, notably from former OpenAI researcher Suchir Balaji, who warns that OpenAI’s practices may be “destroying” the internet and violating copyright law.
Balaji left OpenAI in August 2024 over ethical concerns. According to his LinkedIn profile, his work spanned several key areas, including post-training for ChatGPT, reasoning algorithms, and reinforcement learning. As OpenAI’s data-driven generative AI (GenAI) models evolved, Balaji grew increasingly troubled by the broader impact of these technologies on digital ecosystems and user-generated content. He told The New York Times, “If you believe what I believe, you have to just leave the company.”
Balaji went on to publish an article on his website explaining his views. He argued that GenAI, particularly OpenAI’s models, erodes the foundations of the internet, especially within open-source communities, where participation is dwindling as users turn to AI for answers instead of to other people. A published AI researcher with over 8,000 citations, Balaji is well versed in the technical intricacies of AI. His post, titled “When Does Generative AI Qualify for Fair Use?”, asserts that GenAI tools fall short of fair use standards because they reformat and alter existing content without truly transforming it.
Balaji further argues that GenAI poses a risk to content creators by substituting for human work and reducing the supply of original data available for future AI training. This shortage, he explains, contributes to errors in large language models (LLMs), which can “hallucinate” and output fabricated or nonsensical information. Balaji asserts that no fair use justification can adequately defend models like ChatGPT, especially given the financial impact on creators.
“This is not a sustainable model for the internet ecosystem as a whole,” he stated in his New York Times interview.
In response, OpenAI firmly rejected Balaji’s criticisms, saying, “We build our AI models using publicly available data, in a manner protected by fair use and related principles, and supported by longstanding and widely accepted legal precedents.” The company emphasized that its approach to data use is fair, fostering innovation and competitiveness in the U.S. market.
Despite OpenAI’s stance, the company faces several legal disputes, including a significant lawsuit from The New York Times, which accuses OpenAI and its partner Microsoft of exploiting the newspaper’s content without compensation. The complaint alleges that these companies “seek to free-ride on The Times’s massive investment in its journalism” by creating AI-driven products that draw audiences away from the original publisher.
Additionally, other news outlets, along with YouTube creators, actors, authors, and the Center for Investigative Reporting, have filed similar copyright infringement suits against OpenAI.
Intellectual property lawyer Bradley J. Hulbert said that existing intellectual property laws lag behind AI developments, noting, “Given that AI is evolving so quickly, it is time for Congress to step in.”