OpenAI Is Being Sued For Using Stolen Data To Train Their AI Models

Taimur

3 years ago

The Clarkson Law Firm filed a complaint in court, saying that ChatGPT and Dall-E are using people’s private information without their permission. The complaint alleges that ChatGPT and Dall-E “use stolen private information, including personally identifiable information, from hundreds of millions of internet users, including children of all ages, without their informed consent or knowledge.”

OpenAI collected a lot of words from the internet, including personal information from sites like Twitter and Reddit, to train their language model.

The law firm claims that OpenAI did this secretly and without following the required rules, saying the company “did so in secret, and without registering as a data broker as it was required to do under applicable law.” OpenAI has faced criticism for how it collects data for ChatGPT and for not giving users a clear way to opt out of having their conversations and personal information used.

Italy even banned ChatGPT because it didn’t do enough to protect user data, especially that of minors. This lawsuit focuses on OpenAI’s privacy policies for current users and on data that was collected from the web without people knowing it would be used in ChatGPT.

OpenAI has made money from this data through investments and subscriptions, but they haven’t paid the people whose data they used.

The complaint has 15 charges, including violating people’s privacy, not taking care to protect personal data, and stealing a lot of personal data to train their models. Some datasets, like Common Crawl, Wikipedia, and Reddit, have personal information that is available to the public, but companies need to follow the right rules to buy and use that data.

OpenAI allegedly used this data without getting permission from the users, specifically for ChatGPT. Even though people’s personal information is public on social media, blogs, and articles, using that data outside of its intended platform can be seen as a violation of privacy.

In Europe, the GDPR distinguishes between data that is in the public domain and data that can be used freely. But in the US, this is still being debated.

Nader Henein, a privacy research VP at Gartner who thinks the sentiment of the lawsuit is valid, said, “People should have control as to how their data is used, even when it is available in the public domain.” But Henein is unsure if the US legal system would agree.

Ryan Clarkson, the managing partner of Clarkson Law Firm, said in a blog post that it’s important to take action now with existing laws instead of waiting for the government to make new regulations.

“We cannot afford to pay the cost of negative outcomes with AI like we’ve done with social media, or like we did with nuclear. As a society, the price we would all pay is far too steep.”
