Does OpenAI not know what data Sora is trained with?

31 March 2024 babelfish

OpenAI's chief technology officer, Mira Murati, told the Wall Street Journal that she isn't sure what data is used to train Sora, the artificial intelligence application that creates astonishing videos from just a few lines of text. In Italy, at the beginning of the month, the data protection authority, the Privacy Guarantor, opened an investigation. All the details.

When directors and videomakers saw what Sora could do, they were left disconcerted and, understandably, began to fear for their jobs. Sora is OpenAI's artificial intelligence (AI) tool that creates "realistic" and "imaginative" videos of about a minute from a simple text prompt that anyone can type.

But how can Sora be so “good”? It is all thanks to the data it learns from, that is, the online text, images and videos on which it is trained. The same goes for the material fed to ChatGPT and other AI chatbots, which take a text input and elaborate (or re-elaborate?) what others have written (remember the case of the New York Times, which sued OpenAI for copyright infringement).

However, all this lovely data that feeds the AI must be paid for if it belongs to someone and is therefore not public. So the Wall Street Journal asked Mira Murati, chief technology officer (CTO) of OpenAI (and CEO for a couple of days when Sam Altman was removed), where the data used to train Sora comes from, but her answers were far from clear…

WHAT OPENAI DOES (NOT) SAY ABOUT SORA

In an interview that some have described as "cringe", i.e. one that makes viewers feel both embarrassed and uncomfortable, Murati initially stated that the data underlying Sora is "data available to the public and licensed data".

But when the WSJ journalist asked her whether this also included videos from YouTube, Facebook or Instagram, Murati, visibly embarrassed, said she was "not sure" and then declined further questions aimed at delving deeper into the issue.

“I won't go into detail about the data used but it was publicly available or licensed data,” she reiterated. As for the image and video platform Shutterstock, with which OpenAI has an agreement, Murati confirmed only after the interview that its content was also among the licensed data.

Me: What data was used to train Sora? YouTube videos?
OpenAI CTO: I'm actually not sure about that…

(I really do encourage you to watch the full @WSJ interview where Murati did answer a lot of the biggest questions about Sora. Full interview, ironically, on YouTube:… pic.twitter.com/51O8Wyt53c

— Joanna Stern (@JoannaStern) March 14, 2024

LEGAL ACTIONS AGAINST OPENAI

Murati's reticence (or ignorance?) regarding data could be a way to avoid further copyright litigation. In fact, OpenAI is at the center of several legal actions over the training data of its AI models.

In late June 2023, a class action lawsuit was filed in California against the company for allegedly secretly collecting “massive amounts of personal data from the Internet” without asking for consent.

The following month, authors Sarah Silverman, Richard Kadrey, and Christopher Golden sued OpenAI and Meta for copyright infringement. The lawsuits allege that OpenAI's ChatGPT and Meta's LLaMA were trained on “illegally acquired datasets containing their works.”

And in December 2023 the New York Times sued Microsoft and OpenAI in a similar copyright complaint, alleging that the companies used its articles to train their chatbots.

WHO WILL WIN?

The outcome of the legal proceedings remains uncertain. However, as the lawyer Laura Turini has pointed out several times in the Appunti di Stefano Feltri newsletter, it seems unlikely that OpenAI will lose, because copyright is slippery ground in the field of artificial intelligence.

Meanwhile, on March 8, the Privacy Guarantor opened an investigation into the company over the possible implications that Sora could have for the processing of personal data of users located in the European Union, and in Italy in particular.

Within 20 days OpenAI will have to specify whether the AI video generator will be offered to EU users and clarify several points: how the algorithm is trained; what data is collected and processed to train it, especially if it is personal data; whether this includes special categories of data (religious or philosophical beliefs, political opinions, genetic data, health, sex life); and what sources are used.

This time, OpenAI would perhaps do well to answer more precisely…

This is a machine translation from Italian of a post published on Start Magazine at the URL https://www.startmag.it/innovazione/da-openai-non-sanno-con-quali-dati-viene-addestrato-sora/ on Sun, 31 Mar 2024 05:43:38 +0000.