A Peek Behind the Curtain! OpenAI Allows Inspection of Training Data by Copyright Case Plaintiffs.
OpenAI's legal cases explained, plus the implications for the wider world of AI as legal precedent forms in real time.
Hello and welcome to Byte Sized, where we do short and sweet breakdowns of industry trends and news, every week. Let’s get into it.
OpenAI’s Lawsuits Explained, and a Peek Behind the Curtain
OpenAI, a key player in the modern AI landscape, has found itself at the center of an impressive legal portfolio. That is to say, lawsuits. The nature of these lawsuits varies from case to case; today, however, we’re going to look at a specific subset: OpenAI’s turbulent relationship with copyright. For reference, the relevant cases that OpenAI is currently involved in, in no particular order, are as follows:
New York Times v. OpenAI: The New York Times alleges that OpenAI and Microsoft used New York Times articles in the training of GPT Large Language Models.
Alden Global Capital et al. v. OpenAI LP et al.: the Chicago Tribune, New York Daily News, Denver Post, Orlando Sentinel, Sun Sentinel, San Jose Mercury News, Orange County Register, and St. Paul Pioneer Press allege that OpenAI and Microsoft used their articles in the training of proprietary models.
Authors Guild et al. v. OpenAI, Inc. et al.: prominent authors, including George R.R. Martin, John Grisham, and Jodi Picoult, allege that OpenAI used their work to train proprietary models.
Each of these noteworthy cases carries its own nuances and complexities. For example, the New York Times case against OpenAI hinges on a “theory of romantic authorship”, to quote Harvard Law Review. OpenAI’s models are trained on massive corpora of data, through which the model learns statistical relationships between terms in order to predict tokens. Ultimately, the Times must show that OpenAI actually used New York Times articles in its training data.
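To make the “statistical relationships between terms” idea concrete, here is a deliberately toy sketch: a bigram counter that learns which word tends to follow another in a tiny corpus, then predicts the most frequent successor. This is an illustration only; GPT-scale models use neural networks over billions of tokens, and nothing here reflects OpenAI’s actual training pipeline.

```python
from collections import Counter, defaultdict

# Toy "training corpus" (made up for illustration).
corpus = "the times reported the story and the times published a story".split()

# Count, for each token, which tokens follow it and how often.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the statistically most frequent successor of `token`."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "times": it follows "the" twice, "story" only once
```

A real language model does something conceptually similar, but learns far richer statistical structure, which is exactly why the legal question of what ends up “inside” the model is so hard to pin down.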
However, according to a Harvard Law Review essay on the subject, the New York Times has been in a familiar case before, on the other side of the courtroom. To quote Audrey Pope’s NYT v. OpenAI: The Times’s About-Face:
“The Times has been here before — not as plaintiff but as defendant. Less than three decades ago, in New York Times Co. v. Tasini, the publisher fought against a group of its own freelance authors… The Times (…) argued— though the Tasini Court disagreed — that technological development would be irrevocably thwarted by a win for the freelancers. But today, the Times is entirely unsympathetic to the analogous threat its suit poses to AI companies.”
So, the New York Times may not be entirely consistent in its position; the case is an important one nonetheless.
As for the Alden Global suit, the premise is roughly the same. As put by Frank Pine, executive editor for the MediaNews Group and Tribune Publishing:
“We’ve spent billions of dollars gathering information and reporting news at our publications, and we can’t allow OpenAI and Microsoft to expand the Big Tech playbook of stealing our work to build their own businesses at our expense.”
Of course, this allegation is only just that, and while these cases play out, OpenAI remains free from liability for damages. Still, it raises an elusive question: if copyrighted content finds its way into a model’s training data, is that the same as copyright infringement? If a model plainly reproduces a New York Times article as a completion, the case seems cut and dried. But if only bits and pieces of an article such as this one find their way into a model’s response… does that count?
This question and others sit at the epicenter of a burgeoning philosophical dispute. We’ve talked about this dispute before here on Byte Sized, since OpenAI isn’t the only company facing legal trouble. It remains an open question, and legal disputes like these are often how such questions get resolved. All of them, though, hinge on whether copyrighted intellectual property was actually used during training.
Recently, an OpenAI staffer named Suchir Balaji blew the whistle on potentially improper practices at the company. Balaji chose to leave OpenAI, citing ethical concerns, and claimed that OpenAI’s training practices did include the scraping of copyrighted data. According to Balaji, in 2020 OpenAI used its status as a research firm to justify collecting and organizing such data, given the relatively lax rules governing research use. As the company pivoted toward a for-profit model, however, those practices became increasingly questionable.
For now, these claims remain just that: claims. That may be changing soon. The last case I mentioned, Authors Guild et al. v. OpenAI, Inc. et al., has taken an interesting turn: for the first time, OpenAI will allow the plaintiffs to inspect its training data. This is a pivotal move, as OpenAI’s training dataset has become one of the industry’s most closely guarded secrets as companies move further from research toward commercialization. The inspection should, presumably, allow a definitive answer to the plaintiffs’ claims.
While we have no details yet, and it remains to be seen what information will be made publicly available, this brings us one step closer to establishing precedent on AI and copyright. It may not seem consequential for the average consumer, but these questions affect end users too. Indirectly, trouble for OpenAI could mean trouble for OpenAI customers. More directly, if your use of a model leads you to infringe copyright, you could find yourself in the middle of your own lawsuit.
The legal landscape has simply yet to be explored. I will, however, continue to cover these stories as they develop.
That’s all, folks
That’s all for today! If you haven’t already, check out last week’s post on Firefox’s major exploit and the FTC’s new “click to cancel” rule. Today’s story is an interesting one to me because it poses some crucial philosophical questions. Is model training more akin to inspiration than to copying? Is it any different from a human seeing a painting and attempting to replicate it? I’d love to hear your thoughts below.
With that said, I’m going to call it here. As always, thank you so much for reading, and I’ll see you next time. Goodbye!
Credits
Thumbnail
Bilal Azhar at Intelligence Imaginarium
Music
Track - Sunset by Lukrembo, Source - https://freetouse.com/music, Free Music No Copyright (Safe)


