AI In The News

Uncle Pete · December 27, 2023, 19:10

https://www.msn.com/en-us/money/companies/ny-times-sues-openai-microsoft-for-infringing-copyrighted-works/ar-AA1m75sX?ocid=00000000&pc=U528&cvid=36cdc2530ee347549955a4670eb08328&ei=17

NEW YORK (Reuters) - The New York Times sued OpenAI and Microsoft on Wednesday, accusing them of using millions of the newspaper's articles without permission to help train chatbots to provide information to readers.

The newspaper's complaint, filed in Manhattan federal court, accused OpenAI and Microsoft of trying to "free-ride on The Times's massive investment in its journalism" by using it to provide alternative means to deliver information to readers.

"There is nothing 'transformative' about using The Times's content without payment to create products that substitute for The Times and steal audiences away from it," the Times said.

The case is New York Times Co v Microsoft Corp et al, U.S. District Court, Southern District of New York, No. 23-11195.

cascoly · December 27, 2023, 21:04

you beat me to it - i was about to post this too -- looks like a major case that may lead to a legal conclusion, though I distrust the ability of knowledge-deficient judges to really understand the issues involved (much like the copyright office's decision)

i've been reading Wolfram (of Mathematica fame) on how ChatGPT works (available on Kindle Unlimited for free)

https://www.amazon.com/What-ChatGPT-Doing-Does-Work-ebook/dp/B0BY59PT5Z

it gets complicated quickly and he explains why there is no connection between training and generation since the process from source to dataset is not commutative.

like MJ et al., the data used for training is massive (even the NYTimes' huge content is dwarfed by several orders of magnitude). it should come down to whether the scraping amounts to fair use.

however, there are some significant differences from AI-gen images, as the Times alleges wholesale reproduction of significant quantities of text by ChatGPT - something no one has been able to show re AI-image generation.

danielstassen · December 27, 2023, 23:15

Very interesting topic thank you for sharing.

Uncle Pete · December 28, 2023, 16:50

Quote from: cascoly on December 27, 2023, 21:04

Yes and I agree, there is no connection between training and generation since the process from source to dataset is not commutative but still these cases have to be decided. When reading more of the background, the fair use has already been decided in the past. NYT is trying to make the same claim that has already been defeated. A second bite at the same arguments.

I understand their point: copying and using is not transformative if the original data is then repeated. NYT says bits of information are traceable directly to their articles and publications.

In past cases, like the one from photographers in CA, the claimants have not been able to show direct copying and use in the output. They need to prove a connection from the training data, directly to the output results. That hasn't happened yet.

cascoly · December 29, 2023, 20:57

some further commentary on the case - turns out the copied text was not a random article but cherry-picked:

AI #44: Copyright Confrontation
Zvi Mowshowitz newsletter

The New York Times has thrown down the gauntlet, suing OpenAI and Microsoft for copyright infringement. Others are complaining about recreated images in the otherwise deeply awesome MidJourney v6.0. As is usually the case, the critics misunderstand the technology involved, complain about infringements that inflict no substantial damages, engineer many of the complaints being made and make cringeworthy accusations.

That does not, however, mean that The New York Times case is baseless. There are still very real copyright issues at the heart of Generative AI. This suit is a serious effort by top lawyers. It has strong legal merit. They are likely to win if the case is not settled.
...
In a handful of famous cases, there seems to be an exception. Exactly as in the MidJourney examples, why are we seeing NYT article text almost exactly (but not quite) copied anyway in some cases? Because it is iconic.

Kevin Bryan: NYT/OpenAI lawsuit completely misunderstands how LLMs work, and judges getting this wrong will do huge damage to AI. Basic point: LLMs DON'T "STORE" UNDERLYING TRAINING TEXT. It is impossible- the parameter size of GPT-3.5 or 4 is not enough to losslessly encode the training set.

Ok, now let's see NYT examples. Here GPT spits out almost perfectly the opening paragraphs of a "snow fall" article from 2012. But this text is all over the internet - super famous article! That's why GPT's posterior predictions given the previous article paragraph are so good.

Likewise, in the famous Guy Fieri Times Square review, GPT repeats almost perfectly whole paragraphs. But these paragraphs have also been repeated dozens of times across the internet! That's why the LLM posterior probability next word distribution picks them up.

In practice, one can think of this as ChatGPT committing copyright infringement if and only if everyone else is committing copyright infringement on that exact same passage, making it so often duplicated that it learned this is something people reproduce.

My take? OpenAI can't really defend this practice without some heavy changes to the instructions and a whole lot of litigating about how the tech works. It will be smarter to settle than fight
....

>>>

bold text my emphasis -- all caps in original

... much more detail in the newsletter:

For free subscription: Don't Worry About the Vase | Substack

https://thezvi.substack.com/?utm_source=substack&utm_medium=email

A world made of gears. Doing both speed premium short term updates and long term world model building. Currently focused on weekly AI updates. Explorations include AI, policy, rationality, medicine and fertility, education and games.

By Zvi Mowshowitz
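
To make the duplication point from the excerpt above concrete, here is a toy sketch in Python. It is only a word-pair counter, nowhere near a real LLM, and the "passage" and the competing texts are made-up placeholders, not NYT text. It only illustrates the frequency effect: the same passage is reproduced when it is duplicated many times in the training data, and is not reproduced when it appears once among more common continuations.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, how often each next word follows it."""
    counts = defaultdict(Counter)
    for doc in corpus:
        words = doc.split()
        for cur, nxt in zip(words, words[1:]):
            counts[cur][nxt] += 1
    return counts

def generate(counts, start, max_words=10):
    """Greedy decoding: always pick the most frequent next word."""
    out = [start]
    while len(out) < max_words:
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this word
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Hypothetical placeholder texts (assumptions for illustration only).
famous = "alpha bravo charlie delta echo"
others = ["alpha zulu yankee"] * 3 + ["bravo xray"] * 3

seen_once = train_bigram([famous] + others)        # passage appears once
seen_often = train_bigram([famous] * 20 + others)  # passage copied 20 times

print(generate(seen_once, "alpha"))   # alpha zulu yankee
print(generate(seen_often, "alpha"))  # alpha bravo charlie delta echo
```

Nothing is "stored" as a document in either case, only counts of which word follows which. The passage comes back out only when its word pairs dominate the counts.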

PigsInSpace · December 30, 2023, 17:46

"We know you copied us — you used the word milquetoast in generated articles. No human has ever used that word other than NY Times reporters!"

Uncle Pete · December 31, 2023, 19:11

Is it transformative?

Wikipedia:

The transformative nature of computer based analytical processes such as text mining, web mining and data mining has led many to form the view that such uses would be protected under fair use. This view was substantiated by the rulings of Judge Denny Chin in Authors Guild, Inc. v. Google, Inc., a case involving mass digitisation of millions of books from research library collections. As part of the ruling that found the book digitisation project was fair use, the judge stated "Google Books is also transformative in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas"

Text and data mining was subject to further review in Authors Guild v. HathiTrust, a case derived from the same digitization project mentioned above. Judge Harold Baer, in finding that the defendant's uses were transformative, stated that "the search capabilities of the [HathiTrust Digital Library] have already given rise to new methods of academic inquiry such as text mining."

I'm pointing this out as New York Times is trying to make the same claim as the two above, that already have decisions, in favor of fair use. Sometimes a case like this would be refused and not heard, as it has already been decided in the past.

Uncle Pete · January 31, 2024, 23:31

Under appeal, but I think still interesting. The key phrase is "willfully blind to infringement": if we notify these sites and they do nothing, they can be sued.

Vacating the district court's order granting in part and denying in part Redbubble's motion for judgment as a matter of law, the panel held that a party is liable for contributory infringement when it continues to supply its product to one whom it knows or has reason to know is engaging in trademark infringement. A party meets this standard if it is willfully blind to infringement. Agreeing with other circuits, the panel held that contributory trademark liability requires the defendant to have knowledge of specific infringers or instances of infringement. General knowledge of infringement on the defendant's platform, even of the plaintiff's trademarks, is not enough to show willful blindness. The panel remanded for reconsideration of Redbubble's motion under the correct legal standard.

https://cdn.ca9.uscourts.gov/datastore/opinions/2023/07/24/21-56150.pdf

Uncle Pete · May 01, 2026, 14:57

News of a settlement because the company used scraped materials from pirate sites.

"In June 2025, Judge William Alsup of the U.S. District Court for the Northern District of California ruled on summary judgment that using books without permission to train AI was fair use if they were acquired legally, but he denied Anthropic's request for summary judgment related to piracy—finding that the piracy was not fair use."

That's the summary of the important detail. If the AI training was from legitimate sites, then it's fair use (in this opinion). But if it's from a pirate site, then it's piracy. In order for these authors and publishers to make a claim and receive payment, the work must have been registered.

What that's leading to is that if there's ever a settlement for photos, over stolen materials that were used for AI training, the images must have been registered in advance.

https://authorsguild.org/advocacy/artificial-intelligence/what-authors-need-to-know-about-the-anthropic-settlement/

https://www.wral.com/business/technology/anthropic-settlement-copyright-questions-data-provenance-sept-2025/

From the second article: "Provenance you can audit. Track where all of your data came from—licensed archives, public-domain repositories, creator uploads under clear terms, or lawfully purchased collections. Avoid gray-market mirrors and "misc. web" buckets you can't defend. Courts are paying attention to the acquisition method, not just use."
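
As a rough idea of what "provenance you can audit" could look like in practice, here is a minimal Python sketch. The record fields, the category labels, and the example URLs are all my own assumptions, loosely based on the wording of the quote above. It is not a legal test, just one way to keep track of where each item came from and to flag anything you could not defend.

```python
from dataclasses import dataclass

# Hypothetical category labels, borrowed from the wording of the quoted article.
DEFENSIBLE_SOURCES = {
    "licensed_archive",
    "public_domain",
    "creator_upload",
    "lawful_purchase",
}

@dataclass(frozen=True)
class ProvenanceRecord:
    item_id: str      # identifier of the image or text in the training set
    source_type: str  # one of the labels above, or anything else
    source_url: str   # where the item was obtained
    acquired_on: str  # ISO date, e.g. "2025-09-01"
    evidence: str     # receipt, contract, or terms reference; "" if none

def audit(records):
    """Split records into (defensible, flagged) lists."""
    defensible, flagged = [], []
    for rec in records:
        if rec.source_type in DEFENSIBLE_SOURCES and rec.evidence:
            defensible.append(rec)
        else:
            flagged.append(rec)  # unknown source type or no paper trail
    return defensible, flagged

# Made-up example records (example.com URLs are placeholders).
records = [
    ProvenanceRecord("img-001", "licensed_archive",
                     "https://example.com/archive", "2025-09-01", "contract-17"),
    ProvenanceRecord("img-002", "misc_web",
                     "https://example.com/mirror", "2025-09-02", ""),
    ProvenanceRecord("img-003", "lawful_purchase",
                     "https://example.com/shop", "2025-09-03", "receipt-88"),
]

defensible, flagged = audit(records)
print([r.item_id for r in flagged])  # ['img-002']
```

The point of the design is that the acquisition method is recorded per item at collection time, so a "misc. web" bucket shows up as flagged instead of disappearing into the pile.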

Andreus · May 02, 2026, 13:16

Quote from: Uncle Pete on May 01, 2026, 14:57

They've been hammering us for decades about how bad the piracy of copyrighted material is.
Then a tech giant uses a pirate site to train its models. That alone says a lot about their (lack of) ethics.

Uncle Pete · May 02, 2026, 17:07

Quote from: Andreus on May 02, 2026, 13:16

Whoever thought that copying books from a pirate site was a good idea, has a hole in their head. Poor thinking all the way. I suppose, in simple terms, rushing into this for an advantage and not due diligence? Or maybe just trying to get ahead of the competition and taking the loss later.

What's going to happen with photos? Well the detail I pointed out is, they would have to be registered before the infringement. That's going to eliminate many people who rely on the automatic copyright, not the registration process.

In addition, the courts are still leaning towards fair use for training, if collected from the proper sites. I see the whole settlement as good for us, because data/images will be collected and purchased more carefully, to avoid huge settlements like this. Keep in mind that Getty is suing others who used the Getty images for training AI. I don't think we'll see any of that money if they win, but at least it's forcing more ethical behavior and potentially more payments in the future.

Training is more organized now and coming from more specific data sets. The days of "get all you can, as fast as you can" are over.
