Copyright Wars in the AI Age - EU Digital Partners

Introduction

The New York Times filed a complaint against OpenAI and Microsoft in the Southern District of New York on December 27, 2023. The dispute centers on the alleged use of the Times’s copyrighted works to develop generative artificial intelligence (AI) products, specifically Microsoft’s Copilot (formerly Bing Chat) and OpenAI’s ChatGPT. The case, The New York Times Co. v. Microsoft Corp. et al., Case No. 1:23-cv-11195, raises a pivotal question:

When a tech company utilizes copyrighted material, including news articles, investigations, and opinion pieces, to train an AI chatbot capable of engaging in conversations on a myriad of topics, does it constitute plagiarism or an infringement of intellectual property rights?

Case background

The New York Times, in a comprehensive 70-page complaint, contends that Microsoft and OpenAI are implicated in a case of “systematic and competitive infringement.” Put simply, the dispute concerns the incorporation of the Times’s content into the large language models (LLMs) that power OpenAI’s ChatGPT and Microsoft’s Copilot. LLMs are trained on vast datasets, often sourced from the Internet, to recognize patterns and predict which words are likely to follow one another. This iterative process is what allows chatbots like ChatGPT to deliver coherent responses to user queries. The Times argues that constructing, storing, and copying text datasets that include its proprietary works during LLM training infringes the newspaper’s exclusive rights to that material.
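The idea of learning word associations from text can be illustrated with a deliberately tiny sketch. It is not how ChatGPT or Copilot work: real LLMs use neural networks trained on enormous datasets, whereas this toy simply counts which word follows which. The function names and the sample sentence are invented for illustration only.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Invented training sentence (an assumption for this sketch, not real data).
corpus = "the court heard the case and the court dismissed the claim"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))      # most frequent follower of "the"
print(predict_next(model, "verdict"))  # word never seen in training
```

Even in this toy, everything the model "knows" comes from the text it was trained on, which is why the content of the training dataset sits at the heart of the dispute.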

The lawsuit specifically alleges that OpenAI is engaging in unauthorized use of copyrighted material, capitalizing on the publication’s work and reputation for financial gain. The Times asserts that ChatGPT and analogous generative tools produce outputs that include portions substantially similar to the newspaper’s works, with some instances even featuring verbatim reproductions. The lawsuit contends that these generative AI tools directly compete with The Times by offering infringing substitute products for its copyrighted content.

In an effort to address these concerns, The Times initiated contact with Microsoft and OpenAI in April 2023, seeking to commence licensing negotiations. However, the complaint notes that despite the dialogue, no agreement has been reached between the parties. As a result, The Times has chosen to pursue legal action, alleging that the generative tools developed by Microsoft and OpenAI are not only infringing upon its exclusive rights but also pose a direct competitive threat to its content.

Issues explained

The New York Times has presented a dual-faceted case against OpenAI and Microsoft, asserting two primary grounds of copyright infringement.

First and foremost, echoing recent AI copyright lawsuits, The Times contends that its rights were violated through the “scraping” of its articles.
This involves the digital scanning and replication of content, which was subsequently included in the extensive datasets used to train GPT-4 and other AI models—a practice commonly referred to as the “input” side of the alleged infringement.
Secondly, The Times’s lawsuit points to instances where OpenAI’s GPT-4 language model, versions of which power ChatGPT and Bing, seemingly produced detailed summaries of paywalled articles, such as Wirecutter product reviews, or even entire sections of specific Times articles.
In essence, The Times argues that the alleged copyright violation extends beyond the input phase to the “output” generated by these AI tools.

As part of its case, The Times provides an illustrative example. An excerpt from its Pulitzer Prize-winning 2019 series on exploitative lending in New York City’s taxi industry was fed into ChatGPT, which responded with a near-verbatim recitation of the article’s text.

In the comparison, the contributions made by ChatGPT were highlighted in black. Notably, with “minimal prompting,” ChatGPT reproduced the excerpt with only a subtle shift in terminology: substituting “medallions” for “cabs” and employing “key initiatives” instead of “priorities.” Additionally, one word was added and six others were omitted.

This phenomenon is known as “memorization,” in which a model reproduces segments of the content it was trained on. Exhibit J to the complaint sets out 100 instances in which ChatGPT allegedly generated article text verbatim.
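A word-level comparison of this kind can be sketched in a few lines of Python. This is only an illustration of how overlap between a source passage and a model's output might be measured; it is not the method used in the complaint, and the two sample sentences and the function name are invented for the example rather than taken from any Times article.

```python
import difflib

def overlap_report(source, output):
    """Compare two texts word by word and summarize how much was reproduced."""
    src = source.split()
    out = output.split()
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)

    # Total number of source words that reappear, in order, in the output.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Longest uninterrupted run of identical words.
    longest = matcher.find_longest_match(0, len(src), 0, len(out))
    # Every place where the output departs from the source.
    changes = [
        (tag, " ".join(src[i1:i2]), " ".join(out[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return {
        "share_of_source_reproduced": matched / len(src) if src else 0.0,
        "longest_verbatim_run": longest.size,
        "changes": changes,
    }

# Invented sample texts (assumptions for this sketch, not real article text).
source = "the city sold taxi medallions at record prices while regulators ignored the warnings"
output = "the city sold taxi cabs at record prices while regulators ignored warnings"

report = overlap_report(source, output)
print(report["share_of_source_reproduced"])
print(report["longest_verbatim_run"])
print(report["changes"])
```

A high share of reproduced words combined with a handful of substitutions and omissions is the pattern the complaint describes. Whether such overlap amounts to infringement remains a legal question that no similarity score can settle.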

The Times contends that ChatGPT goes beyond merely scraping data from NYT articles or imitating its voice; rather, it produces output that not only recites Times content verbatim but also closely summarizes it and mimics its expressive style.

Navigating IPR dilemma

Under the U.S. Copyright Act of 1976, copyright owners, including prominent entities like The New York Times, hold exclusive rights to reproduce, distribute, and display their creative works. Training AI models on such works may infringe these rights, especially in the absence of proper authorization.

Section 106 Exclusive Rights: Section 106 of the Copyright Act of 1976 grants copyright owners exclusive rights over the reproduction of their copyrighted material, the creation of derivative works, and distribution. It also includes the rights to publicly perform and display works, with specific provisions for various types of creative works and sound recordings. The New York Times relies on these Section 106 rights in respect of its copyrighted material.
Section 501(a) Copyright Infringement Definition: Section 501(a) of the Copyright Act of 1976 defines an infringer as anyone who violates the exclusive rights of a copyright owner or author, as outlined in Section 106. If OpenAI and Microsoft used The New York Times’ material without permission, as the complaint alleges, they could be considered infringers under Section 501(a) for violating the exclusive rights granted by Section 106.
Section 506(a) Criminal Infringement Criteria: Section 506(a) of the Copyright Act of 1976 outlines the criteria for criminal infringement: a valid copyright, wilful infringement, and infringement for commercial advantage or private financial gain, with the infringer’s knowledge or awareness of its commercial intent. The actions of OpenAI and Microsoft could be deemed criminal infringement under Section 506(a) only if the government were able to establish each of these elements.
Section 107 Fair Use Provision: Section 107 of the Copyright Act of 1976 provides for fair use of copyrighted works for purposes such as criticism, comment, news reporting, teaching, scholarship, or research, exempting such use from copyright infringement. In the case of OpenAI and Microsoft, their defense may hinge on Section 107’s fair use provision, asserting that their use falls under fair use as it involves training AI models for innovative purposes, aligning potentially with research and development. However, the determination of fair use involves a careful consideration of factors such as the purpose and character of the use, the nature of the work, the amount used, and the impact on the market.

Fair use defence

The complaint anticipates that Microsoft and OpenAI may invoke fair use as a defense. According to the fair use doctrine outlined in 17 U.S.C. § 107, certain unlicensed uses of copyrighted material may be considered non-infringing if employed for purposes such as:

criticism
comment
news reporting
teaching
scholarship or
research.

The determination of fair use involves a careful consideration of four statutory factors. Recent copyright case law has placed increased emphasis on the transformative nature of the secondary use under the first factor, with many courts recognizing technological innovations as sufficiently transformative to warrant fair use protection.

The four factors influencing ‘fair use’ are:

the purpose and character of use
the nature of the copyrighted work
the amount and substantiality of the portion used and
the effect of the use on the potential market for the copyrighted work

The latter three factors appear to favour The Times in this case, as The Times argues that its works are highly creative, OpenAI uses the entirety of The Times’s works, and there is a claimed impact on revenue. However, the pivotal point revolves around the first factor — the purpose and character of OpenAI’s use — and whether it is deemed “transformative.”

In the landmark Feist Publications case, the U.S. Supreme Court held that information devoid of a minimum level of original creativity is not eligible for copyright protection. In essence, copyright safeguards creativity, not the process before or after creation. OpenAI’s argument centers on the new and distinct purpose of using articles for training and developing a language model, contrasting it with the simple act of reading or subscribing to news.

Renowned novelists, including David Baldacci, Jonathan Franzen, John Grisham, and Scott Turow, have also filed lawsuits against OpenAI and Microsoft in a Manhattan federal court, alleging potential co-opting of tens of thousands of their books by AI systems. In a separate case in San Francisco, comedian Sarah Silverman and other authors sued OpenAI and Meta Platforms for “ingesting” their works, resulting in a partial dismissal by a judge in November. The lawsuit by The Times comes seven years after the U.S. Supreme Court declined to revive a challenge to Google’s digital library of millions of books. In that case, a federal appeals court deemed the library, which provided access to text snippets, to be a fair use of authors’ works.

Way Ahead

The next stage in the legal proceedings is anticipated to unfold as Microsoft and OpenAI submit motions to dismiss or responses to the complaint by the February 26th deadline. In the absence of a settlement, one can anticipate a prolonged and vigorously contested legal conflict, akin to the protracted dispute surrounding Google Books, which spanned a decade without culminating in a trial. Irrespective of the resolution of the present case, it is certain that forthcoming copyright law decisions will carry significant consequences for both the field of artificial intelligence and copyright holders.

Author:
Kosha Doshi, Final Year Student at Symbiosis Law School, Pune, and Legal Intern, Data Privacy and Digital Law, at EU Digital Partners.
Kosha is also a co-author of “Facial Recognition at CrossRoads: Policy Perspectives on Disruption and Innovation,” at the Closing the Gap 2023 | Emerging and Disruptive Technologies: Regional Perspectives Conference in the Hague, Netherlands.