In 2023, The New York Times sued OpenAI. Getty Images sued Stability AI. Sarah Silverman and other writers filed class action lawsuits against major AI companies. The fight wasn’t over AI’s capabilities. It was over data: who owns it, who can use it, and what compensation creators deserve when their work trains AI models.
This is the legal frontier of AI—and the outcome will determine whether AI companies can train their models freely or whether they’ll need to license data and compensate creators.
If you use AI in any capacity, or if your content is being used to train AI models, understanding copyright and data rights isn’t optional. The law is still being written, but the direction is clear: creators will have more rights, and AI companies will have fewer.
The Core Conflict: Fair Use vs. Commercial Exploitation
AI copyright disputes fundamentally ask: does training an AI on copyrighted work count as “fair use”?
This question will define the AI industry for the next decade. And both sides have legitimate arguments.
Fair Use Doctrine: The Original Framework
Under US copyright law, “fair use” allows limited use of copyrighted material without permission. Fair use applies when:
- The use is transformative (creates new value, doesn’t just copy the original).
- The amount used is reasonable (you don’t reproduce the entire work).
- The original market isn’t harmed (the new use doesn’t cannibalize sales of the original).
- It’s for criticism, commentary, education, or research (non-commercial purposes get more deference).
Classic fair use: a book reviewer quoting a paragraph to analyze a novel. The use is transformative (creates critical analysis), limited (one paragraph, not the whole book), doesn’t harm the book’s sales, and serves an educational purpose.
AI Training vs. Fair Use: The Legal Tension
AI companies argue that training models on copyrighted data is fair use because:
- Transformation argument: Feeding billions of articles into an AI model creates something fundamentally new (a language model), not a copy of the original. The model learns patterns and associations, not articles themselves.
- Limited extraction: The AI doesn’t copy articles verbatim. It learns patterns and associations. The output is new text, not reproductions.
- Non-competing use: An AI model isn’t a replacement for the original work. A ChatGPT response isn’t a newspaper article. They serve different purposes.
- Education and research: AI training is, in a sense, research—analyzing patterns in human-generated text.
But creators argue the opposite:
- Market harm is real and measurable: When an AI trained on journalists’ work produces articles without compensation, it cannibalizes demand for human journalism. The New York Times argued that GPT-4 can generate articles that directly compete with their reporting. Why would a reader pay for Times journalism when they can get free AI summaries?
- Scale changes the calculus: Quoting one paragraph from one book is fair use. Ingesting the entirety of human-generated content (billions of articles, millions of books, terabytes of images) to build a commercial product isn’t the same thing. Scale matters legally.
- Licensing precedent exists: After Getty Images sued Google over the display of its photos in image search, the two companies settled in 2018 with a licensing partnership. If indexing copyrighted images were clearly fair use, why did Google ultimately take a licensing approach?
- Compensation precedent: When tech companies license music from record labels, or news from news agencies, they pay. Why should AI be different?
As of March 2024, no court has definitively ruled on whether AI training counts as fair use. But several cases are in progress, and early judicial signals suggest fair use will be narrower for AI than tech companies hoped. The courts seem skeptical of the “it’s research” argument when billions of dollars are at stake.
The New York Times v. OpenAI: The Landmark Case
The Times’ lawsuit against OpenAI and Microsoft is the most high-profile AI copyright case. The stakes are enormous for everyone.
What the Times Alleged
The Times provided evidence that:
- OpenAI trained GPT-4 on millions of New York Times articles without permission or compensation. The Times demonstrated that GPT-4 was specifically trained on Times content by analyzing patterns in its outputs.
- GPT-4 can reproduce verbatim or near-verbatim passages from Times articles. The Times showed examples where GPT-4 outputs were nearly identical to published articles.
- This directly harms the Times’ business by cannibalizing demand for their journalism. Readers who would pay for Times content can now get AI summaries for free.
- OpenAI has built a company valued in the tens of billions of dollars using copyrighted content without licensing it. The Times argued that licensing its content at market rates would have cost OpenAI millions of dollars.
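Claims of near-verbatim reproduction like these can be checked mechanically. A minimal sketch (the function and sample strings are illustrative, not the Times' actual methodology): compare word n-gram "shingles" between a source article and a model output. High overlap at a large n suggests memorization rather than paraphrase.

```python
def ngram_overlap(source: str, generated: str, n: int = 8) -> float:
    """Fraction of the generated text's n-word shingles that also appear
    verbatim in the source. High overlap at a large n suggests memorization."""
    def shingles(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    src, gen = shingles(source), shingles(generated)
    if not gen:
        return 0.0
    return len(gen & src) / len(gen)

article = ("the quick brown fox jumps over the lazy dog "
           "near the quiet river bank at dawn")
near_copy = article + " today"
paraphrase = "a fast brown fox leaped over a sleepy dog by the river"

print(ngram_overlap(article, near_copy))   # high: near-verbatim
print(ngram_overlap(article, paraphrase))  # zero: genuine rewording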
OpenAI’s Defense
OpenAI argues:
- Training is fair use under established copyright doctrine. They’re not copying articles; they’re learning from them.
- GPT doesn’t reproduce articles; it generates new text based on learned patterns. The output is transformative.
- The business model (paid API access) is transformative and non-competing with journalism. GPT-4 isn’t a newspaper replacement; it’s a different product.
- Fair use is essential to AI development. If companies had to license every data source, AI development would become impossibly expensive.
The Broader Implications
If courts rule against OpenAI, the implications are seismic for the entire industry:
- Licensing requirement: AI companies would need to license training data or face liability. This dramatically increases costs and changes business models.
- Retroactive compensation: Courts might order damages for past unauthorized use. OpenAI could face billions in damages.
- Industry disruption: Training AI models on billions of unlicensed documents becomes impossible without permission and payment. The free-data era ends.
- Precedent for other creators: If the Times wins, musicians, photographers, programmers, and other creators will file similar lawsuits. The legal precedent cascades.
If courts rule for OpenAI, creators lose leverage and AI companies can continue training freely. This would establish that fair use applies broadly to AI training, regardless of scale.
Global Regulatory Response: The Faster Track
While courts deliberate, regulators are moving faster. They’re not waiting for judicial clarity.
European Union: AI Act & Copyright Directive
The EU’s AI Act (adopted in 2024, with obligations phasing in over the following years) includes specific provisions on AI training data:
- Transparency requirement: AI companies must disclose which copyrighted works were used in training. No more black-box training. Regulators want to know what data was used.
- Opt-out rights for creators: Authors and artists can request their work be excluded from training datasets. This is a creator empowerment mechanism.
- Licensing incentives: The EU rewards licensed data use with reduced compliance burdens for systems trained on licensed data.
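The AI Act requires a "sufficiently detailed summary" of training content but, as of this writing, prescribes no schema. A hypothetical machine-readable manifest (the field names below are assumptions for illustration, not a mandated format) might look like:

```python
import json

# Hypothetical disclosure manifest: the AI Act demands a summary of
# training content but fixes no schema, so every field here is assumed.
manifest = {
    "model": "example-lm-7b",      # assumed model identifier
    "training_cutoff": "2024-01",
    "sources": [
        {"name": "Common Crawl snapshot", "license": "mixed/unlicensed",
         "opt_out_honored": True},
        {"name": "Licensed news archive", "license": "commercial license",
         "opt_out_honored": True},
        {"name": "Public domain books", "license": "public domain",
         "opt_out_honored": False},  # no opt-out applies to public domain
    ],
}
print(json.dumps(manifest, indent=2))
```

The point of a structured format is that regulators (and creators checking for their own work) could query disclosures automatically instead of reading prose summaries.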
The EU’s 2019 Copyright Directive goes further: commercial text-and-data mining is permitted only where rights holders have not reserved their rights, so AI systems trained at scale must honor machine-readable opt-outs or obtain licenses. It’s a regulation-by-design approach.
United Kingdom
The UK has taken a lighter-touch, principles-based approach to AI regulation. Notably, after pushback from the creative industries, the government shelved a proposed broad text-and-data-mining exception that would have allowed commercial AI training on copyrighted works without permission. Nothing binding has passed, but the signal is that regulators expect AI companies to compensate creators and use data transparently.
United States: Legislative Proposals (Moving but Slow)
Several proposed bills would create AI copyright protections:
- The COPIED Act (Content Origin Protection and Integrity from Edited and Deepfaked Media Act): Would require provenance labeling for digital content and prohibit using content that carries provenance information to train AI without the owner’s consent.
- The Generative AI Copyright Disclosure Act: Would mandate disclosure of training data sources (transparency requirement).
- State-level efforts: California and New York are proposing AI transparency and compensation laws that could become templates for federal law.
Training Data Licensing: The Emerging Market
As copyright restrictions tighten, a new market is emerging: training data licensing. Companies are figuring out how to compensate creators while getting legal training data.
What This Looks Like in Practice
Image licensing deals: Getty Images partnered with NVIDIA on an image generator trained exclusively on Getty’s licensed library, with revenue shared with contributors; Shutterstock separately licensed its library to OpenAI for training. The rights holder gets paid; the AI company gets licensed content; photographers are (at least in principle) compensated through the platform. This is the model regulators want to see replicated.
Journalism licensing: The Associated Press, Axel Springer, the Financial Times, and other publishers have signed licensing deals with AI companies, and more are negotiating. Instead of allowing free training access, they negotiate lump-sum or recurring fees; some deals also cover attributed display of their content inside AI products.
Writer compensation models: Some platforms are experimenting with direct creator compensation: if your writing helps train a model, you get paid based on usage. These experiments are early, and no standard has emerged.
Book licensing: Publishers are negotiating licensing deals for AI training. Some refuse entirely; others negotiate rates. The Authors Guild is advocating for compensation standards.
The Economics of Licensed Training Data
Licensed data is more expensive. Training GPT-4 on licensed data would cost significantly more than the zero-cost approach of using unlicensed content. OpenAI, Anthropic, and Meta are all exploring licensing to reduce legal risk and improve reputational standing.
The cost gets passed to end users (higher subscription prices for ChatGPT, Claude, etc.) or absorbed through thinner profit margins. Economically, this is efficient: it internalizes the cost of using creators’ work, so that cost shows up in pricing.
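A back-of-envelope sketch of what licensed data adds to a training budget. Every number below is an illustrative assumption, not a figure from any actual deal or model:

```python
# All figures are illustrative assumptions, not real licensing terms.
training_tokens = 10e12            # assume a 10-trillion-token corpus
licensed_fraction = 0.05           # assume 5% comes from licensed news/books
fee_per_million_tokens = 10.00     # assumed licensing rate, USD per 1M tokens

licensed_tokens = training_tokens * licensed_fraction
cost = licensed_tokens / 1e6 * fee_per_million_tokens
print(f"licensing cost: ${cost:,.0f}")
```

Even at modest assumed rates, licensing a few percent of a large corpus adds millions of dollars to a single training run, and those costs have to land somewhere.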
This is also good for creators: a financial incentive to use licensed data means compensation flows back to original authors, photographers, musicians, and artists. The market starts aligning incentives toward creator compensation.
Fair Use in AI: What’s Actually Changing?
Even if courts ultimately narrow fair use, some AI applications will still qualify. The boundaries are shifting, not disappearing.
Fair Use Likely Still Applies To:
- Research and criticism: Training AI specifically to analyze, critique, or understand existing works (academic use, non-commercial research).
- Index and search: Using copyrighted data to build searchable indices (similar to how Google Scholar indexes academic papers).
- Parody and transformative commentary: AI trained to generate parodies or analyze literary style for educational purposes.
- Non-competitive AI: Training models that don’t compete with the original market (e.g., training on books to build a recommendation system, not to generate competing books).
Fair Use Less Likely To Apply To:
- Commercial language models trained on copyrighted text to generate competing products (ChatGPT trained on news articles to generate summaries that compete with the news).
- Image generators trained on artistic works to reduce demand for commissioned art or to generate similar images.
- Large-scale ingestion of entire databases (all of Wikipedia, all of published books, all of Getty’s image library, etc.) without licensing or compensation.
What This Means for Different Stakeholders
For Content Creators
Protect yourself now:
- Add copyright notices to your published work. It signals that you assert copyright and claim compensation rights. Make your copyright claim explicit.
- Monitor AI training datasets: Tools like HaveIBeenTrained.com (for images), Authors Guild resources (for books), and Copyscape derivatives (for text) let you check if your work was used in popular AI models.
- Negotiate licensing deals: If you have a significant body of work, contact AI companies about licensing compensation. Early deals will set better precedents than later ones.
- Support advocacy organizations pushing for stronger copyright protections (Authors Guild, Getty Images coalition, photographer associations, etc.). Collective advocacy is more powerful than individual action.
- Document your work and creation dates: Legal claims require proof of original authorship and creation date. Make this documentation airtight.
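One concrete opt-out that exists today: the major AI crawlers publish user-agent strings (OpenAI’s GPTBot, Google’s Google-Extended, Common Crawl’s CCBot) that respect robots.txt. A robots.txt like the following blocks them, though compliance is voluntary and it only affects future crawls, not data already collected:

```
# robots.txt — opt out of known AI training crawlers
# (honored voluntarily; affects future crawls, not data already collected)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```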
For AI Companies
The smart move is to proactively license:
- Reduce legal risk: Licensed data removes the fair use question. You have explicit permission. This is worth the licensing cost.
- Build reputation: Companies that compensate creators have better brand positioning and attract talent more easily. Being known as fair to creators matters in hiring and partnerships.
- Diversify sources: Mix licensed commercial data with open-source and public domain sources to reduce costs while maintaining quality.
- Transparent training: Disclose which sources you used. This builds trust and positions you well for future regulation.
For Regulators
The momentum is clear: mandatory licensing and transparency. Expect:
- Disclosure requirements: AI companies must list which copyrighted sources were used in training. This becomes standard practice.
- Opt-out rights: Creators can request removal of their work from training datasets. Regulators are empowering individual agency.
- Compensation frameworks: Regulators are designing mechanisms to allocate licensing revenue back to creators. This is the public policy frontier.
- Licensing standards: Expect emerging standards for what constitutes “fair” licensing rates and creator compensation.
The Bottom Line: Data Rights Are Being Codified
The era of “train on anything without permission” is ending. Whether through courts, regulation, or market forces, AI companies will face pressure to license training data and compensate creators.
This creates friction and higher costs. But it also creates an incentive structure where creators are compensated for their work and have agency over how it’s used. The AI industry will mature, costs will rise, but fairness will improve.
The legal frontier is still being written, but the direction is clear: fair use won’t be a blanket defense for AI training at scale. Data rights—the rights of creators to control and monetize their work—are being codified into law.
If you’re training AI models or relying on AI to generate content, factor in licensing costs now. If you’re a creator, document your work and monitor how it’s used. The next 2-3 years will define who owns data rights in the AI era, and being proactive now puts you in a stronger position.