
Famous Authors Sue OpenAI, Anthropic and Google For Book Piracy to Train AI

Hello HaWkers, a new chapter in the battle between content creators and artificial intelligence companies is being written in American courts. A group of renowned authors, including John Carreyrou, author of the bestseller "Bad Blood", has filed a lawsuit against six tech giants, accusing them of using pirated copies of their books to train AI models.

Have you ever stopped to think about where all the knowledge that chatbots like ChatGPT and Claude demonstrate about literature, science, and history comes from? The answer may involve practices that border on piracy.

The Details of the Lawsuit

The lawsuit was filed in December 2025 and targets the world's largest AI companies:

Companies Being Sued

Defendants in the lawsuit:

  • OpenAI (ChatGPT, GPT-4, GPT-5)
  • Anthropic (Claude)
  • Google (Gemini)
  • Meta (LLaMA)
  • xAI (Grok)
  • Perplexity

The Main Accusation

The authors allege that these companies trained their language models using pirated copies of their books, obtained from illegal ebook sharing sites. The accusation is serious because:

Points of the accusation:

  • Books were obtained from known pirate sources
  • No license or permission was requested
  • No compensation was offered
  • Commercial models profit from the content

Who Are the Authors

The group of authors represents a diversity of genres and styles:

Main Names Involved

John Carreyrou:

  • Author of "Bad Blood", about the Theranos scandal
  • Investigative journalist for the Wall Street Journal
  • His book sold millions of copies worldwide

Other participating authors:

  • Fiction and non-fiction writers
  • Journalists and biographers
  • Authors of technical and scientific books

The diversity of the group shows that the problem affects the entire publishing industry.

The Evidence Presented

The lawsuit presents evidence that AI models know the content of the books in ways that suggest they were trained directly on the full texts:

Demonstrations in the Lawsuit

Test 1 - Exact Quotes:
When asked to cite specific passages from the books, the models frequently produce excerpts that match the original text word for word.

Test 2 - Structural Knowledge:
Models demonstrate knowledge of the structure and organization of books that would be unlikely without access to the complete text.

Test 3 - Piracy Traces:
Some model outputs include artifacts typical of pirated ebooks, such as incorrectly removed watermarks or broken formatting.
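The "exact quotes" test above can be sketched in code. The snippet below is a minimal, illustrative version of such a memorization check (not the plaintiffs' actual methodology): it measures how much of a model's output reproduces a source text verbatim, using overlapping word n-grams. The function names, the n-gram size, and the sample strings are all assumptions for the example.

```python
# Illustrative memorization check: what fraction of a model's output
# appears verbatim (as whole n-grams) in a candidate source text?

def ngrams(text, n=8):
    """Set of all n-word sequences in `text`, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(model_output, source_text, n=8):
    """Fraction of the output's n-grams found verbatim in the source."""
    out = ngrams(model_output, n)
    if not out:
        return 0.0
    src = ngrams(source_text, n)
    return len(out & src) / len(out)

# Made-up strings for demonstration only:
source = "one two three four five six seven eight nine ten"
output = "one two three four five six seven eight"
score = verbatim_overlap(output, source)  # high score = likely memorized
```

A high overlap score suggests memorized rather than paraphrased text, which is the intuition behind this kind of evidence.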

The AI Companies' Defense

AI companies have used various arguments in their defense in similar cases:

Common Arguments

Fair Use:
Companies argue that using texts for training constitutes "fair use" under US copyright law, similar to how search engines index content.

Transformation:
The argument is that models don't reproduce texts but transform them into general knowledge, creating something new.

Public Benefit:
The thesis that AI benefits society as a whole, justifying the use of diverse data for training.

Authors' Counterpoints

Argument 1: Fair use doesn't apply to large-scale commercial use
Argument 2: Models can and do reproduce literal excerpts
Argument 3: Authors don't consent to "public benefit" at the expense of their rights

The Impact on the AI Industry

This lawsuit could have significant consequences:

Possible Scenarios

If authors win:

  • Companies may need to pay retroactive royalties
  • New models will need content licensing
  • AI training cost will increase significantly
  • Smaller models from resource-limited companies may disappear

If companies win:

  • Legal precedent for data use in training
  • Other creators will have less legal recourse
  • Could accelerate AI development
  • Ethical questions will remain

The Ethical Question

Beyond legal issues, there's an important ethical debate:

Different Perspectives

Authors' view:

  • Creative work has value and should be compensated
  • Consent is fundamental
  • Corporate profit doesn't justify value extraction

Companies' view:

  • AI benefits all of society
  • Models don't replace original books
  • Restrictions could delay technological progress

Intermediate view:

  • Licensing system could benefit both
  • Fair compensation is possible
  • Transparency about training data is necessary

What This Means For Developers

As a developer, you might be thinking: how does this affect me?

Practical Implications

1. Use of AI APIs:
If companies are found liable, API costs may increase to cover licensing.

2. Model Development:
Startups wanting to train their own models will need to be more careful with data sources.

3. Code and Documentation:
The same debate applies to source code used to train programming models (Copilot, etc.).
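Point 2 above, being careful with data sources, can be made concrete. The sketch below is a hypothetical provenance filter (the record format, field names, and allow-list are all assumptions, not any company's actual pipeline): it keeps only documents whose license is on an approved list and whose origin is recorded, and tracks everything it excludes for auditing.

```python
# Hypothetical provenance filter for a training corpus: keep only
# records with an approved license and a recorded source.

ALLOWED_LICENSES = {"public-domain", "cc0", "licensed-by-agreement"}

def filter_corpus(records):
    """Split records into (kept, rejected) by license and provenance."""
    kept, rejected = [], []
    for rec in records:
        if rec.get("license") in ALLOWED_LICENSES and rec.get("source_url"):
            kept.append(rec)
        else:
            rejected.append(rec)  # keep an audit trail of exclusions
    return kept, rejected

corpus = [
    {"text": "...", "license": "cc0", "source_url": "https://example.org/a"},
    {"text": "...", "license": "unknown", "source_url": None},
]
kept, rejected = filter_corpus(corpus)  # 1 kept, 1 rejected
```

Keeping the rejected records around (rather than silently dropping them) is what makes the transparency demands discussed later in this article practical to satisfy.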

The Code Question

This lawsuit focuses on books, but the same questions apply to code:

Open questions:

  • Can open source code be used for training?
  • Do licenses like GPL apply to AI outputs?
  • Should developers be compensated?

Similar lawsuits involving code are already underway against GitHub and Microsoft over Copilot.

Licensing Initiatives

Some companies are already moving toward more ethical models:

Licensing Examples

Existing agreements:

  • Reddit licensed content to Google
  • News Corp made a deal with OpenAI
  • Shutterstock licenses images for training
  • Stack Overflow negotiates licensing

The emerging model:

  • Content platforms negotiate collective agreements
  • Individual authors can opt in
  • Royalties distributed based on usage
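The last bullet, royalties distributed based on usage, reduces to a simple proportional split. The sketch below shows one way it could work; the pool size and usage counts are made up for the example, and real agreements would be far more complex.

```python
# Illustrative usage-based royalty split: each author's share of the
# pool is proportional to how often their works were used.

def distribute_royalties(pool, usage_counts):
    """Return each author's share of `pool`, proportional to usage."""
    total = sum(usage_counts.values())
    if total == 0:
        return {author: 0.0 for author in usage_counts}
    return {author: pool * count / total
            for author, count in usage_counts.items()}

shares = distribute_royalties(1000.0, {"author_a": 3, "author_b": 1})
# author_a receives 750.0, author_b receives 250.0
```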

The Future of AI Training

This lawsuit could define how AI will be trained in the future:

Likely Trends

Short term (2026):

  • More transparency about datasets
  • Opt-out options for creators
  • First important court decisions

Medium term (2027-2028):

  • Standardized licensing systems
  • Compensation for training use
  • "Ethical AI" certifications

Long term:

  • Models trained only with licensed data
  • Consolidated training data market
  • Clear government regulation

How to Follow the Case

If you want to follow the development of this lawsuit:

Resources

Journalistic coverage:

  • TechCrunch
  • The Verge
  • Ars Technica

Legal documents:

  • PACER (US federal court system)
  • CourtListener

Specialized analyses:

  • EFF (Electronic Frontier Foundation)
  • Authors Guild

Conclusion

The lawsuit filed by authors against AI companies represents a decisive moment for the industry. The decisions that emerge from this and similar cases will define the rules of the game for AI development in the coming decades.

For developers, it's important to follow these developments because they will directly affect the tools we use, the costs involved, and the ethical questions we'll need to consider when creating AI-powered products.

Technology advances, but questions about rights, compensation, and consent are fundamental to a healthy and sustainable ecosystem.

If you're interested in AI ethics issues, I recommend checking out another article: Anthropic Detects AI Being Used in Sophisticated Cyberattacks where you'll discover the emerging risks of autonomous AI.

Let's go! 🦅
