Famous Authors Sue OpenAI, Anthropic and Google For Book Piracy to Train AI
Hello HaWkers, a new chapter in the battle between content creators and artificial intelligence companies is being written in American courts. A group of renowned authors, including John Carreyrou, author of the bestseller "Bad Blood", has filed a lawsuit against six tech giants, accusing them of using pirated copies of their books to train AI models.
Have you ever stopped to think about where all the knowledge that chatbots like ChatGPT and Claude demonstrate about literature, science, and history comes from? The answer may involve practices that border on piracy.
The Details of the Lawsuit
The lawsuit was filed in December 2025 and targets six of the world's largest AI companies:
Companies Being Sued
Defendants in the lawsuit:
- OpenAI (ChatGPT, GPT-4, GPT-5)
- Anthropic (Claude)
- Google (Gemini)
- Meta (LLaMA)
- xAI (Grok)
- Perplexity
The Main Accusation
The authors allege that these companies trained their language models using pirated copies of their books, obtained from illegal ebook sharing sites. The accusation is serious because:
Points of the accusation:
- Books were obtained from known pirate sources
- No license or permission was requested
- No compensation was offered
- Commercial models profit from the content
Who Are the Authors
The group of authors represents a diversity of genres and styles:
Main Names Involved
John Carreyrou:
- Author of "Bad Blood", about the Theranos scandal
- Former investigative journalist at the Wall Street Journal
- His book sold millions of copies worldwide
Other participating authors:
- Fiction and non-fiction writers
- Journalists and biographers
- Authors of technical and scientific books
The diversity of the group shows that the problem affects the entire publishing industry.
The Evidence Presented
The lawsuit presents evidence that AI models know the content of the books in ways that suggest they were trained directly on the texts:
Demonstrations in the Lawsuit
Test 1 - Exact Quotes:
When asked to cite specific passages from the books, the models frequently produce excerpts that match the original text word for word.
Test 2 - Structural Knowledge:
Models demonstrate knowledge of the structure and organization of books that would be unlikely without access to the complete text.
Test 3 - Piracy Traces:
Some model outputs include artifacts typical of pirated ebooks, such as imperfectly stripped watermarks or broken formatting.
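A toy version of the first test, measuring how much of a model's output overlaps verbatim with an original passage via shared n-grams, might look like this. Both texts below are invented for illustration; real litigation tests are far more rigorous:

```python
def ngrams(text, n):
    """All word-level n-grams in a text, as a set."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(original, model_output, n=8):
    """Fraction of the model output's n-grams that appear verbatim in the original."""
    out = ngrams(model_output, n)
    if not out:
        return 0.0
    return len(out & ngrams(original, n)) / len(out)

# Hypothetical texts (not from any real book):
original = ("the quick brown fox jumps over the lazy dog while "
            "the cat watches from the windowsill every single morning")
output = "the quick brown fox jumps over the lazy dog while the cat naps indoors"

print(verbatim_overlap(original, output, n=5))  # → 0.8
```

A high overlap score at a long n-gram length is hard to explain without the original text having been in the training data, which is essentially the argument the plaintiffs make.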
The AI Companies' Defense
AI companies have used various arguments in their defense in similar cases:
Common Arguments
Fair Use:
Companies argue that using texts for training constitutes "fair use" under US copyright law, analogous to how search engines index content.
Transformation:
The argument is that models don't reproduce texts but transform them into general knowledge, creating something new.
Public Benefit:
The thesis is that AI benefits society as a whole, which justifies the use of diverse data for training.
Authors' Counterpoints
Argument 1: Fair use doesn't apply to large-scale commercial use
Argument 2: Models can and do reproduce literal excerpts
Argument 3: Authors don't consent to "public benefit" at the expense of their rights
The Impact on the AI Industry
This lawsuit could have significant consequences:
Possible Scenarios
If authors win:
- Companies may need to pay retroactive royalties
- New models will need content licensing
- AI training cost will increase significantly
- Smaller models from resource-limited companies may disappear
If companies win:
- Legal precedent for data use in training
- Other creators will have less legal recourse
- Could accelerate AI development
- Ethical questions will remain
The Ethical Question
Beyond legal issues, there's an important ethical debate:
Different Perspectives
Authors' view:
- Creative work has value and should be compensated
- Consent is fundamental
- Corporate profit doesn't justify value extraction
Companies' view:
- AI benefits all of society
- Models don't replace original books
- Restrictions could delay technological progress
Intermediate view:
- Licensing system could benefit both
- Fair compensation is possible
- Transparency about training data is necessary
What This Means For Developers
As a developer, you might be thinking: how does this affect me?
Practical Implications
1. Use of AI APIs:
If companies are found liable, API costs may increase to cover licensing.
2. Model Development:
Startups wanting to train their own models will need to be more careful with data sources.
3. Code and Documentation:
The same debate applies to source code used to train programming models (Copilot, etc.).
The Code Question
This lawsuit focuses on books, but the same questions apply to code:
Open questions:
- Can open source code be used for training?
- Do licenses like GPL apply to AI outputs?
- Should developers be compensated?
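To make the code question concrete, a training pipeline that tries to respect licenses might filter repositories by license identifier before ingestion. A minimal sketch; the permissive-license list and the repository records are hypothetical, and real pipelines would need far more careful license detection:

```python
# Licenses treated as permissive enough for training (an assumption for illustration,
# not legal advice -- which licenses actually permit training is exactly what's disputed).
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}

repos = [
    {"name": "example/utils", "license": "MIT"},
    {"name": "example/kernel", "license": "GPL-3.0"},
    {"name": "example/parser", "license": "Apache-2.0"},
    {"name": "example/unknown", "license": None},
]

def allowed_for_training(repo):
    """Conservative policy: include only repos with a known permissive license."""
    return repo["license"] in PERMISSIVE

training_set = [r["name"] for r in repos if allowed_for_training(r)]
print(training_set)  # → ['example/utils', 'example/parser']
```

Note that the conservative policy also excludes repositories with no detectable license, since "no license" does not mean "free to use".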
Similar lawsuits involving code are already underway against GitHub and Microsoft over Copilot.
Licensing Initiatives
Some companies are already moving toward more ethical models:
Licensing Examples
Existing agreements:
- Reddit licensed content to Google
- News Corp made a deal with OpenAI
- Shutterstock licenses images for training
- Stack Overflow negotiates licensing
The emerging model:
- Content platforms negotiate collective agreements
- Individual authors can opt in
- Royalties distributed based on usage
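The last point, royalties distributed based on usage, amounts to a pro-rata split of a royalty pool. A minimal sketch, with invented author names and counts; real schemes would need to define "usage" precisely (training tokens? output citations?), which is an open question:

```python
def distribute_royalties(pool, usage_counts):
    """Split a royalty pool pro rata by how often each author's work was used."""
    total = sum(usage_counts.values())
    if total == 0:
        return {author: 0.0 for author in usage_counts}
    return {author: pool * count / total for author, count in usage_counts.items()}

# Hypothetical pool and usage counts:
payouts = distribute_royalties(10_000.0, {"author_a": 600, "author_b": 300, "author_c": 100})
print(payouts)  # → {'author_a': 6000.0, 'author_b': 3000.0, 'author_c': 1000.0}
```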
The Future of AI Training
This lawsuit could define how AI will be trained in the future:
Likely Trends
Short term (2026):
- More transparency about datasets
- Opt-out options for creators
- First important court decisions
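One opt-out mechanism already exists today: robots.txt directives targeting AI training crawlers. A site owner who wants to block OpenAI's GPTBot and Google's AI-training crawler (Google-Extended) can add:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

This only covers future crawling by companies that honor these tokens; it does nothing about content already collected, which is part of what the lawsuit is about.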
Medium term (2027-2028):
- Standardized licensing systems
- Compensation for training use
- "Ethical AI" certifications
Long term:
- Models trained only with licensed data
- Consolidated training data market
- Clear government regulation
How to Follow the Case
If you want to follow the development of this lawsuit:
Resources
Journalistic coverage:
- TechCrunch
- The Verge
- Ars Technica
Legal documents:
- PACER (US federal court system)
- CourtListener
Specialized analyses:
- EFF (Electronic Frontier Foundation)
- Authors Guild
Conclusion
The lawsuit filed by authors against AI companies represents a decisive moment for the industry. The decisions that emerge from this and similar cases will define the rules of the game for AI development in the coming decades.
For developers, it's important to follow these developments because they will directly affect the tools we use, the costs involved, and the ethical questions we'll need to consider when creating AI-powered products.
Technology advances, but questions about rights, compensation, and consent are fundamental to a healthy and sustainable ecosystem.
If you're interested in AI ethics issues, I recommend checking out another article: Anthropic Detects AI Being Used in Sophisticated Cyberattacks where you'll discover the emerging risks of autonomous AI.

