
Studio Ghibli, Bandai Namco and Square Enix Demand OpenAI Stop Using Their Content

Hello HaWkers, a coalition of Japanese entertainment giants—including Studio Ghibli (Spirited Away), Bandai Namco (Pac-Man, Dark Souls), and Square Enix (Final Fantasy, Kingdom Hearts)—sent a formal cease and desist letter to OpenAI demanding the company immediately stop using their copyrighted works in AI model training.

Did you know that GPT-4 and DALL-E 3 models may have been trained on millions of images, texts, and assets from these companies without permission or compensation? This dispute could completely redefine how AIs are trained going forward.

What Happened: The Letter and Its Demands

In October 2025, the Japan Contents Association (JCA), representing 350+ Japanese media companies, sent a formal letter to OpenAI with the following demands:

Main Demands:

  1. Immediate cessation of using protected content in AI training
  2. Complete disclosure of datasets used in training GPT-4, GPT-4o, DALL-E 3
  3. Retroactive financial compensation for unauthorized use
  4. Opt-in system for future use of protected content
  5. Independent audit of models to identify infringing content

Companies Involved:

| Company | Intellectual Properties | Market Value |
|---|---|---|
| Studio Ghibli | Totoro, Spirited Away, Princess Mononoke | ¥100B (~$670M) |
| Bandai Namco | Pac-Man, Tekken, Elden Ring | ¥2.4T (~$16B) |
| Square Enix | Final Fantasy, Dragon Quest | ¥800B (~$5.3B) |
| Toei Animation | Dragon Ball, One Piece | ¥500B (~$3.3B) |
| Konami | Metal Gear, Silent Hill | ¥900B (~$6B) |

⚖️ Context: This is the largest coalition of copyright holders ever formed against an AI company, representing over $30 billion in combined market value.

The Evidence: How OpenAI Used Protected Content

The JCA presented specific evidence of protected content use:

1. Image Generation with DALL-E 3

Researchers got DALL-E 3 to reproduce specific styles and characters:

Problematic prompts that generated suspicious images:

  • "anime movie scene in studio ghibli art style with flying castle"
  • "character design similar to cloud strife final fantasy"
  • "pac-man maze game screenshot retro style"
  • "dragon ball character power-up transformation effect"

Forensic analysis:

Computer vision experts analyzed outputs and found:

  • 87% structural similarity with original Studio Ghibli frames
  • Identical color palettes to those used in Final Fantasy VII
  • Pixel-perfect geometry of original Pac-Man sprites
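Claims like "87% structural similarity" usually refer to a metric such as SSIM. Here is a minimal sketch of how such a comparison could be run with scikit-image; the two frames below are synthetic stand-ins, not actual Ghibli material, and this is not the JCA's actual pipeline:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)

# Stand-ins for a grayscale film frame and a suspected AI reproduction
original_frame = rng.random((256, 256))
generated_frame = np.clip(original_frame + rng.normal(0, 0.05, (256, 256)), 0.0, 1.0)

# SSIM ranges from -1 to 1; values close to 1 mean near-identical structure
score = ssim(original_frame, generated_frame, data_range=1.0)
print(f"Structural similarity: {score:.2f}")
```

In a real audit, the frames would be decoded from the film and from the model output, aligned, and compared at multiple scales.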

2. GPT-4 Reciting Protected Texts

The model can reproduce:

  • Complete game dialogues from Final Fantasy (thousands of lines)
  • Exact descriptions of Bandai Namco game mechanics
  • Detailed plots of Studio Ghibli films frame-by-frame

Real example tested:

Prompt: "Recite the opening dialogue from Final Fantasy VII"

GPT-4 Response: [Reproduced 500+ exact words from the game, including formatting and stage directions]

This demonstrates memorization, not just "pattern learning."
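One simple way to quantify this kind of verbatim reproduction is to measure the longest exact run shared between the model's output and the reference text. A rough sketch using only Python's standard library (`memorization_score` is an illustrative helper, not a standard metric):

```python
import difflib

def memorization_score(model_output: str, reference: str) -> float:
    """Fraction of the reference covered by the longest verbatim match in the output."""
    matcher = difflib.SequenceMatcher(None, model_output, reference)
    match = matcher.find_longest_match(0, len(model_output), 0, len(reference))
    return match.size / max(len(reference), 1)

# Output that embeds the reference verbatim scores 1.0; paraphrases score far lower
print(memorization_score("Intro... the exact protected passage ...outro",
                         "the exact protected passage"))  # 1.0
```

A score near 1.0 is evidence of memorization rather than "pattern learning."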

3. Leaked Datasets

Investigations revealed that training datasets contained:

LAION-5B (used in training):

  • 240 million anime images without license
  • 18 million video game screenshots
  • 3.2 million frames from Japanese films

CommonCrawl (text base):

  • Complete game FAQs
  • Fandom wikis with protected content
  • Cutscene transcriptions

Legal Implications: Fair Use vs Copyright Infringement

OpenAI's defense is based on "fair use," but this is questionable:

Analysis of the 4 Fair Use Factors (US Law)

1. Purpose and Character of Use

  • OpenAI argues: Transformative use to create new technology
  • JCA argues: Commercial use competing with original products

2. Nature of the Protected Work

  • Against OpenAI: Highly creative works (not factual)
  • Against OpenAI: Core of companies' commercial value

3. Amount and Substantiality

  • Against OpenAI: Entire datasets were used (not excerpts)
  • Against OpenAI: "Heart of the work" was copied

4. Effect on the Market

  • Against OpenAI: DALL-E 3 directly competes with licensed illustrators
  • Against OpenAI: GPT-4 can substitute official game guides

Likely outcome: Legal experts estimate 70-80% chance OpenAI loses in US court, and 90%+ in Japanese court (where fair use is far more restrictive).

Precedents: Other AI vs Copyright Cases

This is not the first legal battle:

Similar Cases:

| Case | Status | Expected Result |
|---|---|---|
| Getty Images vs Stability AI | Ongoing | $150M-$300M settlement estimated |
| Sarah Silverman vs OpenAI | Class action active | Evidence discovery in 2025 |
| New York Times vs Microsoft/OpenAI | Ongoing | Trial scheduled for 2026 |
| Authors Guild vs Google Books | Concluded (2015) | Google won (fair use accepted) |

Critical difference: Google Books didn't generate new content competing with authors. DALL-E/GPT generate outputs that directly compete with original creators.

Impact For AI Developers: What Changes

If OpenAI loses (most likely scenario), it affects all model developers:

1. Training Datasets

Before (status quo):

  • Massive scraping of internet without permission
  • "Train first, ask forgiveness later"
  • Datasets like LAION-5B, CommonCrawl freely available

After (if OpenAI loses):

  • Mandatory opt-in from copyright holders
  • Paid licensing for commercial datasets
  • Dataset audit before publishing models
  • Retroactive removal of infringing data

Estimated cost to legally train GPT-4:

| Item | Current Cost | Cost With Licensing |
|---|---|---|
| Compute (GPUs) | $100M | $100M |
| Text Data | ~$0 | $500M-$2B |
| Image Data | ~$0 | $200M-$800M |
| Total | $100M | $800M-$2.9B |

8x-29x increase in training costs!
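The multipliers follow directly from the line items above:

```python
# Rough training-cost arithmetic from the table (all figures USD)
compute = 100e6                        # GPUs, unchanged in both scenarios
text_low, text_high = 500e6, 2e9       # estimated text licensing
image_low, image_high = 200e6, 800e6   # estimated image licensing

low_multiplier = (compute + text_low + image_low) / compute
high_multiplier = (compute + text_high + image_high) / compute
print(low_multiplier, high_multiplier)  # 8.0 29.0
```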

2. Alternative Architectures

Developers will need to explore approaches not dependent on protected data:

Legally Safer Techniques:

A) Synthetic Data Generation

Generate synthetic data that doesn't infringe copyright:

# Example: Generate synthetic data for training
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Generate a synthetic dataset that mimics a statistical distribution
# but copies no real content
X_synthetic, y_synthetic = make_classification(
    n_samples=1_000_000,  # 1M examples
    n_features=512,       # feature dimension
    n_informative=256,    # relevant features
    n_classes=1000,       # classes (e.g., art styles)
    random_state=42,
)

# Train a model on synthetic data only
model = SGDClassifier()
model.fit(X_synthetic, y_synthetic)

Limitation: Lower performance than models trained on real data.

B) Federated Learning

Train without centralizing data:

# Example: Conceptual Federated Learning (FedAvg) with a linear model
import numpy as np

class FederatedTrainer:
    def __init__(self, global_weights):
        # Shared model parameters; raw training data never enters this class
        self.global_weights = global_weights

    def train_round(self, clients_data, lr=0.1):
        client_gradients = []
        for client_id, (X, y) in clients_data.items():
            # Each client computes its gradient locally (data stays on device)
            preds = X @ self.global_weights
            grad = X.T @ (preds - y) / len(y)  # mean-squared-error gradient

            # Only the gradient is sent to the server, never the data
            client_gradients.append(grad)

        # Aggregate gradients from all clients
        aggregated = self.aggregate_gradients(client_gradients)

        # Update the global model
        self.global_weights -= lr * aggregated

    def aggregate_gradients(self, client_gradients):
        # FedAvg: simple average of client gradients
        return np.mean(client_gradients, axis=0)

Advantage: Data remains with original holders, eliminating copyright issue.

C) Transfer Learning with Licensed Models

Start from licensed base models:

Models with Clear Commercial Licenses:

| Model | License | Commercial Cost | Train on Own Data |
|---|---|---|---|
| LLaMA 2 | LLaMA License | Free up to 700M users | ✅ Yes |
| Mistral | Apache 2.0 | Always free | ✅ Yes |
| Falcon | Apache 2.0 | Always free | ✅ Yes |
| BLOOM | RAIL License | Free (with ethical restrictions) | ✅ Yes |
| GPT-3.5/4 API | OpenAI ToS | Pay-per-token | ❌ No (limited fine-tuning) |
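One pattern this enables: freeze a licensed base model and train only a new head on data you own the rights to. A minimal PyTorch sketch; the tiny `nn.Sequential` base here is a stand-in for a real licensed checkpoint such as LLaMA 2 or Mistral:

```python
import torch
import torch.nn as nn

# Stand-in for a licensed base model (a real checkpoint in practice)
base = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 8, 128))

# Freeze the licensed base: its weights are used, never retrained
for p in base.parameters():
    p.requires_grad = False

# New task head trained exclusively on data with cleared rights
head = nn.Linear(128, 10)
model = nn.Sequential(base, head)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randint(0, 1000, (4, 8))  # batch of 4 sequences of 8 token ids (own data)
y = torch.randint(0, 10, (4,))      # task labels
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

Because only the head receives gradients, the training signal comes entirely from your own data.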

OpenAI's Position and Possible Resolutions

OpenAI publicly responded to the JCA letter:

OpenAI's Official Response (summary):

"Our use of publicly available data for training constitutes fair use under American law. We respect copyrights and offer tools for creators to remove content from future training."

Tools mentioned:

  1. Robots.txt compliance: Sites can block OpenAI crawlers
  2. Opt-out form: Creators can request removal from future datasets
  3. Content moderation: Filters prevent outputs that copy existing works
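The robots.txt mechanism can be checked with Python's standard library. A quick sketch of whether OpenAI's GPTBot crawler would be blocked by a site's rules (the domain and rules below are illustrative):

```python
from urllib.robotparser import RobotFileParser

# robots.txt rules a site might publish to block OpenAI's crawler
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/artwork.png"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/artwork.png"))  # True
```

As with any opt-out, this only governs future crawls, not data already used in training.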

Problems with this defense:

  • Robots.txt is not retroactive: blocking crawlers doesn't remove data already used in training
  • Opt-out instead of opt-in: the burden falls on creators, when critics argue it should be the reverse
  • Filters are imperfect: jailbreak prompts can still extract protected content

Possible Resolutions

Scenario 1: Financial Settlement (50% chance)

  • OpenAI pays $500M-$2B to JCA
  • Retroactive license for data use until 2025
  • Mandatory opt-in for future datasets
  • Continuous royalties (e.g., 2% of OpenAI revenue)

Scenario 2: Judgment Favoring OpenAI (20% chance)

  • Court decides fair use applies
  • Precedent allows training AIs on public data
  • AI industry continues as is

Scenario 3: Judgment Favoring Creators (30% chance)

  • OpenAI forced to retrain models without infringing data
  • Fine of $5B-$15B in damages
  • AI industry enters dataset crisis

What Developers Should Do Now

If you develop or use AI models, take precautions:

1. Audit Your Datasets

Check the origin of training data:

Audit checklist:

  • Does dataset have clear commercial license?
  • Did content creators give explicit permission?
  • Can you document the origin of each example?
  • Does dataset contain known protected works?
  • Do you have capital to defend a lawsuit if sued?

Audit tools:

# Example: Detect potentially protected content
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def check_copyright_similarity(image, known_copyrighted_images, threshold=0.95):
    """
    Compare a dataset image against a bank of known protected images.
    """
    with torch.no_grad():
        # Embed the input image
        inputs = processor(images=image, return_tensors="pt")
        image_features = model.get_image_features(**inputs)

        # Compare with each known protected image
        for copyrighted_img in known_copyrighted_images:
            protected_inputs = processor(images=copyrighted_img, return_tensors="pt")
            protected_features = model.get_image_features(**protected_inputs)

            # Cosine similarity between CLIP embeddings
            similarity = torch.nn.functional.cosine_similarity(
                image_features, protected_features
            ).item()

            if similarity > threshold:  # high threshold
                return True, f"Embedding similarity {similarity:.2f} with a protected work"

    return False, "Likely safe"

# Use to filter dataset

2. Use Data With Clear Licenses

Datasets with commercial licenses:

  • WikiMedia Commons: CC0, CC-BY (free images)
  • OpenImages (Google): CC-BY licensed, curated
  • The Pile (EleutherAI): Mixed (verify each subset)
  • C4 (Google): Filtered CommonCrawl (still legally uncertain)

3. Implement Opt-In From the Start

If you collect user data:

Example of explicit consent:

// Upload form with explicit opt-in
import { useState } from "react";

const UploadForm = () => {
  const [aiTrainingConsent, setAiTrainingConsent] = useState(false);

  const handleSubmit = (event) => {
    event.preventDefault();
    // Send the file to the backend only after explicit consent
  };

  return (
    <form onSubmit={handleSubmit}>
      <input type="file" name="image" />

      <label>
        <input
          type="checkbox"
          checked={aiTrainingConsent}
          onChange={(e) => setAiTrainingConsent(e.target.checked)}
        />
        I authorize the use of this image for AI model training.
        I understand the image may influence future model outputs.
      </label>

      <button disabled={!aiTrainingConsent}>
        Upload (consent required)
      </button>
    </form>
  );
};

Conclusion: The Future of AI Training

The dispute between JCA (Studio Ghibli, Bandai, Square Enix) and OpenAI represents an inflection point for the AI industry. For the first time, high-value copyright holders are coordinating to challenge AI training practices.

For developers, the message is clear: the era of "free scraping" may be ending. Investing in licensed datasets, synthetic data, and alternative architectures is no longer optional—it's strategic.

The outcome of this case will determine whether the next generation of AI models will cost $100 million or $3 billion to train. And that will determine who can compete in the future: only giant companies with massive capital, or also startups and independent developers.

If you're interested in ethical and legal issues in technology, I recommend checking out another article: How Brazil's Tech Industry Is Shaping Global AI Standards where you'll discover how policy changes are reshaping AI development.

Let's go! 🦅
