Wikipedia Signs Agreements with Microsoft, Meta, Amazon, Perplexity and Mistral for AI Licensing

Hello HaWkers, important news about the AI data ecosystem emerged this week. The Wikimedia Foundation announced licensing agreements with Microsoft, Meta, Perplexity, Amazon, and Mistral for using Wikipedia data in AI model training.

This marks a significant change in how public data is monetized for AI.

What Was Announced

The Licensing Agreements

The Wikimedia Foundation, the nonprofit organization behind Wikipedia, signed agreements with five major technology companies.

Participating companies:

  • Microsoft
  • Meta
  • Perplexity
  • Amazon
  • Mistral

Known details:

  • Financial terms were not publicly disclosed
  • Agreements include structured data access
  • Wikipedia attribution will be required
  • Part of the funds goes to Wikimedia projects

Why This Is Important

The Wikipedia Data Context

Wikipedia is one of the largest sources of structured knowledge on the internet and has been widely used to train AI models.

Wikipedia scale:

  • 60+ million articles
  • 300+ languages
  • 100+ billion pageviews per year
  • One of the top 10 data sources for LLMs

AI use before the agreements:

  • Large-scale scraping without formal permission
  • Data used in virtually all LLMs
  • No compensation to Wikimedia
  • Inconsistent attribution

The Paradigm Shift

These agreements represent an evolution in the relationship between data sources and AI companies:

Before:

  • Free scraping from public sources
  • No compensation
  • Optional attribution
  • Unrestricted use

After:

  • Formal licensing
  • Financial compensation
  • Mandatory attribution
  • Specific terms of use

Agreement Details

What Companies Get

With the agreements, participating companies receive:

Benefits:

  1. Structured access: Dedicated and optimized API
  2. Clean data: Standardized format for training
  3. Updates: Access to new content
  4. Legitimacy: Formally authorized use
  5. Metadata: Information about sources and edits

What Wikimedia Gets

The foundation receives:

In return:

  • Financial compensation (amounts not disclosed)
  • Mandatory attribution in products
  • Investment in Wikipedia infrastructure
  • Collaboration on knowledge projects

Who Is Not in the Agreement

Notably, some important companies were not mentioned:

Absent:

  • OpenAI
  • Google
  • Anthropic
  • Apple

Possible reasons:

  • Ongoing negotiations
  • Disagreement about terms
  • Already have separate agreements
  • Prefer traditional scraping

Impact For Developers

Wikipedia Data Access

If you develop applications that use Wikipedia data, understand the options:

Access options:

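| Method              | Legality  | Cost | Quality   |
|---------------------|-----------|------|-----------|
| Public API          | Allowed   | Free | Good      |
| Public dumps        | Allowed   | Free | Excellent |
| Direct scraping     | Gray area | Free | Variable  |
| Corporate agreement | Formal    | Paid | Premium   |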

Recommendations for developers:

For most cases, the public API or dumps are still valid:

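// Example: fetch the intro section of an article through the Wikipedia API
async function getWikipediaContent(title: string): Promise<string> {
  const params = new URLSearchParams({
    action: 'query',
    titles: title,
    prop: 'extracts',
    exintro: 'true', // intro only; the extract is HTML unless explaintext is also set
    format: 'json',
    origin: '*'      // allows anonymous cross-origin requests from a browser
  });

  const response = await fetch(
    `https://en.wikipedia.org/w/api.php?${params}`
  );
  if (!response.ok) {
    throw new Error(`Wikipedia API request failed: ${response.status}`);
  }

  const data = await response.json();
  // Pages are keyed by page ID; a missing article has no extract field
  const pages = data?.query?.pages ?? {};
  const pageId = Object.keys(pages)[0];

  return pageId ? (pages[pageId].extract ?? '') : '';
}

// Usage
getWikipediaContent('JavaScript')
  .then((content) => console.log(content))
  .catch((error) => console.error(error));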

For Model Training

If you're training AI models, consider:

Legitimate options:

  1. Public dumps: Available for download
  2. Formal agreement: For commercial use at scale
  3. Alternative sources: Other wikis and datasets

# Download Wikipedia dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
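Once decompressed, the dump is one large XML file made of `<page>` elements, each with a `<title>` and the article's `<text>`. As a rough sketch of how you might pull articles out of a chunk of that XML, the helper below uses regular expressions to stay dependency-free. The names `DumpPage`, `decodeXmlEntities`, and `extractPages` are made up for this example, and it assumes each chunk contains whole `<page>` elements. For the full multi-gigabyte file, a streaming XML parser is the safer choice.

```typescript
// Hypothetical helper types and functions, for illustration only.
interface DumpPage {
  title: string;
  text: string;
}

// Decode the basic XML entities used in the dump.
// &amp; is decoded last so that "&amp;lt;" becomes "&lt;" and not "<".
function decodeXmlEntities(value: string): string {
  return value
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');
}

// Assumption: xmlChunk holds complete <page>...</page> elements.
function extractPages(xmlChunk: string): DumpPage[] {
  const pages: DumpPage[] = [];
  const pagePattern = /<page>([\s\S]*?)<\/page>/g;
  let match: RegExpExecArray | null;

  while ((match = pagePattern.exec(xmlChunk)) !== null) {
    const body = match[1];
    const title = /<title>([\s\S]*?)<\/title>/.exec(body);
    const text = /<text[^>]*>([\s\S]*?)<\/text>/.exec(body);

    // Skip anything without a title; pages with no text get an empty string
    if (title) {
      pages.push({
        title: decodeXmlEntities(title[1]),
        text: text ? decodeXmlEntities(text[1]) : ''
      });
    }
  }

  return pages;
}
```

The article text is still raw wikitext at this point, so templates and link markup need a separate cleaning step before training.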

The Licensing Model

How It Works

The agreements establish a model where:

Likely structure:

  1. Company pays fee (fixed or per use)
  2. Receives access to premium API
  3. Data comes pre-processed
  4. Attribution is mandatory in products
  5. Terms restrict certain uses

Estimated Values

Although the amounts were not disclosed, we can estimate them from similar agreements:

Market estimates:

  • Reddit-Google: ~$60 million/year
  • Stack Overflow-OpenAI: ~$20 million/year
  • News outlets-OpenAI: $5-50 million/year each

Likely ranges for the Wikipedia agreements:

  • Microsoft: $10-30 million/year (estimated)
  • Meta: $10-20 million/year (estimated)
  • Others: $5-15 million/year each (estimated)
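To see what these guesses add up to, here is a small sketch that sums the per-company ranges. The figures are the estimates from this article, not disclosed values, and "Others" is assumed to mean the remaining three companies (Perplexity, Amazon, and Mistral). The names `CompanyEstimate` and `sumRanges` are made up for the example.

```typescript
// Ranges in millions of USD per year. These are guesses, not disclosed figures.
interface CompanyEstimate {
  company: string;
  low: number;
  high: number;
}

const estimates: CompanyEstimate[] = [
  { company: 'Microsoft', low: 10, high: 30 },
  { company: 'Meta', low: 10, high: 20 },
  // Assumption: "Others" covers the three remaining signatories
  { company: 'Perplexity', low: 5, high: 15 },
  { company: 'Amazon', low: 5, high: 15 },
  { company: 'Mistral', low: 5, high: 15 }
];

// Add up the low ends and the high ends separately
function sumRanges(items: CompanyEstimate[]): { low: number; high: number } {
  return items.reduce(
    (total, item) => ({
      low: total.low + item.low,
      high: total.high + item.high
    }),
    { low: 0, high: 0 }
  );
}

const combined = sumRanges(estimates);
// combined is { low: 35, high: 95 }, i.e. roughly $35-95 million per year
```

The combined range is wide because every input is a guess, so treat it as an order of magnitude rather than a forecast.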

Sustainability For Wikimedia

These agreements may represent a significant new revenue source:

Wikimedia finances (before):

  • Annual revenue: ~$150 million
  • Main source: Donations
  • Dependency: Fundraising campaigns

With AI agreements:

  • Potential additional revenue: roughly $35-95 million/year, if the per-company estimates above hold
  • Source diversification
  • Less pressure on donations

Ecosystem Implications

Precedent For Other Sources

The Wikipedia agreements may inspire other data sources to do the same:

Who might follow:

  • Stack Overflow (already has agreements)
  • Reddit (already has agreement with Google)
  • GitHub (already owned by Microsoft)
  • Specialized forums
  • News sites
  • Tech blogs

The Future of Open Knowledge

A tension emerges between:

Open knowledge:

  • Wikipedia is free to read
  • Anyone can edit
  • Mission to disseminate knowledge
  • Nonprofit

Monetization for AI:

  • Companies profit from data
  • Compensation for maintenance
  • Financial sustainability
  • Possible restrictive terms

Open questions:

  • Do agreements affect the mission?
  • Does data remain public?
  • Do volunteers feel valued?
  • Will quality be maintained?

What It Means For Users

Visible Changes

AI users may notice:

Product impact:

  • More attribution to Wikipedia
  • Possibly links to articles
  • Higher quality of factual information
  • Better source citation

Attribution Example

AI products may include an attribution such as:

Response based on information from Wikipedia
Source: https://en.wikipedia.org/wiki/JavaScript
Last updated: January 2026
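If you want to show this kind of attribution in your own application, a small formatter is enough. The sketch below is only an illustration: the `WikipediaAttribution` shape and the `articleUrl` and `formatAttribution` helpers are hypothetical names, and the three-line layout simply mirrors the example above rather than any format required by the agreements.

```typescript
// Hypothetical shape for the data needed to credit an article
interface WikipediaAttribution {
  title: string;      // article title, e.g. "JavaScript"
  language: string;   // language subdomain, e.g. "en"
  lastUpdated: Date;  // when the article content was last refreshed
}

const MONTHS = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December'
];

// Wikipedia article URLs use underscores instead of spaces
function articleUrl(title: string, language: string): string {
  const slug = encodeURIComponent(title.trim().replace(/ /g, '_'));
  return `https://${language}.wikipedia.org/wiki/${slug}`;
}

// Build the three-line attribution block; dates are read in UTC
function formatAttribution(attribution: WikipediaAttribution): string {
  const month = MONTHS[attribution.lastUpdated.getUTCMonth()];
  const year = attribution.lastUpdated.getUTCFullYear();

  return [
    'Response based on information from Wikipedia',
    `Source: ${articleUrl(attribution.title, attribution.language)}`,
    `Last updated: ${month} ${year}`
  ].join('\n');
}
```

Calling `formatAttribution({ title: 'JavaScript', language: 'en', lastUpdated: new Date(Date.UTC(2026, 0, 15)) })` produces the same three lines as the example above.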

Community Reactions

Wikipedia Volunteers

The volunteer editor community is divided:

In favor:

  • Financial sustainability
  • Recognition of work
  • Infrastructure investment
  • Wikipedia visibility

Against:

  • "Selling" volunteer work
  • Companies making billions
  • Insufficient compensation
  • Potential conflict of interest

AI Companies

Company reactions:

Positive:

  • Legitimacy in data use
  • Structured and updated access
  • Lower legal risk
  • Formalized relationship

Concerns:

  • Additional costs
  • Competitors may have same data
  • Usage restrictions
  • Precedent for other sources to ask for payment

Trends For 2026-2027

The AI Data Market

We're seeing the formation of a new market:

Emerging characteristics:

  1. Licensing as standard: Formal agreements becoming the norm
  2. Established prices: Market defining values
  3. Intermediaries: Licensing platforms emerging
  4. Regulation: Governments may intervene
  5. Consolidation: Large players dominating

Impact on Developers

For startups:

  • Data costs increasing
  • Higher entry barriers
  • Importance of proprietary data
  • Business models affected

For large companies:

  • Competitive advantage through agreements
  • Licensing fees treated as a cost of doing business
  • Source diversification
  • Investment in own data

Recommendations

For Developers

  1. Document sources: Know where each dataset comes from
  2. Use official APIs: Avoid scraping when possible
  3. Consider licensing: For significant commercial use
  4. Track changes: Terms may change
  5. Invest in own data: Reduce external dependency
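One way to put the "document sources" recommendation into practice is to keep a small provenance record for every dataset. The sketch below is an assumption about what such a record could look like, not a standard: the field names, the `sourcesNeedingReview` helper, the retrieval dates, and the forum entry are all invented for illustration.

```typescript
// Hypothetical provenance record; field names are illustrative only
type AccessMethod = 'public-api' | 'public-dump' | 'scraping' | 'licensed';

interface DatasetSource {
  name: string;
  url: string;
  method: AccessMethod;
  license: string | null; // null when the license has not been confirmed
  retrievedAt: string;    // ISO 8601 date of download
  attributionRequired: boolean;
}

// Flag sources that deserve a closer look before commercial use:
// anything scraped, or anything without a confirmed license
function sourcesNeedingReview(sources: DatasetSource[]): DatasetSource[] {
  return sources.filter(
    (source) => source.method === 'scraping' || source.license === null
  );
}

const sources: DatasetSource[] = [
  {
    name: 'English Wikipedia dump',
    url: 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
    method: 'public-dump',
    license: 'CC BY-SA 4.0',
    retrievedAt: '2026-01-20', // made-up date
    attributionRequired: true
  },
  {
    name: 'Forum crawl (example)',
    url: 'https://forum.example.com', // placeholder, not a real source
    method: 'scraping',
    license: null,
    retrievedAt: '2026-01-22', // made-up date
    attributionRequired: false
  }
];

const flagged = sourcesNeedingReview(sources).map((source) => source.name);
// flagged contains only 'Forum crawl (example)'
```

Keeping the record next to the data makes it easier to react when a source changes its terms.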

For Companies

  1. Evaluate needs: Decide whether you need a formal agreement
  2. Budget for data: Include in planning
  3. Diversify sources: Don't depend on just one
  4. Monitor regulation: Laws may change
  5. Consider contributing: Give back to open communities

Conclusion

Wikipedia's agreements with Microsoft, Meta, Amazon, Perplexity, and Mistral represent an important milestone in formalizing data use for AI training. This creates a model that will likely be followed by other data sources.

Key points:

  1. Wikipedia signed licensing agreements with five technology companies
  2. Model includes payment and mandatory attribution
  3. OpenAI, Google, and Anthropic are not in the agreements
  4. Public API and dumps remain available
  5. Precedent may affect the entire data ecosystem

For developers, it's important to understand this new landscape and to plan data strategies that account for rising costs and stricter legitimacy requirements.

To learn more about how AI is changing, read: Project to Poison Web Crawlers.

Let's go! 🦅
