Wikipedia Signs Agreements with Microsoft, Meta, Amazon, Perplexity, and Mistral for AI Licensing

Hello HaWkers, important news about the AI data ecosystem emerged this week. The Wikimedia Foundation announced licensing agreements with Microsoft, Meta, Perplexity, Amazon, and Mistral for using Wikipedia data in AI model training.

This marks a significant change in how public data is monetized for AI.

What Was Announced

The Licensing Agreements

The Wikimedia Foundation, the nonprofit organization behind Wikipedia, signed agreements with five major technology companies.

Participating companies:

Microsoft
Meta
Perplexity
Amazon
Mistral

Known details:

Values were not publicly disclosed
Agreements include structured data access
Wikipedia attribution will be required
Part of the funds goes to Wikimedia projects

Why This Is Important

The Wikipedia Data Context

Wikipedia is one of the largest sources of structured knowledge on the internet and has been widely used to train AI models.

Wikipedia scale:

60+ million articles
300+ languages
100+ billion pageviews per year
Widely cited as one of the most heavily used data sources for LLM training

AI use before agreements:

Large-scale scraping with no formal agreement in place
Data used in virtually all LLMs
No compensation to Wikimedia
Inconsistent attribution

The Paradigm Shift

These agreements represent an evolution in the relationship between data sources and AI companies:

Before:

Free scraping from public sources
No compensation
Optional attribution
Unrestricted use

After:

Formal licensing
Financial compensation
Mandatory attribution
Specific terms of use

Agreement Details

What Companies Get

With the agreements, participating companies receive:

Benefits:

Structured access: Dedicated and optimized API
Clean data: Standardized format for training
Updates: Access to new content
Legitimacy: Formally authorized use
Metadata: Information about sources and edits

What Wikimedia Gets

The foundation receives:

In return:

Financial compensation (values not disclosed)
Mandatory attribution in products
Investment in Wikipedia infrastructure
Collaboration on knowledge projects

Who Is Not in the Agreement

Notably, some important companies were not mentioned:

Absent:

OpenAI
Google (already a Wikimedia Enterprise customer since 2022, so not part of this announcement)
Anthropic
Apple

Possible reasons:

Ongoing negotiations
Disagreement about terms
Existing separate agreements
A preference for traditional scraping

Impact For Developers

Wikipedia Data Access

If you develop applications that use Wikipedia data, understand the options:

Access options:

Method	Legality	Cost	Quality
Public API	Allowed	Free	Good
Public Dumps	Allowed	Free	Excellent
Direct Scraping	Gray area	Free	Variable
Corporate Agreement	Formal	Paid	Premium

Recommendations for developers:

For most use cases, the public API and the public dumps remain valid options:

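// Example of using the Wikipedia API (MediaWiki Action API, extracts)
async function getWikipediaContent(title: string): Promise<string> {
  const params = new URLSearchParams({
    action: 'query',
    titles: title,
    prop: 'extracts',
    exintro: 'true',  // only the introduction section
    redirects: '1',   // follow redirects to the target article
    format: 'json',
    origin: '*'       // allow anonymous cross-origin requests from browsers
  });

  const response = await fetch(
    `https://en.wikipedia.org/w/api.php?${params}`
  );

  // Fail loudly on HTTP errors instead of trying to parse an error page
  if (!response.ok) {
    throw new Error(`Wikipedia API request failed: ${response.status}`);
  }

  const data = await response.json();
  // Pages are keyed by page ID; a missing title comes back under the key "-1"
  const pages = data.query?.pages ?? {};
  const pageId = Object.keys(pages)[0];

  // Without the explaintext parameter, the extract is returned as HTML
  return pages[pageId]?.extract ?? '';
}

// Usage (top-level await requires an ES module)
const content = await getWikipediaContent('JavaScript');
console.log(content);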

For Model Training

If you're training AI models, consider:

Legitimate options:

Public dumps: Available for download
Formal agreement: For commercial use at scale
Alternative sources: Other wikis and datasets

# Download Wikipedia dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
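
If you need dumps for several language editions, the URL above can be generated instead of typed by hand. The sketch below assumes other editions follow the same URL pattern as the English one, and it only accepts simple two- or three-letter language codes; `buildDumpUrl` is a hypothetical helper name, not part of any Wikimedia tooling.

```typescript
// Hypothetical helper: builds the "latest" articles dump URL for one
// Wikipedia language edition, mirroring the enwiki URL shown above.
function buildDumpUrl(languageCode: string): string {
  // Assumption: only plain 2-3 letter codes (en, pt, de, ...) are handled.
  // Codes with hyphens or other characters are rejected rather than guessed.
  if (!/^[a-z]{2,3}$/.test(languageCode)) {
    throw new Error(`Unsupported language code: ${languageCode}`);
  }
  const wiki = `${languageCode}wiki`;
  return `https://dumps.wikimedia.org/${wiki}/latest/${wiki}-latest-pages-articles.xml.bz2`;
}

// Usage: one URL per edition, ready to pass to wget or a download script
const dumpUrls = ['en', 'pt', 'de'].map(buildDumpUrl);
console.log(dumpUrls.join('\n'));
```

Check that each generated URL exists before starting a large download, since dump file names can differ for some editions.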

The Licensing Model

How It Works

The agreements establish a model where:

Likely structure:

Company pays fee (fixed or per use)
Receives access to premium API
Data comes pre-processed
Attribution is mandatory in products
Terms restrict certain uses

Estimated Values

Although the values were not disclosed, figures reported or estimated for similar agreements give a rough basis for comparison:

Market estimates:

Reddit-Google: ~$60 million/year
Stack Overflow-OpenAI: ~$20 million/year
News outlets-OpenAI: $5-50 million/year each

For Wikipedia, a speculative guess:

Microsoft: $10-30 million/year (estimated)
Meta: $10-20 million/year (estimated)
Others: $5-15 million/year each (estimated)
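
To see what these per-company guesses add up to, the ranges can be summed. This is only arithmetic on the speculative figures above, not a disclosed total; it assumes "Others" means the three remaining companies (Perplexity, Amazon, and Mistral), and the type and function names are illustrative.

```typescript
// A range of estimated annual payments, in millions of US dollars
interface Range {
  low: number;
  high: number;
}

// Speculative per-company ranges taken from the list above
const microsoft: Range = { low: 10, high: 30 };
const meta: Range = { low: 10, high: 20 };
const perOtherCompany: Range = { low: 5, high: 15 };
const otherCompanyCount = 3; // assumption: Perplexity, Amazon, Mistral

// Adds two ranges bound by bound
function addRanges(a: Range, b: Range): Range {
  return { low: a.low + b.low, high: a.high + b.high };
}

// Multiplies both bounds of a range by a count
function scaleRange(range: Range, count: number): Range {
  return { low: range.low * count, high: range.high * count };
}

const total = addRanges(
  addRanges(microsoft, meta),
  scaleRange(perOtherCompany, otherCompanyCount)
);

console.log(`Combined estimate: $${total.low}-${total.high} million/year`);
// prints "Combined estimate: $35-95 million/year"
```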

Sustainability For Wikimedia

These agreements may represent a significant new revenue source:

Wikimedia finances (before):

Annual revenue: ~$150 million
Main source: Donations
Dependency: Fundraising campaigns

With AI agreements:

Potential additional revenue: roughly $35-95 million/year, summing the per-company estimates above
Source diversification
Less pressure on donations

Ecosystem Implications

Precedent For Other Sources

The Wikipedia agreement may inspire other data sources:

Who might follow:

Stack Overflow (already has agreements)
Reddit (already has agreement with Google)
GitHub (already owned by Microsoft)
Specialized forums
News sites
Tech blogs

The Future of Open Knowledge

A tension emerges between:

Open knowledge:

Wikipedia is free to read
Anyone can edit
Mission to disseminate knowledge
Nonprofit

Monetization for AI:

Companies profit from data
Compensation for maintenance
Financial sustainability
Possible restrictive terms

Open questions:

Do agreements affect the mission?
Does data remain public?
Do volunteers feel valued?
Will quality be maintained?

What It Means For Users

Visible Changes

AI users may notice:

Product impact:

More attribution to Wikipedia
Possibly links to articles
Higher quality of factual information
Better source citation

Attribution Example

AI models may include:

Response based on information from Wikipedia
Source: https://en.wikipedia.org/wiki/JavaScript
Last updated: January 2026
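
A product could generate a footer like this from a small source record. The exact attribution format required by the agreements is not public, so the sketch below simply reproduces the example above; `WikipediaSource` and `formatAttribution` are hypothetical names.

```typescript
// Hypothetical record describing the article a response was based on
interface WikipediaSource {
  title: string;      // article title, e.g. "JavaScript"
  language: string;   // language edition, e.g. "en"
  lastUpdated: Date;  // when the article content was last refreshed
}

const MONTH_NAMES = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December'
];

// Builds the public article URL: spaces become underscores, the rest is encoded
function articleUrl(source: WikipediaSource): string {
  const slug = encodeURIComponent(source.title.replace(/ /g, '_'));
  return `https://${source.language}.wikipedia.org/wiki/${slug}`;
}

// Formats the three-line attribution footer shown in the example above
function formatAttribution(source: WikipediaSource): string {
  const month = MONTH_NAMES[source.lastUpdated.getUTCMonth()];
  const year = source.lastUpdated.getUTCFullYear();
  return [
    'Response based on information from Wikipedia',
    `Source: ${articleUrl(source)}`,
    `Last updated: ${month} ${year}`
  ].join('\n');
}

// Usage
const attribution = formatAttribution({
  title: 'JavaScript',
  language: 'en',
  lastUpdated: new Date(Date.UTC(2026, 0, 15))
});
console.log(attribution);
```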

Community Reactions

Wikipedia Volunteers

The volunteer editor community has divided opinions:

In favor:

Financial sustainability
Recognition of work
Infrastructure investment
Wikipedia visibility

Against:

"Selling" volunteer work
Companies making billions
Insufficient compensation
Potential conflict of interest

AI Companies

Company reactions:

Positive:

Legitimacy in data use
Structured and updated access
Lower legal risk
Formalized relationship

Concerns:

Additional costs
Competitors may have same data
Usage restrictions
Precedent for other sources to ask for payment

Trends For 2026-2027

The AI Data Market

We're seeing the formation of a new market:

Emerging characteristics:

Licensing as standard: Formal agreements becoming the norm
Established prices: Market defining values
Intermediaries: Licensing platforms emerging
Regulation: Governments may intervene
Consolidation: Large players dominating

Impact on Developers

For startups:

Data costs increasing
Higher entry barriers
Importance of proprietary data
Business models affected

For large companies:

Competitive advantage through agreements
Licensing fees treated as a cost of doing business
Source diversification
Investment in own data

Recommendations

For Developers

Document sources: Know where each dataset comes from
Use official APIs: Avoid scraping when possible
Consider licensing: For significant commercial use
Track changes: Terms may change
Invest in own data: Reduce external dependency
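
The first recommendation, documenting sources, can start as a simple provenance record kept next to each dataset. The sketch below is one possible shape, not an established standard: the field names, the `needsReview` rule, and the forum entry are all illustrative assumptions.

```typescript
// How a dataset was obtained, matching the access options discussed earlier
type AccessMethod = 'public-api' | 'public-dump' | 'scraping' | 'licensed';

// Hypothetical provenance record for one dataset
interface DatasetSource {
  name: string;
  url: string;
  accessMethod: AccessMethod;
  license: string | null; // null when the license is unknown
  retrievedAt: string;    // ISO 8601 date
}

const sources: DatasetSource[] = [
  {
    name: 'English Wikipedia articles',
    url: 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
    accessMethod: 'public-dump',
    license: 'CC BY-SA 4.0',
    retrievedAt: '2026-01-20'
  },
  {
    // Made-up entry to show what gets flagged
    name: 'Example forum posts',
    url: 'https://forum.example.com',
    accessMethod: 'scraping',
    license: null,
    retrievedAt: '2026-01-22'
  }
];

// Assumed rule: flag anything scraped or lacking a known license
function needsReview(source: DatasetSource): boolean {
  return source.license === null || source.accessMethod === 'scraping';
}

const flagged = sources.filter(needsReview).map((source) => source.name);
console.log(flagged); // prints [ 'Example forum posts' ]
```

A record like this also makes it easier to track changes in terms, because each entry has a retrieval date and license to compare against.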

For Companies

Evaluate needs: Decide whether you need a formal agreement
Budget for data: Include in planning
Diversify sources: Don't depend on just one
Monitor regulation: Laws may change
Consider contributing: Give back to open communities

Conclusion

Wikipedia's agreements with Microsoft, Meta, Amazon, Perplexity, and Mistral represent an important milestone in formalizing data use for AI training. This creates a model that will likely be followed by other data sources.

Key points:

The Wikimedia Foundation signed licensing agreements with five technology companies
Model includes payment and mandatory attribution
OpenAI, Google, Anthropic, and Apple were not named in the announcement
Public API and dumps remain available
Precedent may affect the entire data ecosystem

Developers should understand this new landscape and plan their data strategies around rising data costs and stricter legitimacy requirements.

To learn more about how AI is changing, read: Project to Poison Web Crawlers.