Wikipedia Signs Agreements with Microsoft, Meta, Amazon, Perplexity and Mistral for AI Licensing
Hello HaWkers, important news about the AI data ecosystem emerged this week. The Wikimedia Foundation announced licensing agreements with Microsoft, Meta, Perplexity, Amazon, and Mistral for using Wikipedia data in AI model training.
This marks a significant change in how public data is monetized for AI.
What Was Announced
The Licensing Agreements
The Wikimedia Foundation, the nonprofit organization behind Wikipedia, signed agreements with five major technology companies.
Participating companies:
- Microsoft
- Meta
- Perplexity
- Amazon
- Mistral
Known details:
- Values were not publicly disclosed
- Agreements include structured data access
- Wikipedia attribution will be required
- Part of the funds goes to Wikimedia projects
Why This Is Important
The Wikipedia Data Context
Wikipedia is one of the largest sources of structured knowledge on the internet and has been widely used to train AI models.
Wikipedia scale:
- 60+ million articles
- 300+ languages
- 100+ billion pageviews per year
- One of the top 10 data sources for LLMs
AI use before agreements:
- Massive scraping without formal permission
- Data used in virtually all LLMs
- No compensation to Wikimedia
- Inconsistent attribution
The Paradigm Shift
These agreements represent an evolution in the relationship between data sources and AI companies:
Before:
- Free scraping from public sources
- No compensation
- Optional attribution
- Unrestricted use
After:
- Formal licensing
- Financial compensation
- Mandatory attribution
- Specific terms of use
Agreement Details
What Companies Get
With the agreements, participating companies receive:
Benefits:
- Structured access: Dedicated and optimized API
- Clean data: Standardized format for training
- Updates: Access to new content
- Legitimacy: Formally authorized use
- Metadata: Information about sources and edits
What Wikimedia Gets
The foundation receives:
In return:
- Financial compensation (values not disclosed)
- Mandatory attribution in products
- Investment in Wikipedia infrastructure
- Collaboration on knowledge projects
Who Is Not in the Agreement
Notably, some important companies were not mentioned:
Absent:
- OpenAI
- Anthropic
- Apple
Possible reasons:
- Ongoing negotiations
- Disagreement about terms
- Already have separate agreements
- Prefer traditional scraping
Impact For Developers
Wikipedia Data Access
If you develop applications that use Wikipedia data, understand the options:
Access options:
| Method | Legality | Cost | Quality |
|---|---|---|---|
| Public API | Allowed | Free | Good |
| Public Dumps | Allowed | Free | Excellent |
| Direct Scraping | Gray area | Free | Variable |
| Corporate Agreement | Formal | Paid | Premium |
Recommendations for developers:
For most cases, the public API or dumps are still valid:
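```typescript
// Example of using the Wikipedia API (MediaWiki Action API, `extracts` property)
async function getWikipediaContent(title: string): Promise<string> {
  const params = new URLSearchParams({
    action: 'query',
    titles: title,
    prop: 'extracts',
    exintro: 'true', // only the intro section; the extract is returned as HTML
    format: 'json',
    origin: '*' // allows anonymous cross-origin requests from the browser
  });

  const response = await fetch(
    `https://en.wikipedia.org/w/api.php?${params}`
  );
  if (!response.ok) {
    throw new Error(`Wikipedia API request failed: ${response.status}`);
  }

  const data = await response.json();
  // Results are keyed by page ID; a missing article has no `extract` field
  const pages = data?.query?.pages ?? {};
  const pageId = Object.keys(pages)[0];
  return pageId ? pages[pageId].extract ?? '' : '';
}

// Usage (top-level await requires an ES module context)
const content = await getWikipediaContent('JavaScript');
console.log(content);
```

For Model Training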
If you're training AI models, consider:
Legitimate options:
- Public dumps: Available for download
- Formal agreement: For commercial use at scale
- Alternative sources: Other wikis and datasets
```bash
# Download the latest English Wikipedia articles dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
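Once a dump is decompressed, a first processing step is usually to separate encyclopedia articles from talk pages, templates, and other namespaces. The sketch below is one possible approach, not an official tool. It assumes each `<title>` and `<ns>` element sits on its own line, with the title first, which is how the pages-articles export is normally laid out. The names `extractPageHeaders` and `articleTitles` are hypothetical helpers introduced for illustration, and a production pipeline would more likely use a streaming XML parser.

```typescript
// Minimal sketch: pull main-namespace article titles out of dump XML lines.
// Assumption: <title> and <ns> each appear on their own line, title first.

interface PageHeader {
  title: string;
  namespace: number;
}

// Decode the five predefined XML entities that can appear in titles.
function decodeXmlEntities(text: string): string {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // decoded last so "&amp;lt;" becomes "&lt;", not "<"
}

function extractPageHeaders(lines: string[]): PageHeader[] {
  const headers: PageHeader[] = [];
  let pendingTitle: string | null = null;

  for (const line of lines) {
    const titleMatch = line.match(/<title>(.*?)<\/title>/);
    if (titleMatch) {
      pendingTitle = decodeXmlEntities(titleMatch[1]);
      continue;
    }
    const nsMatch = line.match(/<ns>(-?\d+)<\/ns>/);
    if (nsMatch && pendingTitle !== null) {
      headers.push({ title: pendingTitle, namespace: Number(nsMatch[1]) });
      pendingTitle = null;
    }
  }
  return headers;
}

// Namespace 0 holds encyclopedia articles; other numbers are talk pages,
// templates, categories, and similar.
function articleTitles(lines: string[]): string[] {
  return extractPageHeaders(lines)
    .filter((page) => page.namespace === 0)
    .map((page) => page.title);
}
```

Taking an array of lines keeps the helper easy to test. For a full dump you would feed it lines in batches from a decompressed stream rather than loading the whole file into memory.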
The Licensing Model
How It Works
The agreements establish a model where:
Likely structure:
- Company pays fee (fixed or per use)
- Receives access to premium API
- Data comes pre-processed
- Attribution is mandatory in products
- Terms restrict certain uses
Estimated Values
Although the values were not disclosed, figures reported for similar agreements give a rough basis for estimation:
Market estimates:
- Reddit-Google: ~$60 million/year
- Stack Overflow-OpenAI: ~$20 million/year
- News outlets-OpenAI: $5-50 million/year each
Possible ranges for the Wikipedia agreements (speculative):
- Microsoft: $10-30 million/year (estimated)
- Meta: $10-20 million/year (estimated)
- Others: $5-15 million/year each (estimated)
Sustainability For Wikimedia
These agreements may represent a significant new revenue source:
Wikimedia finances (before):
- Annual revenue: ~$150 million
- Main source: Donations
- Dependency: Fundraising campaigns
With AI agreements:
- Potential additional revenue: $50-100 million/year
- Source diversification
- Less pressure on donations
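As a rough cross-check, the per-company guesses from the "Estimated Values" section can be added up and compared with the ~$150 million baseline. The sketch below only does that arithmetic. Every input is one of this article's speculative estimates rather than a disclosed figure, and the "Others" bracket is applied to Perplexity, Amazon, and Mistral.

```typescript
// All figures are in millions of USD per year and are this article's
// speculative estimates, not disclosed contract values.
interface EstimateRange {
  company: string;
  low: number;
  high: number;
}

const estimates: EstimateRange[] = [
  { company: 'Microsoft', low: 10, high: 30 },
  { company: 'Meta', low: 10, high: 20 },
  { company: 'Perplexity', low: 5, high: 15 }, // "Others" bracket
  { company: 'Amazon', low: 5, high: 15 },     // "Others" bracket
  { company: 'Mistral', low: 5, high: 15 },    // "Others" bracket
];

// Add the lower bounds together and the upper bounds together.
function sumRanges(ranges: EstimateRange[]): { low: number; high: number } {
  let low = 0;
  let high = 0;
  for (const range of ranges) {
    low += range.low;
    high += range.high;
  }
  return { low, high };
}

const totalEstimate = sumRanges(estimates);
// totalEstimate is { low: 35, high: 95 }

const baselineRevenue = 150; // ~$150 million annual revenue cited above
const shareOfBaseline = {
  low: Math.round((totalEstimate.low / baselineRevenue) * 100),
  high: Math.round((totalEstimate.high / baselineRevenue) * 100),
};
// shareOfBaseline is { low: 23, high: 63 } (percent of current revenue)
```

The summed range of $35-95 million overlaps with the $50-100 million figure above but sits a little lower, so that figure implicitly assumes the per-company guesses land in their upper half.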
Ecosystem Implications
Precedent For Other Sources
The Wikipedia agreement may inspire other data sources:
Who might follow:
- Stack Overflow (already has agreements)
- Reddit (already has agreement with Google)
- GitHub (already owned by Microsoft)
- Specialized forums
- News sites
- Tech blogs
The Future of Open Knowledge
A tension emerges between:
Open knowledge:
- Wikipedia is free to read
- Anyone can edit
- Mission to disseminate knowledge
- Nonprofit
Monetization for AI:
- Companies profit from data
- Compensation for maintenance
- Financial sustainability
- Possible restrictive terms
Open questions:
- Do agreements affect the mission?
- Does data remain public?
- Do volunteers feel valued?
- Will quality be maintained?
What It Means For Users
Visible Changes
AI users may notice:
Product impact:
- More attribution to Wikipedia
- Possibly links to articles
- Better factual accuracy
- Better source citation
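For developers who surface Wikipedia content in their own products, attribution like this can be generated from a small amount of metadata. The sketch below mirrors the three-line attribution format illustrated in this article. The exact attribution wording required by the agreements is not public, so the `WikipediaSource` shape and the `formatAttribution` helper are hypothetical and only show one way to do it.

```typescript
// Hypothetical metadata kept alongside any text derived from Wikipedia.
interface WikipediaSource {
  title: string;     // article title as displayed, e.g. "JavaScript"
  language: string;  // language subdomain, e.g. "en"
  lastUpdated: Date; // when the article content was last revised or retrieved
}

const MONTH_NAMES = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December',
];

function articleUrl(source: WikipediaSource): string {
  // Wikipedia URLs use underscores in place of spaces.
  const slug = encodeURIComponent(source.title.replace(/ /g, '_'));
  return `https://${source.language}.wikipedia.org/wiki/${slug}`;
}

function formatAttribution(source: WikipediaSource): string {
  // UTC getters keep the output independent of the server's time zone.
  const month = MONTH_NAMES[source.lastUpdated.getUTCMonth()];
  const year = source.lastUpdated.getUTCFullYear();
  return [
    'Response based on information from Wikipedia',
    `Source: ${articleUrl(source)}`,
    `Last updated: ${month} ${year}`,
  ].join('\n');
}
```

Keeping the URL builder separate from the formatter makes it easy to render the same source as a clickable link in a UI instead of plain text.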
Attribution Example
AI models may include:
```
Response based on information from Wikipedia
Source: https://en.wikipedia.org/wiki/JavaScript
Last updated: January 2026
```

Community Reactions
Wikipedia Volunteers
The volunteer editor community is divided:
In favor:
- Financial sustainability
- Recognition of work
- Infrastructure investment
- Wikipedia visibility
Against:
- "Selling" volunteer work
- Companies making billions
- Insufficient compensation
- Potential conflict of interest
AI Companies
Company reactions:
Positive:
- Legitimacy in data use
- Structured and updated access
- Lower legal risk
- Formalized relationship
Concerns:
- Additional costs
- Competitors may have same data
- Usage restrictions
- Precedent for other sources to ask for payment
Trends For 2026-2027
The AI Data Market
We're seeing the formation of a new market:
Emerging characteristics:
- Licensing as standard: Formal agreements becoming the norm
- Established prices: Market defining values
- Intermediaries: Licensing platforms emerging
- Regulation: Governments may intervene
- Consolidation: Large players dominating
Impact on Developers
For startups:
- Data costs increasing
- Higher entry barriers
- Importance of proprietary data
- Business models affected
For large companies:
- Competitive advantage through agreements
- Licensing fees treated as a cost of doing business
- Source diversification
- Investment in own data
Recommendations
For Developers
- Document sources: Know where each dataset comes from
- Use official APIs: Avoid scraping when possible
- Consider licensing: For significant commercial use
- Track changes: Terms may change
- Invest in own data: Reduce external dependency
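The first recommendation, documenting sources, can start as a small provenance manifest checked in next to the data. The sketch below is one possible shape for such a record, plus a simple audit that flags gaps. The field names, the `AccessMethod` categories (loosely following the access-options table earlier in this article), and the warning texts are all assumptions made for illustration.

```typescript
// Hypothetical provenance record for one dataset used in an application or model.
type AccessMethod = 'public-api' | 'public-dump' | 'scraping' | 'licensed';

interface DatasetSource {
  name: string;
  origin: string;           // URL or description of where the data came from
  accessMethod: AccessMethod;
  retrievedAt: string;      // ISO date of download or last sync
  license?: string;         // e.g. "CC BY-SA 4.0" for Wikipedia text
  attribution?: string;     // attribution text to show in products
  termsReviewedAt?: string; // ISO date the terms of use were last checked
}

// Return human-readable warnings for records with missing or risky fields.
function auditSources(sources: DatasetSource[]): string[] {
  const warnings: string[] = [];
  for (const source of sources) {
    if (!source.license) {
      warnings.push(`${source.name}: license not recorded`);
    }
    if (!source.attribution) {
      warnings.push(`${source.name}: attribution text missing`);
    }
    if (source.accessMethod === 'scraping') {
      warnings.push(`${source.name}: obtained by scraping; check for an official API or dump`);
    }
    if (!source.termsReviewedAt) {
      warnings.push(`${source.name}: terms of use never reviewed`);
    }
  }
  return warnings;
}
```

Running an audit like this in CI is one way to act on "track changes": a record with a stale or missing `termsReviewedAt` becomes a visible warning instead of a forgotten detail.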
For Companies
- Evaluate needs: Decide whether your use case requires a formal agreement
- Budget for data: Include in planning
- Diversify sources: Don't depend on just one
- Monitor regulation: Laws may change
- Consider contributing: Give back to open communities
Conclusion
Wikipedia's agreements with Microsoft, Meta, Amazon, Perplexity, and Mistral represent an important milestone in formalizing data use for AI training. This creates a model that will likely be followed by other data sources.
Key points:
- The Wikimedia Foundation signed licensing agreements with five technology companies
- Model includes payment and mandatory attribution
- OpenAI, Anthropic, and Apple are not named in the agreements
- Public API and dumps remain available
- Precedent may affect the entire data ecosystem
For developers, the practical takeaway is to understand this new landscape and to plan data strategies around rising costs and stricter legitimacy requirements.
To learn more about how AI is changing, read: Project to Poison Web Crawlers.

