AI bots are taking a toll on Wikipedia’s bandwidth, but the Wikimedia Foundation has rolled out a potential solution.
Bots often cause more trouble than the average human user, as they are more likely to scrape even the most obscure corners of Wikipedia. Bandwidth for downloading multimedia, for example, has grown 50% since January 2024, the foundation noted earlier this month. The traffic isn’t coming from human readers, however, but from automated programs constantly downloading openly licensed images to feed AI models.
To address the problem, the foundation teamed up with Kaggle, the Google-owned data science platform, to produce Wikipedia content “in a developer-friendly, machine-readable format” in English and French.
“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP [natural language processing] pipelines,” the foundation says.
Kaggle says the offering, currently in beta, is “immediately usable for modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.” AI developers using the dataset will get “high-utility elements” including article abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections.
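To illustrate why structured records beat raw article text for NLP work, here is a minimal Python sketch of consuming such a JSON record. The field names (`abstract`, `infobox`, `sections`, and so on) are assumptions modeled on the elements listed above, not Kaggle’s actual schema, and the sample record is invented for demonstration.

```python
import json

# Hypothetical JSON record, modeled on the "high-utility elements" the
# foundation describes: abstract, short description, infobox-style
# key-value data, image links, and segmented sections. Field names are
# illustrative assumptions, not the dataset's real schema.
sample = json.dumps({
    "title": "Ada Lovelace",
    "abstract": "English mathematician and writer.",
    "short_description": "English mathematician (1815-1852)",
    "infobox": {"born": "10 December 1815", "died": "27 November 1852"},
    "image_links": ["https://upload.wikimedia.org/example.jpg"],
    "sections": [
        {"heading": "Biography", "text": "..."},
        {"heading": "Work on the Analytical Engine", "text": "..."},
    ],
})

record = json.loads(sample)

# Pull out the pieces an NLP pipeline might need, with no HTML or
# wikitext parsing required.
abstract = record["abstract"]
headings = [s["heading"] for s in record["sections"]]
born = record["infobox"].get("born")

print(headings)  # ['Biography', 'Work on the Analytical Engine']
print(born)      # 10 December 1815
```

The point of the structure: each element is addressable by key, so a fine-tuning or benchmarking job can select exactly the fields it needs instead of scraping and cleaning article HTML.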
All the content is derived from Wikipedia and is freely licensed under the Creative Commons Attribution-ShareAlike 4.0 license and the GNU Free Documentation License (GFDL), though public domain or alternative licenses may apply in some cases.
We’ve seen organizations use less collaborative approaches to dealing with the threat of AI bots. Reddit introduced progressively stricter controls to stop bots from accessing the platform, after instituting a controversial change to its API policies in 2023 that forced devs to pay up.
Other organizations, such as The New York Times, have sued over AI scraping bots, though their motivations are financial rather than performance-related. The Times’ lawsuit alleges that ChatGPT maker OpenAI is liable for billions in damages because it scraped NYT articles to train its AI models without permission. Other publications have instead struck licensing deals with AI startups.
About Will McCurdy
Contributor