Wikipedia Tests New Way to Keep AI Bots Away, Preserve Bandwidth

AI bots are taking a toll on Wikipedia’s bandwidth, but the Wikimedia Foundation has rolled out a potential solution.

Bots often cause more trouble than the average human user, as they are more likely to scrape even the most obscure corners of Wikipedia. Bandwidth used for downloading multimedia, for example, has grown 50% since January 2024, the foundation noted earlier this month. That traffic isn’t coming from human readers, however, but from automated programs that constantly download openly licensed images to feed AI models.

To address the problem, the foundation teamed up with Kaggle, the Google-owned data science platform, to offer Wikipedia content “in a developer-friendly, machine-readable format” in English and French.

“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP [natural language processing] pipelines,” the foundation says.

Kaggle says the offering, currently in beta, is “immediately usable for modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.” AI developers using the dataset will get “high-utility elements” including article abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections.
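
For developers curious what that looks like in practice, here is a minimal Python sketch of streaming the structured dump. It assumes the beta dataset is distributed as JSON Lines files and uses illustrative field names (“name,” “abstract,” “sections”); the file name and schema below are assumptions, so check the dataset’s Kaggle listing for the actual layout.

```python
import json

def iter_articles(path):
    """Yield one parsed article object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# "enwiki_structured.jsonl" and the field names below are illustrative
# guesses, not the dataset's documented schema.
for article in iter_articles("enwiki_structured.jsonl"):
    title = article.get("name", "")
    abstract = article.get("abstract", "")
    sections = article.get("sections", [])
    print(f"{title}: {abstract[:80]} ({len(sections)} sections)")
```

Because each article arrives as a self-contained JSON object, a pipeline can stream the file line by line instead of re-scraping live pages, which is exactly the bandwidth-saving behavior the foundation is after.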

All of the content is derived from Wikipedia and is freely available under two open licenses: Creative Commons Attribution-ShareAlike 4.0 and the GNU Free Documentation License (GFDL), though public domain or alternative licenses may apply in some cases.

We’ve seen organizations take less collaborative approaches to the threat of AI bots. Reddit introduced progressively stricter controls to stop bots from accessing the platform after instituting a controversial change to its API policies in 2023 that forced developers to pay up.

Many other organizations, such as The New York Times, have sued over AI scraping bots, though their motivation is financial rather than performance-related. The Times’ lawsuit alleges that ChatGPT maker OpenAI is liable for billions in damages because it scraped the paper’s articles to train its AI models without permission. Other publications have instead struck deals with AI startups.


About Will McCurdy

I’m a reporter covering weekend news. Before joining PCMag in 2024, I picked up bylines in BBC News, The Guardian, The Times of London, The Daily Beast, Vice, Slate, Fast Company, The Evening Standard, The i, TechRadar, and Decrypt Media.

I’ve been a PC gamer since you had to install games from multiple CD-ROMs by hand. As a reporter, I’m passionate about the intersection of tech and human lives. I’ve covered everything from crypto scandals to the art world, as well as conspiracy theories, UK politics, and Russia and foreign affairs.
