Wikimedia, the organization behind Wikipedia, said its infrastructure is being taxed by non-human traffic scraping the site for data to train AI models.

With the rise of these data-collecting bots, Wikimedia's funding model is being turned upside down. Wikipedia content has long been a big part of search engine results, and that visibility drove traffic to the foundation's sites. AI has changed that equation and will challenge Wikimedia's ability to sustain itself.

In a post, Wikimedia said:

"Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone."

The Wikimedia experience with scraper bots collecting data to train AI models highlights another battle in the growing war over data access. With large language models (LLMs) having already absorbed much of the world's data, multiple issues are in play: infrastructure costs, API access, and establishing a compensation model.

For enterprises, there will be data issues too as they try to leverage first-party data and sometimes skirmish with vendors that want to control third-party access to their platforms. Agentic AI's biggest hurdle will be the standards and charges that enable agents from different platforms to communicate, negotiate, and carry out tasks. As AI develops, there is a risk that free content and data will disappear.

In fact, Wikimedia is paying more for infrastructure because scraper bots are bulk-downloading its openly licensed images. Wikimedia said its content is free, but its infrastructure isn't: 65 percent of its most expensive traffic comes from bots.

Wikimedia said it is working on an attribution system for automated traffic so it can offer tiers for high-volume scraping and API use. The foundation is also looking to reduce the amount of traffic generated by scrapers and the bandwidth it consumes.
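To make the tiering idea concrete, here is a minimal sketch of how a site might classify automated traffic and apply per-tier request quotas. Everything here is hypothetical: Wikimedia has not published its attribution design, and the tier names, limits, and User-Agent heuristic are invented for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical User-Agent heuristic for spotting automated clients.
BOT_PATTERN = re.compile(r"(bot|crawler|spider|scraper)", re.IGNORECASE)

# Invented per-tier request quotas (requests per accounting window).
TIER_LIMITS = {"human": 5000, "bot_free": 500, "bot_paid": 50000}


def classify(user_agent, api_key=None):
    """Assign a traffic tier from the User-Agent and an optional API key.

    A registered (paid) key moves a bot to the higher-quota tier; this
    stands in for the attribution step described in the article.
    """
    if BOT_PATTERN.search(user_agent):
        return "bot_paid" if api_key else "bot_free"
    return "human"


class RateLimiter:
    """Count requests per client and reject those over the tier quota."""

    def __init__(self):
        self.counts = defaultdict(int)

    def allow(self, client_id, user_agent, api_key=None):
        tier = classify(user_agent, api_key)
        self.counts[client_id] += 1
        return self.counts[client_id] <= TIER_LIMITS[tier]
```

In this sketch, attribution (the API key) is what buys a scraper its higher quota; unidentified bots fall into the most restricted tier, which mirrors the incentive structure a tiered access model would create.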