AI Website Scrapers Are Evolving at Alarming Rates

Can regulation play a part in the industry's future?

Websites are aware they're being scraped by AI bots. But can they keep up with the threat?

By Tobias Carroll

How much of the open internet is fair game for AI companies scraping the web to train their software? In recent weeks, that question has entered the spotlight with the news that Reddit has blocked most search engines (though, notably, not Google) from digging through its archive. There’s a science to configuring a website to keep it from being scraped, but according to a recent report, many sites simply can’t stop these AI bots because the crawlers are evolving faster than blocklists can be updated.
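
For context, the standard mechanism for keeping crawlers out is the Robots Exclusion Protocol: a robots.txt file at the site’s root that asks specific bots, identified by their user-agent strings, to stay away. A minimal sketch looks like the following (the agent names are real ones that OpenAI, Anthropic and Common Crawl have publicized, though any such list is a snapshot that can go stale):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

Crucially, compliance is voluntary; robots.txt is a request to crawlers, not an enforcement mechanism.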

Writing at 404 Media, Jason Koebler reported on an unfortunate phenomenon: websites taking steps to keep the AI company Anthropic (called one of the industry’s “hottest start-ups” in a New York Times headline earlier this year) from scraping their content are being thwarted, because the tactics that previously worked no longer apply to the software currently crawling the web. Per the report, many sites’ robots.txt files block agent names that Anthropic has since stopped using, while its current crawler operates under a different name.
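
Concretely, the failure mode looks something like this. The sketch below uses the agent names cited in the reporting around this story; rules written against the older anthropic-ai and claude-web agents don’t match ClaudeBot, the name Anthropic’s current crawler reports:

    # Widely copied rules targeting Anthropic agents that are
    # reportedly no longer in use:
    User-agent: anthropic-ai
    Disallow: /

    User-agent: claude-web
    Disallow: /

    # The crawler active at the time of the report identifies
    # itself as ClaudeBot, which the rules above don't match:
    User-agent: ClaudeBot
    Disallow: /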

A spokesperson for the online resource Dark Visitors told 404 Media that this is indicative of a larger problem: namely, that the technology involved is being updated at a breakneck pace. “The ecosystem of agents is changing quickly, so it’s basically impossible for website owners to manually keep up,” the Dark Visitors spokesperson said.
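
That observation suggests the obvious workaround: automate the upkeep. Below is a minimal Python sketch that regenerates robots.txt from a maintained list of AI agents. The feed URL is hypothetical, standing in for a tracking service such as Dark Visitors (whose actual API isn’t shown here):

    # Sketch: rebuild robots.txt from a maintained list of AI crawler
    # user agents, so newly launched bots get blocked without manual edits.
    # AGENT_LIST_URL is hypothetical -- substitute a real, trusted feed.
    import urllib.request

    AGENT_LIST_URL = "https://example.com/ai-agents.txt"  # hypothetical feed

    def build_robots_txt(agents):
        """Return a robots.txt body disallowing every listed agent."""
        blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
        return "\n\n".join(blocks) + "\n"

    with urllib.request.urlopen(AGENT_LIST_URL) as resp:
        lines = resp.read().decode().splitlines()

    agents = [line.strip() for line in lines if line.strip()]

    with open("robots.txt", "w") as f:
        f.write(build_robots_txt(agents))

Run on a schedule, something like this keeps a blocklist current without manual intervention, though it still depends on crawlers honoring robots.txt at all.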

Given the pace at which AI companies, search engines and other entities are now digging through content on the web, some website owners are being forced to reckon with the day-to-day operational costs of this level of scraping, putting the larger issue of stolen content on the back burner. Last week, for example, 404 Media also reported that Anthropic’s bot “hit iFixit’s website nearly a million times in a single day,” a volume of traffic that could degrade the site’s functionality.
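
When a single bot accounts for that much traffic, operators often reach for rate limiting at the web server rather than, or in addition to, robots.txt. Here’s a minimal nginx sketch; the directives are standard nginx modules, but the host name, agent names and limits are illustrative assumptions, not recommendations:

    # Throttle known AI crawlers to 1 request/second per client IP.
    # nginx skips rate limiting when the key evaluates to an empty
    # string, so ordinary visitors are unaffected.
    map $http_user_agent $ai_bot_key {
        default        "";
        ~*ClaudeBot    $binary_remote_addr;
        ~*GPTBot       $binary_remote_addr;
    }

    limit_req_zone $ai_bot_key zone=ai_bots:10m rate=1r/s;

    server {
        listen 80;
        server_name example.com;  # illustrative host

        location / {
            limit_req zone=ai_bots burst=5 nodelay;
            root /var/www/html;
        }
    }

This slows an abusive crawler down instead of blocking it outright, which keeps the site responsive even when a bot ignores robots.txt.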

It’s hard to imagine this being sustainable for much longer, and it raises the question of what regulation of scraper bots might look like. Earlier this summer, Reuters reported that some AI companies were bypassing robots.txt, the web’s existing standard for signaling what can and cannot be scraped. Expect to see this issue come to a head sooner rather than later.
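
That standard’s central weakness is that honoring it is left to the crawler. A well-behaved bot checks a site’s robots.txt before each fetch, as in the Python sketch below, which uses the standard library’s parser (the URL and agent name are illustrative); a bot intent on bypassing the standard simply skips this check:

    # Sketch: the check a compliant crawler performs before fetching a page.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # illustrative site
    rp.read()  # download and parse the rules

    url = "https://example.com/archive/page1"
    if rp.can_fetch("ExampleBot", url):  # illustrative agent name
        print("allowed: fetch the page")
    else:
        print("disallowed: skip it")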
