For many years, the standard method for managing web crawlers has been the robots.txt file. Dating back to 1994 as part of the Robots Exclusion Protocol (REP) on the nascent web, robots.txt is arguably misunderstood by those outside of tech circles. It’s not a security feature; it’s more of a ‘gentleman’s agreement,’ telling web crawlers which parts of a website they can access and which parts are off limits.
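To make the convention concrete, here is a minimal sketch of how a compliant crawler consults robots.txt using Python’s standard library. The site, paths, and user-agent string are placeholders, and the GPTBot directive in the comment is just an illustration of how individual bots can be singled out.

```python
from urllib import robotparser

# A typical robots.txt holds plain-text directives such as:
#
#   User-agent: *
#   Disallow: /private/
#
#   User-agent: GPTBot
#   Disallow: /
#
# A compliant crawler fetches the file and checks it before requesting pages.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # download and parse the file

if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/report.html"):
    print("robots.txt allows this URL for our user agent")
else:
    print("robots.txt asks us to stay away from this URL")
```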
The robots.txt file isn’t – and never was – perfect. Malicious actors knew that they could simply ignore its directives or find ways around them. Yet the role of robots.txt has come under greater scrutiny of late because of the AI boom. The race for data to train AI models has become a frenetic competition among major AI companies like Google and OpenAI as well as smaller bespoke AI businesses.
The Value of Data
All data is valuable in the race to create prescient models. You can argue that the data contained in a long-form paywalled New York Times article is of equal importance to what influencers are doing on Instagram or gaming trends on a Megaways slots platform. The point is that there is a rush to collect as much human knowledge as possible to train AI bots. Broadly speaking, many AI companies have been deploying web crawlers and ignoring established conventions like robots.txt files. And, as you’d expect, not everyone is happy about it.
The headlines, of course, focus on organizations whose business model depends on making money from data, i.e., media outlets. The aforementioned New York Times, for example, intends to sue OpenAI for “billions” for what it describes as egregious copyright infringement. The NYT certainly isn’t alone. It goes beyond news media, too, with legal challenges from everyone from booksellers to digital artists.
Cloudflare’s AI Labyrinth
There have been some interesting initiatives put in place to combat the web crawlers. Recently, Cloudflare, the prominent website infrastructure company, introduced its “AI Labyrinth,” making it available to its millions of customers. The AI Labyrinth works as a kind of honeypot trap, enticing web crawlers into a near-endless maze of generated web pages. To you or me, those pages would look like nonsense – pure gobbledygook – but they are like nectar to the web crawlers.
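Cloudflare has not published the internals of AI Labyrinth, so the sketch below only illustrates the general honeypot pattern the description implies: every request is answered with machine-generated filler whose links lead to nothing but more generated pages. The word list, paths, and the decision to trap every visitor are all invented for illustration, not Cloudflare’s code.

```python
import hashlib
import http.server
import random

def decoy_page(path: str) -> str:
    """Generate deterministic nonsense text plus links deeper into the maze."""
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    words = ["quantum", "lattice", "synergy", "orbital", "manifold", "cascade"]
    body = " ".join(rng.choice(words) for _ in range(200))
    links = "".join(
        f'<a href="/maze/{rng.randrange(10**9)}">more</a> ' for _ in range(5)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"

class MazeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A real deployment would decide here whether the client looks like an
        # unwanted crawler; this toy version traps every visitor in the maze.
        page = decoy_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

if __name__ == "__main__":
    http.server.HTTPServer(("localhost", 8080), MazeHandler).serve_forever()
```

Seeding the generator from the request path means each decoy URL always returns the same page, so the maze stays self-consistent without the server storing anything.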

Cloudflare’s approach is not the only solution on offer, nor is it the first attempt at blocking AI web crawlers. Traditional methods, ranging from firewalls to CAPTCHA verification to dynamic content loading, have been used to combat the problem, but nothing is 100% foolproof. Cloudflare’s solution has gained a lot of attention – the company’s services are used by around 33% of Fortune 500 companies – but it remains to be seen how effective the Labyrinth will be.
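As a small illustration of why those traditional, rule-based measures fall short, here is a sketch of user-agent filtering using Python’s built-in WSGI server. The blocklist entries are examples of publicly documented AI crawler names, and the whole approach is defeated the moment a crawler lies about its User-Agent header.

```python
from wsgiref.simple_server import make_server

# User-agent substrings associated with known AI crawlers. Any such list needs
# constant maintenance and is trivial to evade by spoofing the header.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "ClaudeBot")

def app(environ, start_response):
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Crawling not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitor.\n"]

if __name__ == "__main__":
    make_server("localhost", 8000, app).serve_forever()
```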
The Ethics of Web Crawling
In the end, the wider debate is about ethics. Some believe that any information posted on the web should be fair game, even if it is copyrighted or behind a paywall. Even if the New York Times (to stick with the same example) puts robust anti-crawler safeguards on its platforms, its articles usually end up republished elsewhere on third-party websites. Yet you can understand why these publications worry about how AI might impact their business models.
Our argument is simple: we are only seeing the tip of the iceberg. Listen to what people like OpenAI CEO Sam Altman are saying. Data – and the cost of acquiring it – is the most important factor in this so-called AI race. Data holders are becoming increasingly aware of that, and they want AI companies either to pay up or to stop scraping their copyrighted material. The outcome could be decided in the courts, of course, but you can be sure that companies will also adopt more technical measures to protect their data.