Sites Are Setting Traps for AI Crawlers
A tool called Nepenthes went viral in early 2025. It generates infinite mazes of fake web pages, each linking to more fake pages, designed to trap crawlers in a loop they can't escape. The text on those pages? Algorithmically generated gibberish, crafted to pollute AI training datasets with garbage.
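The core trick is simple enough to sketch. Here's a minimal, illustrative version of the idea as a small Flask app (not Nepenthes' actual code; it assumes Flask is installed): every path resolves to a stable page of filler text that links to more fake pages.

```python
# Minimal tarpit sketch -- illustrative only, not Nepenthes' actual code.
# Every URL resolves to a deterministic page of filler text that links to
# more fake pages, so a link-following crawler never runs out of work.
import hashlib
import random

from flask import Flask  # assumes Flask is installed

app = Flask(__name__)
WORDS = ["archive", "data", "entry", "index", "node", "portal", "record"]

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def maze(path):
    # Seed from the path so each fake page is stable across requests,
    # which makes the maze look like a real (if dull) site.
    rng = random.Random(hashlib.sha256(path.encode()).hexdigest())
    text = " ".join(rng.choices(WORDS, k=120))  # stand-in for Markov gibberish
    links = " ".join(
        f'<a href="/{rng.randrange(10**8):08d}">continue</a>' for _ in range(8)
    )
    return f"<html><body><p>{text}</p><p>{links}</p></body></html>"

if __name__ == "__main__":
    app.run()
```

Real tarpits typically go further, deliberately serving responses slowly so the crawler wastes wall-clock time as well as storage.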
Nepenthes isn't alone. Projects like Iocaine and a growing list of open-source "tarpits" have popped up on GitHub, each with the same pitch: if AI companies won't respect robots.txt, site owners will fight back with poison.
The motivation makes sense. An academic study on arXiv found that AI-blocking among reputable sites jumped from 23% in September 2023 to nearly 60% by May 2025. BuzzStream's analysis showed 79% of top news sites now block AI training bots via robots.txt. And Cloudflare Radar reported that 75% of AI-related web traffic in mid-2025 served model training, not search or inference.
But tarpits don't check credentials. They don't ask why you're crawling. They trap anything that looks automated.
Who's Actually Getting Trapped
The intended targets are obvious: GPTBot, ClaudeBot, the AI company crawlers collecting the open web for training data. The problem is that tarpits can't tell the difference between OpenAI's crawler and your price monitoring script.
Tarpits detect automated request patterns. If your scraper follows links systematically, hits pages at consistent intervals, or skips JavaScript execution (the way most AI training crawlers operate), it looks like a target. The trap doesn't care that you're a 10-person e-commerce team tracking competitor pricing. It sees bot-shaped traffic and starts serving fake pages.
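To make "bot-shaped traffic" concrete: one common signal is how regular a client's request timing is. Here's a toy version of that check, with an assumed threshold; real detectors combine many more signals.

```python
# Illustrative only: a toy version of an interval-regularity check.
# Real behavioral detectors combine many signals (JS execution, navigation
# depth, TLS fingerprint); this shows just the timing heuristic.
from statistics import mean, stdev

def looks_automated(request_times: list[float], cv_threshold: float = 0.1) -> bool:
    """Flag clients whose inter-request gaps are suspiciously regular."""
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    if len(gaps) < 5:
        return False  # too little traffic to judge
    avg = mean(gaps)
    if avg == 0:
        return True  # requests faster than the clock resolution
    cv = stdev(gaps) / avg  # coefficient of variation: low = machine-steady
    return cv < cv_threshold
```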
This isn't just theoretical. Research from Rutgers and Wharton found that sites blocking AI crawlers saw a 23.1% decline in total traffic and a 13.9% drop in human traffic. The aggressive blocking posture doesn't just stop AI scrapers. It hurts the site's own visibility too.
And tarpits go further: they actively waste a crawler's compute, storage, and bandwidth while feeding it data that degrades whatever model or database it's building.
The Escalation Ladder
Robots.txt was always a gentleman's agreement. It worked when everyone followed the rules. When major AI companies started ignoring it (or finding creative interpretations of "crawling for search" vs. "crawling for training"), site owners escalated.
The pattern looks like this:
- Robots.txt blocks: the polite request
- User-agent filtering: blocking known AI crawler signatures (a minimal check is sketched after this list)
- Behavioral detection: catching unknown crawlers by their request patterns
- Tarpits: active countermeasures that waste resources and poison data
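As one concrete rung on that ladder, step two often amounts to a substring check against known crawler signatures. A minimal sketch; the signature list here is illustrative and incomplete, and production deployments pull from maintained lists.

```python
# Hedged sketch of user-agent filtering (step two above). The signature
# list is illustrative; real deployments use maintained, updated lists.
AI_CRAWLER_SIGNATURES = ("GPTBot", "ClaudeBot", "CCBot", "Google-Extended")

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI crawler."""
    ua = user_agent.lower()
    return any(sig.lower() in ua for sig in AI_CRAWLER_SIGNATURES)
```

The weakness is obvious: a crawler that lies about its user agent sails through, which is exactly what pushes sites down to steps three and four.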
Each step catches more threats. Each step also catches more legitimate traffic. By step four, you're treating all automated access as hostile. So a scraper collecting publicly available product prices for a comparison service hits the same traps as GPTBot collecting data without permission.
What Data Teams Should Do Now
If you're running data collection at any scale, tarpits change the rules. Several things matter more than they used to.
Respect robots.txt, always. This sounds basic, but it's table stakes now. Sites use robots.txt as a first-pass filter. Ignore it, and you're putting yourself in the same category as the AI training bots that started this whole tarpit response.
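In Python, the standard library already handles the check. A minimal sketch, with a placeholder URL and user agent:

```python
# Table stakes in practice: consult robots.txt before fetching.
# The site URL and user agent below are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("PriceMonitorBot/1.0", "https://example.com/products/widget"):
    ...  # fetch the page
else:
    ...  # skip it and log the decision
```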
Don't look like a training crawler. AI training crawlers have predictable signatures: they follow every link, request pages in bulk, skip JavaScript, and maintain regular intervals. If your scraper does the same, behavioral detection will flag it. Vary your timing. Load only what you need. Execute JavaScript when the site requires it. We wrote about what causes scrapers to get blocked in Why Your Web Scraper Keeps Breaking.
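A sketch of what that looks like in practice: a fixed URL list and jittered delays instead of steady, exhaustive link-following. The URLs, user agent, and delay bounds are illustrative.

```python
# One way to avoid the training-crawler signature: fetch a narrow URL list
# at irregular intervals. All values here are illustrative.
import random
import time

import requests  # assumes the requests package is installed

URLS = [
    "https://example.com/products/1",
    "https://example.com/products/2",
]

for url in URLS:  # fetch only the pages you actually need
    resp = requests.get(
        url, headers={"User-Agent": "PriceMonitorBot/1.0"}, timeout=10
    )
    print(url, resp.status_code)  # stand-in for your real pipeline handler
    time.sleep(random.uniform(2.0, 9.0))  # irregular gaps, not a fixed interval
```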
Validate incoming data. Tarpits serve plausible-looking garbage. If you're not checking responses in your pipeline, you could be storing Markov-generated text as real product descriptions. Build validation as a core step, not an afterthought.
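What validation looks like depends on your schema. Here's a sketch for a hypothetical product record; the field names and thresholds are assumptions, and the point is to reject tarpit-shaped pages before they reach storage.

```python
# First-pass response validation -- field names and thresholds are assumed.
def validate_product(record: dict) -> bool:
    # Required fields must exist and parse.
    try:
        price = float(record["price"])
    except (KeyError, TypeError, ValueError):
        return False
    if not (0 < price < 100_000):  # sanity bound for this catalog
        return False
    desc = record.get("description", "")
    words = desc.split()
    # Markov-style gibberish tends to repeat a small vocabulary at length.
    if len(words) > 50 and len(set(words)) / len(words) < 0.3:
        return False
    return True
```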
Invest in your request infrastructure. The old playbook (rotate IPs, solve CAPTCHAs, retry on failure) isn't enough. Modern anti-bot systems analyze TLS fingerprints, browser behavior, and session patterns. Smart proxy routing helps, but the real shift is from IP-level to behavior-level detection. If you're scraping JavaScript-heavy sites, browser-based collection is increasingly the only reliable approach.
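For the browser-based route, here's a minimal sketch using Playwright, one option among several; the URL is a placeholder, and you'd need `pip install playwright` plus `playwright install chromium` first.

```python
# Browser-based collection sketch: render the page like a real client,
# including JavaScript, before extracting content.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products/widget")
    page.wait_for_load_state("networkidle")  # let client-side rendering finish
    html = page.content()
    browser.close()
```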
The Access Gap Is Widening
We think the web is heading toward a clear split. On one side: sites that monetize data through paid access agreements, API partnerships, and licensed crawling. On the other: sites that treat all automated access as a threat and deploy increasingly aggressive countermeasures.
For data teams, this means collection costs will keep rising. Not because the technology is harder to build, but because the environment is more hostile. The teams that invest in responsible, transparent scraping practices will keep their access. The ones that look like training bots will get trapped, poisoned, and locked out.
Tarpits aren't going away. The question for your team isn't whether to worry about them. It's whether your infrastructure can spot the difference between a real page and a trap before that data hits your database.