====== Bot Traffic and the Crawler Problem ======

===== The Numbers =====

Per the [[https://www.imperva.com/resources/resource-library/reports/2025-bad-bot-report/|Imperva 2025 Bad Bot Report]]:

  * Automated traffic surpassed human-generated activity in 2024
  * **51% of all web traffic** is now bots
  * Good bots (search crawlers, etc.): 14%
  * **Bad bots: 37%** — six consecutive years of growth

Bad bots are used for data scraping, fraud, credential stuffing, and overwhelming servers. Their growth is driven by generative AI tools that make bot deployment faster, cheaper, and accessible to people with minimal technical skill.

===== AI Crawlers =====

AI training crawlers operate at an extractive crawl-to-referral ratio:

  * Anthropic's Claude crawler peaked at approximately **500,000:1** — 500,000 pages crawled per ~1 visitor sent back
  * **13.26% of AI bot requests actively ignored robots.txt** directives in Q2 2025

These crawlers consume bandwidth that independent publishers pay for and return almost nothing in exchange, and a significant fraction of them disregard the standard opt-out mechanism entirely.
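
To make the ratio concrete, here is a minimal Python sketch of the arithmetic. The function name and the sample counts are hypothetical, chosen only to reproduce the figure quoted above; real numbers would come from your own access logs and referrer data.

```python
def crawl_to_referral_ratio(pages_crawled: int, referrals: int) -> float:
    """Pages fetched by a crawler for every visitor it sends back."""
    if referrals <= 0:
        # No referrals at all: the ratio is unbounded.
        return float("inf")
    return pages_crawled / referrals


# Hypothetical counts for one crawler over some period.
ratio = crawl_to_referral_ratio(pages_crawled=500_000, referrals=1)
print(f"{ratio:,.0f}:1")  # 500,000:1
```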

===== Why robots.txt Is Not Enough =====

''robots.txt'' is a voluntary convention, and the worst offenders do not honour it. Maintaining a blocklist is an arms race that publishers tend to lose: new crawlers appear faster than blocks can be added. For independent publishers, neither is a viable solution on its own.
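
The sketch below, using Python's standard ''urllib.robotparser'', illustrates why this is the case: the check runs on the crawler's side, so it only has an effect if the crawler chooses to call it. The user agent ''ExampleAIBot'' and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that opts out of one AI crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """What a well-behaved crawler would ask before fetching a URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleAIBot", "https://example.org/post/1"))  # False
print(allowed("Mozilla/5.0", "https://example.org/post/1"))   # True
```

Nothing on the server enforces the ''False'' result. A crawler that never performs this check simply fetches the page.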

===== Real Mitigations =====

  * **DDoS protection / CDN** — absorbs volumetric bot traffic before it reaches your origin server. See [[folkzone:networking:deflect|Deflect]] (for independent/civil society sites) and [[folkzone:networking:cdn_consolidation|CDN options]]
  * **Rate limiting** — limit requests per IP per time window at the reverse proxy layer (Caddy, nginx)
  * **Honeypot paths** — serve fake endpoints that only bots visit; block IPs that hit them
  * **Netlify bandwidth** — if on a free tier, bot traffic can exhaust your bandwidth allocation before human visitors are served. 106.8 GB in a single month is not unusual for a site with any visibility
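
The rate-limiting and honeypot ideas above can be sketched in a few lines of Python. This is an illustration of the logic only, not a Caddy or nginx configuration: the class name ''BotGate'', the decoy paths, and the limits are all assumptions, and in practice the same checks would normally live in the reverse proxy.

```python
import time
from collections import defaultdict, deque

# Hypothetical decoy paths that no human visitor has a reason to request.
HONEYPOT_PATHS = {"/wp-login-backup.php", "/private-archive/"}


class BotGate:
    """Per-IP sliding-window rate limit plus a honeypot blocklist."""

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock                # injectable so the logic can be tested
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests
        self.blocked = set()              # IPs that touched a honeypot path

    def allow(self, ip: str, path: str) -> bool:
        if ip in self.blocked:
            return False
        if path in HONEYPOT_PATHS:
            self.blocked.add(ip)          # block this and all later requests
            return False
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False                  # over the limit: reject, do not record
        recent.append(now)
        return True


gate = BotGate(max_requests=3, window_seconds=10.0)
print([gate.allow("203.0.113.7", "/") for _ in range(4)])  # [True, True, True, False]
print(gate.allow("198.51.100.9", "/private-archive/"))     # False
print(gate.allow("198.51.100.9", "/"))                     # False
```

Keying on the IP address is the simplest choice, but it is only an approximation: crawlers that rotate addresses will slip past both checks.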

===== Intentional Apathy =====

One position: don't block AI crawlers. The [[indieweb:good_web|Good Web]] is built on openness; attempting to restrict crawlers selectively is an unwinnable arms race, and legitimate archiving bots (Wayback Machine, search engines) use the same mechanisms. See [[indieweb:principles|IndieWeb Principles]] on the balance between openness and protection.

===== See Also =====

  * [[folkzone:networking:start|Homelab Networking Index]]
  * [[folkzone:networking:deflect|Deflect]]
  * [[folkzone:networking:cdn_consolidation|CDN Consolidation]]
  * [[folkzone:services:cloudflared|Cloudflare Tunnel]]
  * [[folkzone:services:caddy|Caddy — Rate Limiting]]
  * [[start|Return to wiki home]]
  * [[folkzone:start|Return to folkzone]]