====== Bot Traffic and the Crawler Problem ======

===== The Numbers =====

Per the [[https://www.imperva.com/resources/resource-library/reports/2025-bad-bot-report/|Imperva 2025 Bad Bot Report]]:

  * Automated traffic surpassed human-generated activity in 2024
  * **51% of all web traffic** is now bots
  * Good bots (search crawlers, etc.): 14%
  * **Bad bots: 37%**, after six consecutive years of growth

Bad bots are used for data scraping, fraud, credential stuffing, and overwhelming servers. Their growth is driven by generative AI tools that make bot deployment faster, cheaper, and accessible to people with minimal technical skill.

===== AI Crawlers =====

AI training crawlers operate at an extractive crawl-to-referral ratio:

  * Anthropic's Claude crawler peaked at approximately **500,000:1**, that is, roughly 500,000 pages crawled for every visitor sent back
  * **13.26% of AI bot requests actively ignored robots.txt** directives in Q2 2025

These crawlers consume bandwidth that independent publishers pay for and send almost no visitors back in return. A significant fraction also disregard the standard opt-out mechanism entirely.

===== Why robots.txt Is Not Enough =====

The worst offenders do not act in good faith and do not honour ''robots.txt''. Maintaining a blocklist is a losing arms race: new crawlers appear faster than blocks can be added. For independent publishers, a blocklist alone is not a viable solution.

===== Real Mitigations =====

  * **DDoS protection / CDN**: absorbs volumetric bot traffic before it reaches your origin server. See [[folkzone:networking:deflect|Deflect]] (for independent/civil society sites) and [[folkzone:networking:cdn_consolidation|CDN options]]
  * **Rate limiting**: limit requests per IP per time window at the reverse proxy layer (Caddy, nginx)
  * **Honeypot paths**: serve fake endpoints that only bots visit, then block the IPs that hit them
  * **Netlify bandwidth**: on a free tier, bot traffic can exhaust your bandwidth allocation before human visitors are served. 106.8 GB in a single month is not unusual for a site with any visibility

===== Intentional Apathy =====

One position: don't block AI crawlers. The [[indieweb:good_web|Good Web]] is built on openness. Attempting to restrict crawlers selectively is an unwinnable arms race, and legitimate crawlers (Wayback Machine, search engines) use the same mechanisms, so broad blocks can catch them too.

See [[indieweb:principles|IndieWeb Principles]] on the balance between openness and protection.

===== See Also =====

  * [[folkzone:networking:start|Homelab Networking Index]]
  * [[folkzone:networking:deflect|Deflect]]
  * [[folkzone:networking:cdn_consolidation|CDN Consolidation]]
  * [[folkzone:services:cloudflared|Cloudflare Tunnel]]
  * [[folkzone:services:caddy|Caddy: Rate Limiting]]
  * [[start|Return to wiki home]]
  * [[folkzone:start|Return to folkzone]]
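
===== Sketch: Rate Limiting and Honeypot Paths =====

The rate-limiting and honeypot mitigations listed under Real Mitigations can be illustrated with a small Python sketch. This is not how Caddy or nginx implement either feature; it only shows the logic of a per-IP sliding window combined with a decoy-path blocklist. The class name ''BotGate'', the decoy paths, and the default limits are all hypothetical and chosen for illustration.

```python
import time
from collections import defaultdict, deque

# Hypothetical decoy paths: nothing on the real site links to them,
# so only crawlers probing blindly are expected to request them.
HONEYPOT_PATHS = {"/wp-login-old.php", "/private-archive/"}


class BotGate:
    """Per-IP sliding-window rate limiter with a honeypot blocklist (illustrative)."""

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests      # allowed requests per window (assumed value)
        self.window_seconds = window_seconds  # window length in seconds (assumed value)
        self.clock = clock                    # injectable so the logic can be tested
        self.hits = defaultdict(deque)        # ip -> timestamps of recent allowed requests
        self.blocked = set()                  # IPs that requested a honeypot path

    def allow(self, ip, path):
        """Return True if the request should be served, False if it should be refused."""
        if ip in self.blocked:
            return False
        if path in HONEYPOT_PATHS:
            # Any hit on a decoy path blocks the IP for the life of this object.
            self.blocked.add(ip)
            return False

        now = self.clock()
        window = self.hits[ip]
        # Discard timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True
```

A request handler would call ''gate.allow(client_ip, request_path)'' and answer with HTTP 429 or 403 when it returns ''False''. In practice the same policy usually belongs at the reverse proxy, so refused requests never reach the application; this sketch also keeps all state in memory, so blocks are lost on restart.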