Robots.txt for AI Crawlers: The 2026 Guide to Blocking GPTBot and More

Mirsal Saidu

Protect your content from unauthorized AI training. How to block GPTBot, ClaudeBot, and Perplexity using robots.txt with updated 2026 crawler agents.

Every minute your site stays open to unauthorized AI crawlers, your original content is being absorbed into training corpora you never consented to. In 2026, the major AI labs have settled on a handful of well-behaved user agents, but the list has grown, the rules have shifted, and a stale robots.txt from 2023 will let through crawlers it never anticipated. This guide gives you the exact directives to block GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and the rest of the 2026 lineup, with the nuance you need to keep search traffic while shutting the training door.

How do I block AI crawlers with robots.txt?

Add a User-agent block in your robots.txt file for each AI crawler you want to block, followed by Disallow: /. The 2026 essentials are GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, and Bytespider. Place the file at your domain root and verify with each vendor's published agent documentation.

Why robots.txt still matters in 2026

The Robots Exclusion Protocol turned thirty in 2024, and despite predictions that AI scraping would render it obsolete, the opposite has happened. After a wave of lawsuits in 2024 and 2025, the major foundation-model labs publicly committed to honoring robots.txt directives. Crucially, each of them publishes a stable user-agent string so site owners can opt out.

That said, robots.txt is a request, not a wall. Honest crawlers obey it. Less scrupulous ones don't. For genuine enforcement you need WAF rules or IP-level blocks, but robots.txt remains the load-bearing first line of defense and the legal record that you withheld consent.

What changed between 2023 and 2026

  • Separate agents for training vs. retrieval. OpenAI, Anthropic, and Google now ship distinct user agents for model training and for real-time user-prompted retrieval. Blocking one no longer blocks the other.
  • Search-AI hybrids. Google-Extended, Applebot-Extended, and Bingbot's AI-training opt-out are now decoupled from their search crawlers, so you can stay indexed in search while opting out of training.
  • New entrants. Meta-ExternalAgent, Amazonbot, Bytespider, and DuckAssistBot all became significant traffic sources between 2024 and 2026.

The 2026 AI crawler block list

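Drop the following directives into your robots.txt file. Each block uses its own User-agent declaration. Listing several agents on a single User-agent line (for example, comma-separated) is not part of the spec and is silently ignored by some crawlers. The spec does allow several consecutive User-agent lines to share one set of rules, but one group per agent is easier to audit.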

# OpenAI — model training
User-agent: GPTBot
Disallow: /

# OpenAI — real-time browsing for ChatGPT users
User-agent: ChatGPT-User
Disallow: /

# OpenAI — search indexing (SearchGPT)
User-agent: OAI-SearchBot
Disallow: /

# Anthropic — Claude training and retrieval
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Google — AI training opt-out (does not affect Search)
User-agent: Google-Extended
Disallow: /

# Apple — Apple Intelligence training opt-out
User-agent: Applebot-Extended
Disallow: /

# Common Crawl — feeds most open-source training sets
User-agent: CCBot
Disallow: /

# ByteDance / TikTok
User-agent: Bytespider
Disallow: /

# Meta — Llama and AI assistants
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Disallow: /

# Amazon — Alexa and Bedrock training
User-agent: Amazonbot
Disallow: /

# Cohere
User-agent: cohere-ai
Disallow: /

# DuckDuckGo AI assistant
User-agent: DuckAssistBot
Disallow: /

Generating the right file for your domain is fiddly: a missing newline or a stray BOM character can cause some parsers to misread a block. Our robots.txt generator handles the encoding, ordering, and validation in one step and outputs a current 2026 list.
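For readers who prefer to script the file themselves, here is a minimal Python sketch of the idea. The helper names build_robots_txt and write_robots_txt are hypothetical, and the agent list is only the "essentials" subset named earlier in this guide. It writes one group per agent, separated by blank lines, as UTF-8 without a byte order mark.

```python
# Hypothetical helpers for illustration; not part of any vendor tooling.
AI_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "CCBot",
    "Bytespider",
]


def build_robots_txt(agents, disallow="/"):
    """Return robots.txt text with one group per agent, blank line between groups."""
    groups = [f"User-agent: {agent}\nDisallow: {disallow}\n" for agent in agents]
    return "\n".join(groups)


def write_robots_txt(path, agents=AI_AGENTS):
    """Write the generated text to disk."""
    # encoding="utf-8" emits no byte order mark (unlike "utf-8-sig");
    # newline="\n" keeps line endings the same on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(build_robots_txt(agents))
```

Calling write_robots_txt("robots.txt") produces a file you can upload to the domain root. Extend AI_AGENTS with the remaining agents from the block list as needed.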

Block training, allow search: the surgical approach

Most publishers don't want to disappear from Google or Bing. They want to stop unauthorized training. That distinction is now possible:

# Keep Google Search indexing
User-agent: Googlebot
Allow: /

# But opt out of Gemini training and grounding
User-agent: Google-Extended
Disallow: /

# Keep Bing Search
User-agent: Bingbot
Allow: /

# Apple Search stays, Apple Intelligence training blocked
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /

This split lets you preserve organic traffic while withholding training consent, a configuration that was technically impossible until late 2023 and only became reliable across all three vendors in 2025.
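To sanity-check a split like this before deploying it, a small matcher can confirm which agents are allowed. The sketch below is a simplified reading of RFC 9309 group selection, not a full parser. It matches the user-agent token exactly (case-insensitive), falls back to * when no group names the agent, and lets the longest matching path win. It ignores the * and $ path wildcards. The function names are hypothetical.

```python
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bingbot
Allow: /

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /
"""


def parse_groups(robots_txt):
    """Map each lowercase user-agent token to its list of (directive, path) rules."""
    groups, current_agents, in_rules = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(robots_txt, agent, path):
    """True if the agent may fetch the path under the simplified rules above."""
    groups = parse_groups(robots_txt)
    rules = groups.get(agent.lower(), groups.get("*", []))
    best_directive, best_len = "allow", -1  # no matching rule means allowed
    for directive, rule_path in rules:
        if not rule_path or not path.startswith(rule_path):
            continue
        longer = len(rule_path) > best_len
        tie_allow = len(rule_path) == best_len and directive == "allow"
        if longer or tie_allow:
            best_directive, best_len = directive, len(rule_path)
    return best_directive == "allow"
```

With the file above, is_allowed(ROBOTS_TXT, "Googlebot", "/blog/post") is true, while the same call for "Google-Extended" or "Applebot-Extended" is false. Exact token matching matters here because Applebot is a prefix of Applebot-Extended.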

Per-path blocks: protecting premium content only

Site-wide blocks can be too blunt. If you run a marketing site with a paywalled archive, you may want to expose the front end but seal off the archive:

User-agent: GPTBot
Disallow: /archive/
Disallow: /premium/
Disallow: /members/

User-agent: ClaudeBot
Disallow: /archive/
Disallow: /premium/
Disallow: /members/

Order matters less than you might think, because modern crawlers evaluate the most specific match, but readability matters a lot. Group by agent, comment generously, and avoid trailing whitespace.
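A quick way to confirm that per-path rules behave as intended is to run them through Python's standard-library urllib.robotparser. This is a sketch under two assumptions: example.com stands in for your domain, and the wrapper name can_fetch is ours. Note that this parser applies rules in file order rather than by longest match, which is fine here because the groups contain only Disallow lines.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /archive/
Disallow: /premium/
Disallow: /members/

User-agent: ClaudeBot
Disallow: /archive/
Disallow: /premium/
Disallow: /members/
"""


def can_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Parse the rules from a string and ask whether the agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The paywalled archive is sealed off, the public front end is not.
archive_ok = can_fetch("GPTBot", "https://example.com/archive/2021/report")
pricing_ok = can_fetch("GPTBot", "https://example.com/pricing")
```

Here archive_ok is False and pricing_ok is True. An agent with no group in the file, such as Googlebot, is allowed everywhere.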

Common mistakes that silently fail

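Wildcard user agents are not enough for AI bots

A User-agent: * group followed by Disallow: / is not a safe way to block GPTBot on its own. Under the Robots Exclusion Protocol (RFC 9309), a crawler follows the group that names it and falls back to * only when no named group matches, so the wildcard should cover compliant AI crawlers. In practice there are two problems. A site-wide wildcard block also shuts out every search crawler that lacks its own group, and several AI crawlers reportedly treat the absence of a named block as implicit permission. A named group also replaces the * group for that crawler rather than adding to it, so any rule you want applied must be repeated inside the named group. Always name the agent explicitly.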
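The interaction between named groups and the wildcard is easy to get wrong, so here is a small illustration using Python's standard-library urllib.robotparser. ExampleBot is a made-up agent with no group of its own, and example.com is a placeholder domain. The parser picks the group that names the agent and uses * only as a fallback, so the two groups are never merged.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot has its own group, so only that group applies to it:
# /private/ is NOT blocked for GPTBot, even though "*" disallows it.
gptbot_private = parser.can_fetch("GPTBot", "https://example.com/private/page")
gptbot_archive = parser.can_fetch("GPTBot", "https://example.com/archive/page")

# ExampleBot (hypothetical) has no named group and falls back to "*".
other_private = parser.can_fetch("ExampleBot", "https://example.com/private/page")
```

In this sketch gptbot_private is True, gptbot_archive is False, and other_private is False. If GPTBot should also stay out of /private/, that Disallow line has to be repeated inside the GPTBot group.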

Meta robots tags don't help here

The noindex directive, whether set in a robots meta tag or in an X-Robots-Tag header, tells search engines not to index a page. It does not instruct AI training crawlers to skip the content, and most ingest the HTML regardless of indexing directives. Use robots.txt for ingestion control and meta robots for index control.

Caching and CDN propagation

If your site sits behind Cloudflare, Fastly, or a similar CDN, an updated robots.txt may take up to 24 hours to propagate. Purge the cache for /robots.txt immediately after editing, and verify with curl -A "GPTBot" https://yourdomain.com/robots.txt from a clean network.
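One way to confirm that the CDN is serving the updated file, rather than a cached older copy, is to fetch it and check that every agent you expect has a User-agent line. The sketch below assumes Python's standard library only. The function names, the EXPECTED_AGENTS list, and the robots-check/0.1 client string are illustrative choices, not anything a vendor prescribes.

```python
import urllib.request

EXPECTED_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "CCBot",
    "Bytespider",
]


def named_agents(robots_txt):
    """Return the set of lowercase tokens that appear on User-agent lines."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def missing_agents(robots_txt, expected=EXPECTED_AGENTS):
    """List expected agents that have no User-agent line in the served file."""
    present = named_agents(robots_txt)
    return [agent for agent in expected if agent.lower() not in present]


def fetch_robots_txt(url):
    """Download the live file; 'utf-8-sig' strips a byte order mark if one is present."""
    request = urllib.request.Request(url, headers={"User-Agent": "robots-check/0.1"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8-sig")
```

Running missing_agents(fetch_robots_txt("https://yourdomain.com/robots.txt")) after a purge should return an empty list. Any names it returns are agents the edge is still not blocking.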

Verifying your block list works

Three checks before you call it done:

  1. Fetch the file directly. curl https://yourdomain.com/robots.txt should return the file with HTTP 200, content type text/plain, and no HTML wrapping.
  2. Check vendor-published testers. OpenAI, Google, and Anthropic each publish either a tester or a documented agent verification page. For Google agents, the robots.txt report in Google Search Console, which replaced the standalone robots.txt tester in late 2023, remains the most reliable.
  3. Monitor your logs. Filter access logs for the agent strings above. A correctly blocked bot will still hit /robots.txt (that's how it learns it's blocked) but should not request other URLs afterward.
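The log check in step 3 can be automated with a short script. The sketch below assumes access logs in the common "combined" format, where the request line and the user agent are double-quoted. The regular expression and the violations helper are illustrative and may need adjusting for your server's log layout. Google-Extended and Applebot-Extended are left out of the list because they are control tokens in robots.txt rather than user-agent strings that show up in logs.

```python
import re
from collections import defaultdict

BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]

# Matches: "GET /path HTTP/1.1" 200 1234 "referer" "user agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def violations(log_lines, blocked=BLOCKED_AGENTS):
    """Map each blocked agent to the paths it requested other than /robots.txt."""
    hits = defaultdict(list)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines in an unexpected format
        path = match["path"]
        agent = match["agent"].lower()
        for name in blocked:
            if name.lower() in agent and path != "/robots.txt":
                hits[name].append(path)
    return dict(hits)
```

Pass it an open log file, for example violations(open("access.log")), where access.log is a placeholder for your own log path. An empty result means none of the listed agents requested anything beyond /robots.txt in that log.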

When robots.txt isn't enough

For sensitive content (paid archives, member-only material, proprietary data), layer additional defenses:

  • WAF rules blocking the IP ranges that AI vendors publish for their crawlers (OpenAI and Perplexity publish lists; confirm current availability in each vendor's crawler documentation).
  • Cloudflare's AI bot management, which uses behavioral signals to catch crawlers that spoof user agents.
  • Authentication. Anything truly valuable should require a login, not a politeness directive.
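One way a WAF rule or log job can use published IP ranges is to flag requests whose user agent claims to be a vendor crawler but whose source address lies outside that vendor's ranges. The sketch below uses Python's ipaddress module. The CIDR blocks are placeholders taken from the reserved documentation ranges, not real vendor ranges, and is_spoofed is a hypothetical helper name. Substitute the lists each vendor actually publishes.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks, NOT real vendor ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_spoofed(claimed_agent, remote_ip, ranges=PUBLISHED_RANGES):
    """True when the IP falls outside every range listed for the claimed agent."""
    networks = [ipaddress.ip_network(cidr) for cidr in ranges.get(claimed_agent, [])]
    if not networks:
        return False  # no ranges known for this agent, so we cannot judge
    address = ipaddress.ip_address(remote_ip)
    return not any(address in network for network in networks)
```

With the placeholder data, is_spoofed("GPTBot", "203.0.113.9") is True because the address is outside the listed block, while is_spoofed("GPTBot", "192.0.2.17") is False.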

FAQ

Does blocking GPTBot remove my content from ChatGPT?

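Going forward, it closes one route: GPTBot will stop ingesting your pages for future model training. Content already ingested in prior crawls remains in existing models, and there is currently no retroactive removal mechanism short of a formal data deletion request to OpenAI. Your pages can also still surface in ChatGPT through OAI-SearchBot and ChatGPT-User unless you block those agents as well.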

Will blocking Google-Extended hurt my SEO?

No. Google-Extended is a separate control from Googlebot, and Google has confirmed publicly that robots.txt directives for Google-Extended do not affect Search ranking, indexing, or inclusion in standard results. It governs whether your content is used to train and ground Gemini models. AI Overviews are a Search feature, so they follow your Googlebot rules and snippet controls rather than Google-Extended.

How often should I update my robots.txt for AI crawlers?

Review quarterly at minimum. New agents appear roughly every six to nine months as new labs ship public models or existing labs split their crawler responsibilities. Subscribing to the AI vendor changelogs (OpenAI, Anthropic, Google AI) is the most reliable signal.

Do AI crawlers actually respect robots.txt?

The named major-lab crawlers in this guide do, based on independent verification studies published in 2024 and 2025. Smaller scrapers and adversarial actors do not. For high-value content, combine robots.txt with WAF rules and authentication.

Can I block AI crawlers from specific countries or regions?

Robots.txt does not support geo-targeting. It operates per user agent only. For regional blocks, use your CDN's geo-fencing or WAF rules. Note that most AI crawlers operate from a small set of US-based IP ranges regardless of where the resulting model is consumed.

What's the difference between ClaudeBot and Claude-Web?

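ClaudeBot is Anthropic's training crawler: it ingests content for future Claude model training. Claude-Web is the real-time retrieval agent used when a Claude user enables web browsing during a conversation. Blocking one does not block the other, so list both. Anthropic's crawler documentation has also listed Claude-User and Claude-SearchBot for user-initiated fetches and search indexing, so check it for the current set of tokens.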

Last updated: 21 May 2026



Mirsal Saidu

Digital & Performance Marketer