
Which AI engines respect robots.txt?
The documented per-bot view, sourced from each vendor.

The major AI vendors publish two distinct kinds of robots.txt string: log-visible HTTP crawler user agents, which make real HTTP requests under that UA, and robots.txt usage-control tokens, which have no HTTP fetcher of their own because the crawling is done by a separate UA. Autonomous crawlers honour robots.txt directives; user-triggered fetchers (ChatGPT-User, Perplexity-User) are documented by their vendors as serving user-initiated requests, where robots.txt may not apply in the same way. This page lists each string, marks its kind, and links to the vendor docs. Treat the table as the canonical access-control reference for AI visibility work.

By Martin Yarnold · Updated
robots.txt audit
Sentinel verifies your robots.txt against the documented AI crawler UAs and usage-control tokens every Monday and flags blocks you didn't intend.
See how Sentinel measures it →

The documented AI robots.txt strings

All entries document robots.txt compliance at the vendor level. The Kind column distinguishes log-visible HTTP crawler UAs from robots.txt usage-control tokens that have no HTTP fetcher of their own. The docs URL column points to the canonical vendor article — always verify against the current docs before quoting in compliance, legal, or commercial materials.

| String | Vendor | Kind | Purpose | Docs URL |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Crawler UA | Model training | platform.openai.com/docs/bots |
| OAI-SearchBot | OpenAI | Crawler UA | ChatGPT Search citations | platform.openai.com/docs/bots |
| ChatGPT-User | OpenAI | Crawler UA | User-triggered fetches | platform.openai.com/docs/bots |
| ClaudeBot | Anthropic | Crawler UA | Grounding + training | support.claude.com (crawler article) |
| Claude-User | Anthropic | Crawler UA | User-triggered fetches | support.claude.com (crawler article) |
| Claude-SearchBot | Anthropic | Crawler UA | Search retrieval | support.claude.com (crawler article) |
| PerplexityBot | Perplexity | Crawler UA | Citation indexing | docs.perplexity.ai (crawlers) |
| Perplexity-User | Perplexity | Crawler UA | User-triggered fetches | docs.perplexity.ai (crawlers) |
| Googlebot | Google | Crawler UA | Google Search indexing (also serves AI features in Search; controlled separately by preview controls like nosnippet/data-nosnippet/max-snippet/noindex) | developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers |
| Google-Extended | Google | robots.txt token | Controls whether Google may use crawled content for Gemini Apps / Vertex AI Gemini training and grounding. NOT a log-visible user agent; AI Overviews / AI Mode are Search features controlled separately. | developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers |
| Applebot | Apple | Crawler UA | Apple search and Siri indexing | support.apple.com/en-us/119829 |
| Applebot-Extended | Apple | robots.txt token | Controls whether Apple may use Applebot-crawled data for Apple Intelligence and generative AI training. NOT a log-visible user agent; Applebot does the crawling. | support.apple.com/en-us/119829 |
| Bingbot | Microsoft | Crawler UA | Bing + Copilot retrieval | bing.com/webmasters/help/which-crawlers-does-bing-use |
| CCBot | Common Crawl | Crawler UA | Open dataset (used by many AI labs) | commoncrawl.org/ccbot |
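
Both kinds of string are written the same way in a robots.txt file, as the value of a User-agent line followed by rules; the difference is what honours the group. A minimal sketch, with placeholder paths, showing a crawler UA group next to a usage-control token group:

```
# Crawler UA: GPTBot itself fetches pages under this UA and reads this group.
User-agent: GPTBot
Disallow: /internal/          # placeholder path

# Usage-control token: nothing fetches as "Google-Extended"; Googlebot does the
# crawling, and Google applies this group to Gemini training and grounding use.
User-agent: Google-Extended
Disallow: /
```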

Frequently asked questions

Which AI engines respect robots.txt?

The major engines document robots.txt compliance differently for autonomous crawlers and user-triggered fetchers. OpenAI documents GPTBot and OAI-SearchBot, its autonomous crawlers, as honouring robots.txt allow/disallow rules; ChatGPT-User is a user-triggered fetcher, and OpenAI's bot docs call out that robots.txt rules may not apply to user-initiated requests in the same way as to autonomous crawling. Anthropic documents ClaudeBot (grounding and training), Claude-SearchBot (search retrieval), and Claude-User (user-triggered fetches) in its crawler article. Perplexity documents PerplexityBot, its autonomous crawler, as honouring robots.txt; Perplexity-User is a user-triggered fetcher, and Perplexity has stated that Perplexity-User generally does not treat robots.txt as binding because the fetch is user-initiated. Google documents Googlebot (the actual crawler UA, which serves Google Search indexing and is also the underlying fetcher for AI features in Search) and Google-Extended (a robots.txt usage-control token, not a log-visible UA, that controls whether crawled content may be used for Gemini Apps / Vertex AI Gemini training and grounding). Apple documents Applebot (the actual crawler UA) and Applebot-Extended (a robots.txt usage-control token that controls whether Apple may use Applebot-crawled data for Apple Intelligence). Microsoft documents Bingbot. Common Crawl's CCBot also documents robots.txt support. Treat compliance as documented at the vendor level.
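
To make the autonomous-versus-user-triggered distinction concrete, here is an illustrative robots.txt fragment; the path is a placeholder, not a recommendation. The GPTBot group is the documented contract for autonomous crawling, while the ChatGPT-User group covers fetches a person triggers inside ChatGPT, which OpenAI documents as potentially outside ordinary robots.txt handling.

```
# Autonomous crawler: documented to honour this group.
User-agent: GPTBot
Disallow: /drafts/            # placeholder path

# User-triggered fetcher: a person asking ChatGPT to open a URL triggers this UA;
# the vendor docs say robots.txt may not apply in the same way to such requests.
User-agent: ChatGPT-User
Disallow: /drafts/
```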

What is the difference between Google-Extended and Googlebot in robots.txt?

Googlebot is Google's actual crawler user agent — it makes HTTP requests under that UA, serves Google Search indexing, and (per Google's docs) is the same fetcher whose responses feed AI features in Search. Google-Extended is a separate, AI-specific robots.txt usage-control token with no HTTP fetcher of its own; it controls whether crawled content may be used for Gemini Apps and Vertex AI Gemini training and grounding. Disallowing Google-Extended does not remove pages from Google Search and does not by itself remove pages from AI Overviews / AI Mode in Search — those are controlled by Googlebot + preview controls (nosnippet, data-nosnippet, max-snippet, noindex). Disallowing Googlebot removes pages from Google Search.
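
An illustrative fragment (placeholder rules only) of the common "stay in Search, opt out of Gemini training" configuration described above:

```
# Keep normal Google Search crawling and indexing. This group is optional,
# since allow-all is the default; it is shown here only for contrast.
User-agent: Googlebot
Allow: /

# Opt crawled content out of Gemini Apps / Vertex AI Gemini training and grounding.
# This does not remove pages from Search or, by itself, from AI Overviews / AI Mode.
User-agent: Google-Extended
Disallow: /
```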

Do AI bots actually honour robots.txt in practice, or is it just documented?

Major engines actively enforce robots.txt at the crawler level: this is verifiable by adding a disallow rule and confirming in your server logs that the bot stops hitting the blocked paths. Edge cases: (1) some bots cache robots.txt and may take hours to days to honour a new rule; (2) user-triggered fetches (ChatGPT-User, Claude-User) sometimes behave differently from the autonomous bots; (3) third-party crawlers that copy content from indexed sources do not honour your robots.txt. For the named major-engine bots, compliance is the documented contract; for anything else, treat compliance as not guaranteed.
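
One way to run that check, as a sketch: scope a disallow to a single bot and a single low-value path (the path below is hypothetical), wait for the bot to refresh its cached robots.txt, then confirm in your access logs that requests from that UA to the path stop.

```
# Hypothetical compliance test: one bot, one throwaway path.
User-agent: GPTBot
Disallow: /robots-compliance-test/
```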

Should I block training crawlers like GPTBot and ClaudeBot?

For most commercial sites, no. Blocking removes your content from future model knowledge, which makes it less likely that future AI engine answers about your domain cite you. That is the opposite of AI visibility. The cases that justify blocking: regulated content (healthcare, finance) with disclosure constraints, highly proprietary content you do not want associated with the brand at training scale, or specific regulatory requirements. For typical commercial content, blocking training bots is net-negative for visibility.
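
For the exception cases above, a sketch of a robots.txt that blocks the training-oriented strings from the table while leaving the search and citation crawlers (OAI-SearchBot, PerplexityBot, Googlebot, Bingbot) untouched; adjust the list to the vendors you actually want to exclude.

```
# Training-oriented crawler UAs and usage-control tokens only;
# search/citation crawlers are deliberately not listed here.
# Note: the table above lists ClaudeBot as grounding + training,
# so blocking it also affects Claude citations.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```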

Can I rate-limit AI bots instead of fully blocking them?

robots.txt does not support rate limiting directly — it is allow/disallow only. Rate limiting is a CDN- or WAF-layer concern. Cloudflare, Vercel firewall, and AWS WAF all support per-user-agent rate limits. If a bot is hitting your origin too hard, rate-limit at the edge layer rather than blocking via robots.txt. Blocking via robots.txt is binary; rate-limiting at the edge preserves the engine's ability to reach the most important pages while protecting your origin.

What about third-party crawlers that resell content to AI engines?

Common Crawl (CCBot) is the most-cited example: it operates an open dataset that many AI labs ingest, and CCBot itself documents robots.txt compliance. Blocking CCBot removes your content from the Common Crawl dataset. The major engines (OpenAI, Anthropic, Google, Perplexity) primarily crawl directly with their own named bots, so blocking CCBot has the largest effect on long-tail engines and research uses and less on the major commercial citation surfaces.

Get a free Sentinel snapshot →

Bot list and robots.txt compliance reference each vendor's published crawler documentation as of 10 May 2026. Bot names, user agents, and crawler behaviour are subject to change; always verify against the current vendor docs before quoting in compliance, legal, or commercial materials. ChatGPT, GPTBot, Claude, ClaudeBot, PerplexityBot, Google-Extended, Googlebot, Applebot, Bingbot, Copilot are trademarks of their respective owners. Promagen Ltd is independent of these companies.
