
What is an AI crawler?
The bots that read the web for generative answers.

AI crawlers are HTTP bots operated by AI engine companies. They are distinct from classic search engine crawlers because they feed grounded generation — the synthesised answers users read — not the ten blue links of a search results page. This page lists the major crawlers by purpose, explains why each company runs several different bots, and shows how to configure robots.txt for them.

By Martin Yarnold · Updated
Bot crawl matrix
Sentinel's weekly transparency report shows actual observed AI-crawler hits across the named bots, a useful reference for what a real site sees in practice.
See how Sentinel measures it →

AI engine fetcher bots

Ten named fetcher bots make up the practical set in 2026. Each has an HTTP user-agent string and appears in server logs, but robots.txt behaviour differs: autonomous crawlers honour robots.txt, while user-initiated fetchers (ChatGPT-User, Claude-User, Perplexity-User) are vendor-specific because the request is treated as a user action rather than autonomous crawling.

| Bot | Company | Purpose | Robots behaviour |
|---|---|---|---|
| GPTBot | OpenAI | Model training | Honours robots.txt |
| OAI-SearchBot | OpenAI | ChatGPT Search citations | Honours robots.txt |
| ChatGPT-User | OpenAI | User-triggered fetches | Vendor-specific (user-initiated) |
| ClaudeBot | Anthropic | Claude grounding + training | Honours robots.txt |
| Claude-User | Anthropic | User-triggered fetches | Vendor-specific (user-initiated) |
| Claude-SearchBot | Anthropic | Claude search retrieval | Honours robots.txt |
| PerplexityBot | Perplexity | Citation indexing | Honours robots.txt |
| Perplexity-User | Perplexity | User-triggered fetches | Vendor-specific (user-initiated) |
| Bingbot | Microsoft | Bing + Copilot retrieval | Honours robots.txt |
| CCBot | Common Crawl | Open dataset (used by many AI labs) | Honours robots.txt |
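As a sketch, a robots.txt that keeps the real-time citation fetchers while opting out of training collection could group the bots like this (which bots to include is a policy choice, not a recommendation):

```text
# Real-time citation fetchers: leave open (no Disallow rules)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training crawlers: opt out site-wide
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Each bot reads only its own `User-agent:` group; a bot with no matching group falls back to the `*` group, or to fully allowed if none exists.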

robots.txt usage-control tokens (no HTTP fetcher)

These names look like bots but are not. They have no HTTP fetcher and never appear in server logs. They are valid only as User-agent: lines in robots.txt and act as opt-out signals for AI training corpora — not real-time citation surfaces.

| Token | Company | What it controls |
|---|---|---|
| Google-Extended | Google | Opts out of Gemini Apps generative training and Vertex AI grounding. Does NOT control Google Search AI Overviews or AI Mode; those are governed by Googlebot plus snippet preview controls (nosnippet, max-snippet, data-nosnippet). |
| Applebot-Extended | Apple | Opts out of Apple Intelligence training. Does NOT block crawling for Apple search and Siri, which uses the separate Applebot user agent. |
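Because these tokens have no fetcher, they only ever appear as `User-agent:` groups in robots.txt; no hits for them will ever show in server logs. A minimal opt-out fragment looks like:

```text
# Usage-control tokens: nothing fetches under these names.
# The Disallow is read as a training opt-out signal.
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```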

What blocking each one actually does

Allow (real-time citation fetchers)

OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot. Blocking these is the most direct way to remove a site from those engines' real-time grounded answers.

Decide carefully (training fetchers)

GPTBot, CCBot, and ClaudeBot's training side. Blocking removes content from future model knowledge and from future AI answers about a domain. Only block for a clear regulatory or proprietary-content reason.

Different concept (usage-control tokens)

Google-Extended and Applebot-Extended have no HTTP fetcher. Disallowing them opts a site out of Gemini Apps / Vertex AI training (Google-Extended) or Apple Intelligence training (Applebot-Extended). It does NOT remove pages from Google Search, AI Overviews, or Apple search.

Frequently asked questions

Why does each AI company run multiple crawler bots?

Different bots serve different purposes and need to be controlled independently. OpenAI runs GPTBot (for model training), ChatGPT-User (for user-triggered fetches during a chat), and OAI-SearchBot (for ChatGPT Search citations). A site might want to allow citation traffic but block training collection — which only works if the bots are named distinctly. Most AI companies have settled on this 2–4 bot pattern over the past two years.

What is the difference between GPTBot and ChatGPT-User?

GPTBot crawls the web autonomously to collect training data for OpenAI's future models. ChatGPT-User fetches pages on demand when a ChatGPT user asks a question that requires a live retrieval — for example, "summarise this article" pasted as a URL. The two have different rate limits, different robots.txt obedience patterns, and very different implications: GPTBot informs model training (long-tail brand exposure), ChatGPT-User informs a single user's session (immediate citation surface).
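The robots.txt side of this distinction can be checked mechanically. A minimal sketch using Python's standard-library parser, with an illustrative robots.txt that blocks only GPTBot (the real ChatGPT-User behaviour is vendor-specific; the parser simply models what a compliant robots.txt reader would conclude):

```python
from urllib import robotparser

# Hypothetical robots.txt: opt out of training collection (GPTBot)
# but leave user-triggered fetches (ChatGPT-User) unaddressed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its group and is blocked everywhere.
print(rp.can_fetch("GPTBot", "https://example.com/post"))        # False
# ChatGPT-User matches no group and there is no "*" fallback,
# so a compliant parser treats it as allowed by default.
print(rp.can_fetch("ChatGPT-User", "https://example.com/post"))  # True
```

The same check works for any of the named bots, which is useful when auditing whether a robots.txt change actually separates citation traffic from training collection.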

How is Google-Extended different from Googlebot?

Googlebot is an HTTP fetcher that crawls for Google Search indexing — including the content Google Search uses for AI Overviews and AI Mode. Google-Extended is not a fetcher: it is a robots.txt usage-control token that opts a site out of Gemini Apps generative training and Vertex AI grounding. Disallowing Google-Extended does not remove pages from Google Search and does not turn off AI Overviews — those are governed by Googlebot plus snippet preview controls (nosnippet, max-snippet, data-nosnippet). The two operate on different surfaces.

Should I block training crawlers like GPTBot and ClaudeBot?

For most sites, no. Blocking training bots removes your content from future model knowledge, which means future AI engine answers about your domain will not cite you. That is the opposite of AI visibility. The case for blocking: highly proprietary content, regulated industries (healthcare, finance) with disclosure constraints, or content you actively don't want associated with the brand at scale. For commercial content the blocking decision is almost always net-negative for visibility.

Can AI crawlers execute JavaScript?

Increasingly, yes, but unreliably. GPTBot, ClaudeBot, and Perplexity's crawlers all have some JavaScript execution capability, but their patience and fidelity are lower than Googlebot's. A page that renders entirely on the client (e.g. a heavy React app with no server-side rendering) is at significant risk of being fetched empty by AI crawlers. Server-side rendering (SSR) or static-site generation (SSG) is currently the safest pattern for AI visibility.

How often do AI crawlers visit a site?

Highly variable. A medium-sized B2B site might see 10–100 AI crawler hits per day across all named bots combined. High-authority news sites see thousands per day. The cadence is unpredictable and changes when engines update retrieval indexes or run training cuts. Sentinel's bot crawl matrix on the weekly transparency report shows actual observed crawl behaviour on Promagen's own pages — a useful reference for what a real site's pattern looks like in practice.
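A quick way to get a feel for a site's own cadence is to count the named bots in an access log by user-agent substring. A minimal sketch (the log lines and helper are illustrative; production use needs real log parsing and verification of claimed user agents against published IP ranges, since the strings are trivially spoofable):

```python
from collections import Counter

# Named AI fetcher bots to look for in the User-Agent field.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User", "Bingbot", "CCBot",
]

def count_ai_hits(log_lines):
    """Count hits per AI bot by substring match on each log line."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # attribute each request line to one bot
    return hits

# Illustrative access-log lines (user agent quoted at the end).
sample = [
    '1.2.3.4 - - [10/Jan/2026] "GET /pricing HTTP/1.1" 200 "GPTBot/1.2"',
    '5.6.7.8 - - [10/Jan/2026] "GET /blog HTTP/1.1" 200 "PerplexityBot/1.0"',
    '9.9.9.9 - - [10/Jan/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0"',
]
print(count_ai_hits(sample))
```

Run daily over real logs, this gives the per-bot hit counts that reveal whether the cadence described above is holding for a given site.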

Get a free Sentinel snapshot →

ChatGPT, GPTBot, OAI-SearchBot, ChatGPT-User are trademarks of OpenAI. Claude, ClaudeBot are trademarks of Anthropic. PerplexityBot is a trademark of Perplexity AI. Google-Extended, Googlebot are trademarks of Google LLC. Applebot is a trademark of Apple Inc. Bingbot is a trademark of Microsoft. Promagen Ltd is independent of these companies.
