AI crawler behaviour benchmark — 2026 Q2
Twelve named bots, three categories, vendor-documented and Sentinel-observed.
The benchmark covers twelve named AI crawler user agents and two robots.txt usage-control tokens used by OpenAI, Anthropic, Perplexity, Google, Apple, Microsoft, and Common Crawl. Each row distinguishes the kind of bot (autonomous crawler, user-triggered fetcher, robots.txt token), describes its documented robots.txt behaviour, and notes what Sentinel observed during Q2.
By Martin Yarnold

The bot benchmark — Q2 2026
All entries reference each vendor's public crawler docs. The Kind column distinguishes autonomous crawlers (controlled by robots.txt) from user-triggered fetchers (where vendor docs say robots.txt may not apply in the same way) and from robots.txt usage-control tokens (no HTTP fetcher of their own — never appear in access logs).
| Bot | Vendor | Kind | Documented robots.txt behaviour | Observed Q2 2026 | Docs |
|---|---|---|---|---|---|
| GPTBot | OpenAI | Autonomous crawler | Honours robots.txt allow/disallow per OpenAI bot docs. | Active on most monitored sites; stable hit rate across Q2. | platform.openai.com/docs/bots |
| OAI-SearchBot | OpenAI | Autonomous crawler | Honours robots.txt allow/disallow per OpenAI bot docs. | Hit rate continued to ramp relative to GPTBot — consistent with ChatGPT Search expansion. | platform.openai.com/docs/bots |
| ChatGPT-User | OpenAI | User-triggered fetcher | OpenAI documents this as user-initiated; bot docs note robots.txt rules may not apply to user-initiated requests in the same way as to autonomous crawling. | Continued to reach pages whose autonomous-crawler counterpart was disallowed, consistent with vendor documentation. | platform.openai.com/docs/bots |
| ClaudeBot | Anthropic | Autonomous crawler | Honours robots.txt per Anthropic crawler docs. | Active on most monitored sites; stable hit rate. | support.claude.com (crawler article) |
| Claude-User | Anthropic | User-triggered fetcher | Documented by Anthropic as a user-triggered fetcher. | Lower frequency than the autonomous ClaudeBot; appears when users invoke Claude.ai with web access. | support.claude.com (crawler article) |
| Claude-SearchBot | Anthropic | Autonomous crawler | Honours robots.txt per Anthropic crawler docs. | Steady Q2 cadence. | support.claude.com (crawler article) |
| PerplexityBot | Perplexity | Autonomous crawler | Honours robots.txt per Perplexity crawler docs. | Active on most monitored sites. | docs.perplexity.ai/guides/bots |
| Perplexity-User | Perplexity | User-triggered fetcher | Perplexity has stated this generally does not treat robots.txt as binding because the fetch is user-initiated. | Continued to reach pages whose PerplexityBot counterpart was disallowed, consistent with the vendor stance documented since the 2024 Cloudflare reporting. | docs.perplexity.ai/guides/bots |
| Googlebot | Google | Autonomous crawler | Honours robots.txt allow/disallow. Also serves as the underlying fetcher for AI features in Search; AI Overviews / AI Mode are controlled separately via preview controls (nosnippet, data-nosnippet, max-snippet, noindex). | Highest baseline hit rate of the named bots; no material change from Q1. | developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers |
| Google-Extended | Google | robots.txt token | Robots.txt usage-control token; no HTTP fetcher. Controls whether Google may use crawled content for Gemini Apps / Vertex AI Gemini training and grounding. | Zero log hits expected and observed (token, not a UA). Operator misconception that disallowing Google-Extended affects AI Overviews remained the most common Q2 mistake. | developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers |
| Applebot | Apple | Autonomous crawler | Honours robots.txt per Apple crawler docs. | Lower frequency than the major commercial bots; consistent. | support.apple.com/en-us/119829 |
| Applebot-Extended | Apple | robots.txt token | Robots.txt usage-control token; no HTTP fetcher. Controls whether Apple may use Applebot-crawled data for Apple Intelligence and generative AI training. | Zero log hits expected and observed (token, not a UA). | support.apple.com/en-us/119829 |
| Bingbot | Microsoft | Autonomous crawler | Honours robots.txt per Microsoft webmaster docs. | Active on most monitored sites; serves Bing search and Copilot retrieval. | bing.com/webmasters/help/which-crawlers-does-bing-use |
| CCBot | Common Crawl | Autonomous crawler | Honours robots.txt per Common Crawl docs. | Multi-quarter frequency slowdown continued through Q2. | commoncrawl.org/ccbot |
Methodology
For each named bot Sentinel records: (1) the user-agent string actually appearing in access logs across the monitored site set; (2) the vendor docs page describing the bot at the close of the quarter; (3) the observed robots.txt behaviour relative to the vendor's documented behaviour. Q2 2026 observations describe measurements taken across April–June 2026.
Robots.txt behaviour is verified by adding a temporary disallow rule for a single bot path and watching access logs to confirm that bot stops fetching the disallowed path within its cache window (typically minutes to hours; some bots take up to 24 hours). Bots not verifiable by this method (the user-triggered fetchers and the robots.txt-only tokens) are described per vendor docs only.
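The before/after check described above can be sketched in a few lines. The log format (Apache combined), bot name, path, and cutoff time below are illustrative assumptions, not part of Sentinel's actual tooling:

```python
import re
from datetime import datetime

# Matches Apache combined log format: timestamp, request path, and the
# final quoted field (the User-Agent). Assumed format, not Sentinel's.
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def hits_before_after(lines, bot_ua, path_prefix, deployed_at):
    """Count hits by `bot_ua` to `path_prefix` before and after the
    moment `deployed_at` when the temporary disallow rule went live."""
    before = after = 0
    for line in lines:
        m = LOG_RE.search(line)
        if not m or bot_ua not in m.group("ua"):
            continue
        if not m.group("path").startswith(path_prefix):
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if ts < deployed_at:
            before += 1
        else:
            after += 1
    return before, after
```

A non-zero `before` count with `after` dropping to zero once the bot's cache window has elapsed is the compliance signal the methodology looks for.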
Frequently asked questions
Why publish a quarterly benchmark?
AI crawler behaviour shifts on quarterly cadences as vendors release new bot versions, change crawl rates, or update robots.txt documentation. A quarterly view catches these shifts faster than an annual report would, while remaining slow enough to filter out day-to-day measurement noise. The 2026 Q2 view describes what each named bot did between April and June 2026, with vendor docs cross-checked at the close of the quarter.
Which bots are in the benchmark?
Twelve named user agents from the four major AI vendors plus Microsoft, Apple, and Common Crawl: GPTBot, OAI-SearchBot, ChatGPT-User (OpenAI); ClaudeBot, Claude-User, Claude-SearchBot (Anthropic); PerplexityBot, Perplexity-User (Perplexity); Googlebot (Google); Applebot (Apple); Bingbot (Microsoft); CCBot (Common Crawl). Plus two robots.txt usage-control tokens that carry no HTTP fetcher of their own: Google-Extended (Google) and Applebot-Extended (Apple). The token rows are kept separate because they do not appear in server logs as a fetching UA.
What is the autonomous-crawler vs user-triggered-fetcher split?
Each major engine operates two distinct kinds of named bot. Autonomous crawlers (GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, PerplexityBot, Googlebot, Applebot, Bingbot, CCBot) crawl on their own schedule and honour robots.txt allow/disallow rules per their vendor docs. User-triggered fetchers (ChatGPT-User, Claude-User, Perplexity-User) fetch in response to a user action in the engine; OpenAI's bot docs note that robots.txt rules may not apply to user-initiated requests in the same way as to autonomous crawling, and Perplexity has stated Perplexity-User generally does not treat robots.txt as binding because the fetch is user-initiated. The two buckets need different operator strategies — blocking the autonomous crawler does not necessarily block the user-triggered fetcher.
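A minimal robots.txt sketch of that split, using OpenAI's bot pair as the example. Whether the user-triggered fetcher honours its rule is, per the vendor documentation quoted above, not guaranteed:

```text
# Block OpenAI's autonomous crawler site-wide.
User-agent: GPTBot
Disallow: /

# Address the user-triggered fetcher separately. Per OpenAI's bot docs,
# robots.txt may not apply to user-initiated requests in the same way,
# so treat this rule as best-effort signalling.
User-agent: ChatGPT-User
Disallow: /
```

The same pattern applies to the Anthropic and Perplexity pairs: a rule for the autonomous crawler says nothing about its user-triggered counterpart.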
Why are Google-Extended and Applebot-Extended treated separately?
Both Google and Apple publish robots.txt usage-control tokens that look like bot names but are not log-visible HTTP fetchers. Google-Extended is a robots.txt directive that controls whether crawled content may be used for Gemini Apps and Vertex AI Gemini training and grounding; the actual crawling is performed by Googlebot. Applebot-Extended is a directive that controls whether Apple may use Applebot-crawled data for Apple Intelligence; the crawling is performed by Applebot. Looking for either string in access logs returns zero hits even when the corresponding AI usage is allowed. The benchmark treats them as a separate category because operator behaviour differs — these tokens are set in robots.txt but never expected to appear in server logs.
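A minimal robots.txt sketch of the token category. These rules withhold AI-usage permission while leaving Googlebot and Applebot free to crawl, and neither token will ever appear in access logs:

```text
# Usage-control tokens only: these names never fetch pages themselves.
# Disallowing them signals "do not use crawled content for the listed
# AI purposes"; Googlebot and Applebot continue to crawl normally.
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Note the Q2 misconception from the table above: disallowing Google-Extended does not affect AI Overviews, which are governed by preview controls instead.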
Which bot is most active on a typical site in Q2?
This is operationally observed, varies by site profile, and is not vendor-published. The autonomous crawlers most likely to appear in server logs across general commercial sites are Googlebot, Bingbot, GPTBot, ClaudeBot, and PerplexityBot, with CCBot and Applebot less frequent on most sites. Sites with strong inbound links and frequent content updates see higher AI-engine crawl rates than slow-moving sites. Treat this as a directional pattern; absolute numbers depend entirely on the site's crawlable surface area and update cadence.
Did crawl rates change Q1 → Q2 2026?
Operationally, the autonomous-crawler hit rates remained relatively stable through Q2 across the sites Sentinel monitors. The most notable shift was continued ramp of OAI-SearchBot relative to GPTBot, consistent with ChatGPT Search becoming more central to OpenAI's consumer product surface. CCBot frequency continued the multi-quarter slowdown observed across 2025–2026. None of these rate observations are vendor-confirmed; treat them as Sentinel's server-log observation against a fixed monitoring set.
How can I run my own crawler benchmark?
Three components: (1) extract user-agent strings from your access logs, filtered to the named AI bots above; (2) cross-check each observed UA against its vendor docs page (the Q2 links are in the Docs column of the benchmark table above); (3) verify robots.txt behaviour by adding a temporary disallow for one bot and watching logs to confirm the bot stops fetching the disallowed path within its cache window. The methodology is straightforward; the discipline is running it on a repeatable cadence so quarter-on-quarter shifts become visible. Sentinel automates this as part of the weekly Sentinel cycle.
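Step (1) can be sketched as a short script. The log format (Apache combined) and the UA substrings below are assumptions; the authoritative UA strings are in each vendor's docs:

```python
import re
from collections import Counter

# Substrings identifying the benchmark's named bots in a User-Agent
# header. Illustrative; check each vendor's docs for the exact strings.
NAMED_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Googlebot", "Applebot", "Bingbot", "CCBot",
]

# In combined log format the User-Agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def count_bot_hits(log_lines):
    """Count hits per named bot, matching longest names first so a
    longer bot name is never misattributed to a shorter substring."""
    by_length = sorted(NAMED_BOTS, key=len, reverse=True)
    counts = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        if not m:
            continue
        ua = m.group(1)
        for bot in by_length:
            if bot in ua:
                counts[bot] += 1
                break
    return counts
```

Running this weekly and diffing the resulting counters is enough to surface the kind of quarter-on-quarter shifts this benchmark reports.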
What about third-party crawler aggregators that resell content to AI engines?
Common Crawl (CCBot) is the largest such aggregator — its crawl results are an open dataset that many AI labs ingest as training material. CCBot itself documents robots.txt compliance. The major engines (OpenAI, Anthropic, Google, Perplexity) increasingly crawl directly with their own named bots; for those engines, blocking CCBot has limited effect on direct citation, more effect on long-tail and research uses. Beyond CCBot, smaller commercial scrapers exist that do not honour robots.txt and do not identify themselves with a documented UA — these are out of scope for this benchmark because they cannot be reliably named or attributed.