
The five signals AI engines use to cite pages

Reachability, entity clarity, substance, authority, freshness — documented vs observed.

AI engines do not publish their citation ranking. What they do publish — combined with what is operationally observable across the four major engines — points to five page-level signals operators can verifiably move. This page describes each signal, separates documented from observed, and gives the concrete operator action that affects each.

By Martin Yarnold · Updated
Five-signal audit
Sentinel measures all five signals weekly per page across the monitored site set, plus a citation-rate time series across the four major engines.
See how Sentinel measures it →

The five signals, in fix order

Each signal is documented per its vendor support (where it has any), observed per Sentinel's measurement, and tied to a concrete operator action. Reachability comes first because every other signal is irrelevant if the engine cannot fetch the page; entity clarity comes second because reading without disambiguation produces unreliable citations.

1. Reachability

Documented: Every major engine publishes its bot user agents and documents robots.txt compliance for autonomous crawlers (GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, PerplexityBot, Googlebot, Applebot, Bingbot, CCBot). User-triggered fetchers (ChatGPT-User, Claude-User, Perplexity-User) get separate documentation per vendor.
Observed: WAF / edge blocks are the most common cause of "engine cannot see us" diagnoses. Slow TTFB (above roughly 300 ms) causes retrieval-driven engines to skip pages fetched on demand at answer time.
Action: Allow GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Googlebot, and Bingbot in robots.txt and at the WAF / firewall layer, and keep TTFB under 300 ms on key pages. This is the single highest-leverage step for AI visibility.
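The robots.txt side of this action can be sketched as follows. The single group with a blanket Allow is illustrative; the user-agent tokens are the published ones named above, and per RFC 9309 several User-agent lines may share one rule group.

```
# Allow the autonomous AI crawlers at the robots.txt layer.
# The WAF / firewall layer must also allow these user agents.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Googlebot
User-agent: Bingbot
Allow: /
```

A robots.txt allow alone is not sufficient: WAF or edge rules that fingerprint these user agents will still block the fetch, which is why the action covers both layers.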
2. Entity clarity

Documented: Schema.org structured data is documented as a parser input by every major engine for Search-style retrieval. AI engine vendors do not publish schema.org as a citation ranking input specifically.
Observed: Pages with populated JSON-LD (Organization, Article, BreadcrumbList, plus Product / FAQPage / Person where appropriate) cite more consistently than pages without. The mechanism is entity disambiguation — the engine can identify what the page is about and connect it to the rest of the site.
Action: JSON-LD coverage on every page Sentinel monitors: Organization (one node, referenced by @id), page-level Article or WebPage, BreadcrumbList. Commercial pages: Product, FAQPage with stable @id anchors. Canonical URL on every page. Consistent inLanguage and hreflang where multilingual.
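As a sketch, the action above might look like this on a monitored page. The @id values, names, and URLs are placeholders; the pattern being illustrated is one Organization node referenced by @id, a page-level Article node, and a BreadcrumbList, all in a single JSON-LD graph.

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://example.com/#org",
      "name": "Example Ltd",
      "url": "https://example.com/"
    },
    {
      "@type": "Article",
      "@id": "https://example.com/guides/widgets#article",
      "headline": "Choosing a widget",
      "inLanguage": "en-GB",
      "publisher": { "@id": "https://example.com/#org" }
    },
    {
      "@type": "BreadcrumbList",
      "@id": "https://example.com/guides/widgets#breadcrumb",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Guides", "item": "https://example.com/guides" },
        { "@type": "ListItem", "position": 2, "name": "Choosing a widget", "item": "https://example.com/guides/widgets" }
      ]
    }
  ]
}
```

Referencing the Organization by @id rather than duplicating it per page is what lets the engine connect the page to the rest of the site.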
3. Substantive content

Documented: Not vendor-documented as an AI citation input.
Observed: Pages with depth, uniqueness, and clear factual structure cite more consistently than thin or boilerplate pages. The proxy that correlates best with citation rate is FAQ schema coverage with substantive (60-150 word) answers and at least 800 words of unique editorial.
Action: Substantive content on the pages that matter commercially. FAQ blocks with stable @id anchors and answers long enough to lift verbatim into a generated answer. Avoid thin content; avoid generic boilerplate; avoid AI-generated wallpaper.
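An FAQ block with stable @id anchors might be marked up like this. The question, answer, and URLs are placeholders; the answer text should run 60-150 words as described above so it can be lifted verbatim.

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "@id": "https://example.com/pricing#faq",
  "mainEntity": [
    {
      "@type": "Question",
      "@id": "https://example.com/pricing#faq-refunds",
      "name": "What is the refund policy?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A substantive 60-150 word answer goes here, long enough to stand alone when quoted in a generated answer."
      }
    }
  ]
}
```

Keeping the fragment anchors (#faq-refunds) stable across edits means a citation that deep-links to the answer does not break when the page is updated.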
4. Authority signals

Documented: Not vendor-documented as an AI citation input.
Observed: Pages with named author bylines, citations to primary sources, and substantive internal linking cite more consistently than anonymous, source-less, isolated pages. The mechanism is presumably entity clarity at the editorial layer — the engine can identify who said what and where it sits in the site's knowledge graph.
Action: Named author byline on every editorial page (Person schema with stable @id, time element with a datetime attribute). Citations to primary sources where claims are made. Internal links from related pages into the page being optimised; avoid orphans.
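A byline satisfying this action could look like the following. The author name, URLs, and date are placeholders; the parts being illustrated are the visible byline, the time element with its datetime attribute, and the Person node with a stable @id.

```html
<p class="byline">
  By <a href="https://example.com/authors/jane-doe">Jane Doe</a> ·
  Updated <time datetime="2026-05-10">10 May 2026</time>
</p>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "@id": "https://example.com/authors/jane-doe#person",
  "name": "Jane Doe",
  "url": "https://example.com/authors/jane-doe"
}
</script>
```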
5. Freshness signals

Documented: datePublished and dateModified are documented schema.org properties; engines parse them as standard structured data.
Observed: Pages with genuine recent edits are retrieved more frequently than dormant pages, particularly for time-sensitive query types. Date-bumping a page without real edits does not appear to move citation rate; engines seem to detect content-vs-date drift.
Action: Keep important pages updated with real edits (not date-bumped boilerplate). Bump dateModified only on material content changes. For tier-4 / time-stamped intelligence reports, set changefreq=weekly in the sitemap and update on a real cadence.
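The "bump only on material change" rule can be enforced mechanically by hashing the editorial body and updating dateModified only when the hash moves. This is a minimal sketch, assuming you store a content hash alongside each page's schema dates; the function and field names are illustrative, not a Promagen API.

```python
import hashlib
from datetime import date

def maybe_bump_date_modified(body_text: str, stored: dict) -> dict:
    """Update dateModified only when the editorial body actually changed.

    `stored` holds the previous state, e.g.
    {"hash": "...", "dateModified": "2026-01-01"} (illustrative shape).
    """
    digest = hashlib.sha256(body_text.encode("utf-8")).hexdigest()
    if digest != stored.get("hash"):
        # Material content change: record the new hash and bump the date.
        stored["hash"] = digest
        stored["dateModified"] = date.today().isoformat()
    # Unchanged body: leave dateModified alone, avoiding content-vs-date drift.
    return stored
```

Because the date only moves with the hash, a republish that changes nothing cannot produce the date-bump-without-edits pattern that engines appear to discount.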

Frequently asked questions

Why these five signals?

These are the five operator-controllable signals that consistently appear in vendor crawler documentation, observable AI engine behaviour, or both. They are not the engines' full ranking model — that is not published — but they are the structural surface operators can verifiably move. Other signals exist (the engines' internal retrieval scoring, query-time embedding choices, model-version effects) but they are either invisible to operators or not operator-controllable, so they belong in a different conversation.

Which of the five signals are vendor-published?

Reachability is partially vendor-published: every major engine publishes its bot user agents and documents robots.txt compliance for autonomous crawlers; user-triggered fetcher behaviour is documented per vendor (OpenAI's ChatGPT-User and Perplexity's Perplexity-User get separate treatment in their respective docs). Entity clarity, substantive content, authority signals, and freshness are not vendor-published as ranking inputs — they are operational hygiene that observably correlates with citation rate without being officially documented as a citation contract.

Which signal matters most?

Reachability, by a wide margin. Every other signal is irrelevant if the engine cannot fetch the page. The most common citation failure mode is structural: WAF edge blocks, robots.txt rules, or slow response times stopping the engine from reading the page at all. After reachability, entity clarity matters most, because a page the engine can read but cannot disambiguate from generic content cites less consistently than a clearly identified entity. Substantive content, authority, and freshness are second-order signals: they shift citation probability at the margin once reachability and entity clarity are solved.

How do I measure where I am weak on these signals?

Reachability is measurable from server logs and curl tests against the named bot user agents. Entity clarity is measurable by validating JSON-LD coverage against a schema validator and grepping for canonical, hreflang, and Organization @id consistency. Substantive content is harder to measure directly — proxy with word count, FAQ schema coverage, and unique-vs-boilerplate ratio. Authority signals are measurable by author-byline coverage, internal-link counts, and external-source citation density. Freshness is measurable by datePublished/dateModified spread across the site. Promagen Sentinel runs all five layers weekly per page on the monitored set.
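The robots.txt half of the reachability check can be scripted with the standard library. A minimal sketch, with the robots.txt text inlined (in practice you would fetch it); the rules and URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: GPTBot is fenced out of /private/,
# everything else is allowed for all agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check each named AI crawler against a key page.
for bot in ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]:
    allowed = rp.can_fetch(bot, "https://example.com/pricing")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

This only tests the robots.txt layer; WAF blocks require an actual fetch sent with each bot's user-agent string (the curl test mentioned above), since edge rules act before robots.txt is ever consulted.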

How do these signals shift over time?

The signals themselves are stable across model versions; the relative weighting between them is what shifts. AI engine releases change retrieval behaviour quietly — the same page can become more or less consistently cited from one model version to the next without any operator-side change. This is why Sentinel measures citation rate as a per-engine time series rather than a single absolute number: the trend is the actionable signal, not the per-week absolute count. Operators monitoring this can detect engine-side shifts within a week or two of model release.
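The trend-over-absolute-count point can be sketched as a simple window comparison on a per-engine weekly series. The window length and threshold here are illustrative assumptions, not Sentinel's actual method:

```python
def detect_shift(weekly_rates: list[float], window: int = 4, threshold: float = 0.25) -> bool:
    """Flag a likely engine-side shift when the mean citation rate of the
    most recent window diverges from the prior window's mean by more than
    the given fraction. Window and threshold are illustrative."""
    if len(weekly_rates) < 2 * window:
        return False  # not enough history to compare two windows
    prior = weekly_rates[-2 * window:-window]
    recent = weekly_rates[-window:]
    prior_mean = sum(prior) / window
    recent_mean = sum(recent) / window
    if prior_mean == 0:
        return recent_mean > 0
    return abs(recent_mean - prior_mean) / prior_mean > threshold

# Example: stable for six weeks, then citation rate halves after a model release.
history = [0.40, 0.42, 0.41, 0.39, 0.40, 0.41, 0.20, 0.19, 0.21, 0.20]
```

Run per engine, a comparison like this flags the drop within a window of the model release even though no operator-side change occurred, which is the behaviour the time-series framing is designed to catch.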

What order should I fix these in?

In order: (1) reachability — WAF and robots.txt allows for the autonomous crawlers; (2) entity clarity — JSON-LD coverage and canonical/hreflang correctness; (3) substantive content — depth and uniqueness on the pages that matter commercially; (4) authority signals — author byline + primary-source citations + internal linking on those same pages; (5) freshness — keep important pages updated with real edits, not date-bumped boilerplate. Skip steps and the diagnosis becomes ambiguous; fix in order and each layer's effect is measurable.

How does Sentinel help with the five signals?

Sentinel measures all five signals weekly per page on the monitored site set: reachability via bot fetch tests, entity clarity via schema validation, substantive content via depth proxies, authority via byline + internal-link coverage, and freshness via datePublished/dateModified analysis. The output is a per-signal pass/fail per page plus a citation-rate time series across the four major engines. The combination shows which signal is weakest on which page, and whether weakness correlates with citation drift.

Get a free Sentinel snapshot →

Reachability documentation references each vendor's published crawler docs as of 10 May 2026. Entity clarity, substantive content, authority signals, and freshness are operational observations from Sentinel's per-engine measurement, not vendor-documented citation contracts. ChatGPT, Claude, Perplexity, Gemini are trademarks of their respective owners. Promagen Ltd is independent of these companies.