TL;DR
GPTBot, ClaudeBot, PerplexityBot, and Google-Extended are crawling your site right now. The URLs they 404 on are the articles AI engines expect to find but can't. Map those 404s daily, route each into a redirect or a draft, and you have a free AEO content engine. We built ours on Cloudflare and Notion in one day. May 2026.
What's an AI demand engine?
A pipeline that reads your server logs, filters for verified AI bot traffic from agents like GPTBot, ClaudeBot, and PerplexityBot, captures the 404s, and persists them into a queryable database. Every row is a URL an AI engine tried to fetch on your domain. Every 404 row is a URL the model expected to exist on your site but doesn't.
That gap between what the model expects and what you've shipped is the most actionable signal in AEO. Nobody else has it. Peec, Profound, and every other probability-based AEO tool guesses what models think about you by querying them from outside. Your server logs capture the models querying you directly. One is a sample. The other is the ground truth.
For a primer on AEO itself, see our complete guide to Answer Engine Optimization in 2026. This piece assumes you already buy the premise and want the operational layer underneath it.
Which AI bots actually matter in 2026?
The bot population shifted in early 2026. The current shortlist of agents you want to track on any B2B SaaS domain:
| User agent | Engine | Signal type | What it tells you |
|---|---|---|---|
| GPTBot | OpenAI / ChatGPT | Training crawl | What ChatGPT is learning about your domain for future model snapshots |
| OAI-SearchBot | OpenAI / ChatGPT Search | Real-time retrieval | What ChatGPT pulled when a user asked about your topic |
| ChatGPT-User | OpenAI / ChatGPT | Real-time retrieval | User-triggered fetch during a specific conversation |
| ClaudeBot | Anthropic / Claude | Training crawl | Claude's view of your domain |
| Claude-User | Anthropic / Claude | Real-time retrieval | Triggered when a Claude user asked something your domain might answer |
| PerplexityBot | Perplexity | Real-time retrieval | What Perplexity surfaced for the user's prompt |
| Google-Extended | Google / Gemini, AI Overviews | Training + retrieval | Google's AI-surface eligibility crawl |
| CCBot | Common Crawl | Training corpus | Bulk dataset that downstream LLMs train on |
| Applebot-Extended | Apple Intelligence | Training crawl | Apple's AI surface eligibility |
| Bingbot | Microsoft / Copilot | Search + AI | Bing index + Copilot retrieval |
| Meta-ExternalAgent | Meta AI | Real-time retrieval | What Meta AI pulled on a user query |
| AdsBot-Google-Mobile | Google Ads | Mobile-friendliness probe | Often probes URLs before they ship (useful leakage signal) |
Two categories matter for the 404 demand-signal layer specifically. Real-time retrieval bots (OAI-SearchBot, PerplexityBot, Claude-User, Meta-ExternalAgent) 404 on URLs the model just decided to fetch in a live conversation. Training bots (GPTBot, ClaudeBot, CCBot, Google-Extended) 404 on URLs the model believes should exist based on prior training. Both signal content demand. The real-time ones convert faster into traffic; the training ones compound into your AI brand identity for years.
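The split is easy to encode so every logged 404 carries its signal type. A minimal sketch, assuming a plain substring match on the user-agent is good enough for a first pass; the names `SIGNAL_TYPE_BY_BOT` and `classifySignal` are illustrative, not part of any library:

```javascript
// Hypothetical lookup built from the two categories described above.
// Google-Extended is left out on purpose: Google documents it as a
// robots.txt token with no user-agent string of its own, so it cannot
// be matched in a log line.
const SIGNAL_TYPE_BY_BOT = {
  "OAI-SearchBot": "real-time",
  "ChatGPT-User": "real-time",
  "Claude-User": "real-time",
  "PerplexityBot": "real-time",
  "Meta-ExternalAgent": "real-time",
  "GPTBot": "training",
  "ClaudeBot": "training",
  "CCBot": "training",
};

// Returns { bot, signalType } for the first tracked token found in the
// user-agent string, or null when nothing on the list matches.
function classifySignal(userAgent) {
  const ua = String(userAgent).toLowerCase();
  for (const [bot, signalType] of Object.entries(SIGNAL_TYPE_BY_BOT)) {
    if (ua.includes(bot.toLowerCase())) return { bot, signalType };
  }
  return null;
}
```

Storing the signal type next to each row lets the weekly review sort real-time 404s to the top, since those are the ones that convert fastest.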
How does this differ from regular SEO log analysis?
SEO log analysis has been a thing since the 2000s. Botify, Screaming Frog, OnCrawl, JetOctopus all built tools for it. The job was finding crawl-budget waste and broken URLs that hurt Googlebot. The signal was indexation health. The audience was the technical SEO team.
AI log analysis is different in two ways. First, the bot population shifted. ChatGPT-User, PerplexityBot, GoogleOther, Meta-ExternalAgent, Anthropic's ClaudeBot, and Apple's Applebot now generate a real share of traffic on any site that ranks for buyer queries. Second, the signal isn't indexation, it's retrieval. When PerplexityBot fetches your /best-x-comparison URL after a user asked Perplexity "what's the best x," that fetch is the retrieval moment. Logging it gives you ground truth on what was actually retrieved when, for which prompt-shape, by which engine.
Most SEO log tools weren't built for that. Botify and Screaming Frog let you filter by bot user agent but don't reframe 404s as content opportunities. JetOctopus and OnCrawl do better at signaling AI bot patterns but still optimize for crawl health. The pipeline we describe below isn't a substitute for those tools, it's a layer on top: persist the right 404s into a calendar workflow so they become content decisions instead of yet another log-analyzer dashboard.
Why are AI bot 404s the highest-signal layer?
Three reasons.
The model has already done the categorization. When an AI engine fetches a URL, it has already decided this URL is the kind of thing that should exist on your domain for the query it's answering. You don't have to guess if a topic is relevant. The engine guessed for you.
The 404 specifically encodes a gap. A 200 response means you already have the page. A 404 means the engine wanted the page and you don't have it. That's literally latent content demand. The model just told you what to write.
The signal compounds with retrieval position. The more your domain gets cited by AI engines, the more bots probe URLs that don't yet exist. High-retrieval sites get more 404 telemetry because more agents are reading them. The signal scales with the thing you actually want, which is share of answer in AI surfaces.
How do you query Cloudflare for this data?
Cloudflare exposes a GraphQL Analytics API. The dataset you want is httpRequestsAdaptiveGroups. The filters that matter:
- verifiedBotCategory_in: restrict to the verified AI Crawler, Search Engine Crawler, and Advertising bot categories
- edgeResponseStatus = 404: only the demand-signal rows
- clientRequestPath: group by URL
- userAgent: keep for forensics on which bot is asking
A minimal query, scoped to the past 24 hours:

    query AIBot404s($zone: String!, $since: String!, $until: String!) {
      viewer {
        zones(filter: { zoneTag: $zone }) {
          httpRequestsAdaptiveGroups(
            limit: 1000
            filter: {
              datetime_geq: $since
              datetime_leq: $until
              edgeResponseStatus: 404
              verifiedBotCategory_in: ["AI Crawler", "Search Engine Crawler", "Advertising & Marketing"]
            }
          ) {
            count
            dimensions {
              clientRequestPath
              userAgent
              verifiedBotCategory
            }
          }
        }
      }
    }

Token scope: Zone Analytics Read on the zones you care about. Free tier is fine for a 24-hour window; longer windows need the paid analytics API. Most sites only need 24-hour rolling.
You get back a list of paths, the bot categories that hit them, the user agents, and the request counts. That's the input.
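Sending the query is one authenticated POST to Cloudflare's GraphQL endpoint. A minimal sketch, assuming Node 18+ (global `fetch`) and an API token passed in by the caller; `buildWindow` and `fetchAiBot404s` are illustrative names, and the output field names (`path`, `count`, and so on) are our own choice rather than anything Cloudflare returns:

```javascript
// Cloudflare's GraphQL Analytics endpoint.
const CF_GRAPHQL_URL = "https://api.cloudflare.com/client/v4/graphql";

// Builds the query variables for a rolling 24-hour window ending at `now`.
// Milliseconds are stripped so the timestamps look like 2026-05-24T00:00:00Z.
function buildWindow(zoneTag, now = new Date()) {
  const iso = (d) => d.toISOString().replace(/\.\d{3}Z$/, "Z");
  const since = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  return { zone: zoneTag, since: iso(since), until: iso(now) };
}

// Posts the query text and flattens the response into plain rows.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function fetchAiBot404s(query, zoneTag, apiToken, fetchImpl = fetch) {
  const res = await fetchImpl(CF_GRAPHQL_URL, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query, variables: buildWindow(zoneTag) }),
  });
  if (!res.ok) throw new Error(`Cloudflare API returned HTTP ${res.status}`);

  const payload = await res.json();
  // GraphQL reports query errors in the body, even when the HTTP status is 200.
  if (payload.errors && payload.errors.length > 0) {
    throw new Error(payload.errors.map((e) => e.message).join("; "));
  }

  const groups = payload.data.viewer.zones[0]?.httpRequestsAdaptiveGroups ?? [];
  return groups.map((g) => ({
    path: g.dimensions.clientRequestPath,
    userAgent: g.dimensions.userAgent,
    category: g.dimensions.verifiedBotCategory,
    count: g.count,
  }));
}
```

Checking `payload.errors` matters here: a wrong field name or an out-of-range window comes back as a normal HTTP 200 with an error list, which a status-code check alone would miss.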
How do you filter signal from noise at the user-agent level?
Verified-bot categories help, but they bucket too broadly. The cleanest approach is a positive-match regex on user agent. The shortlist we run:

    const AI_BOT_PATTERNS = [
      /GPTBot/i,               // OpenAI training
      /OAI-SearchBot/i,        // ChatGPT search retrieval
      /ChatGPT-User/i,         // ChatGPT conversation fetch
      /ClaudeBot/i,            // Anthropic training
      /Claude-User/i,          // Claude conversation fetch
      /Anthropic-AI/i,         // Anthropic alt UA
      /PerplexityBot/i,        // Perplexity
      /Perplexity-User/i,      // Perplexity conversation fetch
      /Google-Extended/i,      // Google AI-surface eligibility
      /GoogleOther/i,          // Google misc AI crawl
      /CCBot/i,                // Common Crawl
      /Applebot-Extended/i,    // Apple Intelligence training
      /Meta-ExternalAgent/i,   // Meta AI
      /Meta-ExternalFetcher/i, // Meta retrieval
      /Bytespider/i,           // ByteDance / Doubao
    ];

Drop everything that isn't on this list when computing the demand-signal table. Keep raw user-agent strings in a separate column for forensics. Update the list quarterly; new agents appear and old ones get renamed.
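Applying the list is a single pass over the rows. A minimal sketch, assuming each row carries a `userAgent` field; `PATTERNS` is an abbreviated copy of the list so the example stands alone, and `filterAiBotRows` is an illustrative name:

```javascript
// Abbreviated copy of the pattern list, kept short for the example.
const PATTERNS = [/GPTBot/i, /OAI-SearchBot/i, /ClaudeBot/i, /PerplexityBot/i, /CCBot/i];

// Keeps only rows whose user-agent matches a tracked bot. Each kept row is
// tagged with the matched token and keeps the raw string for forensics.
function filterAiBotRows(rows, patterns = PATTERNS) {
  const kept = [];
  for (const row of rows) {
    const hit = patterns.find((re) => re.test(row.userAgent));
    if (hit) kept.push({ ...row, bot: hit.source, rawUserAgent: row.userAgent });
  }
  return kept;
}
```

The patterns deliberately carry no `g` flag: a global regex keeps state between `test()` calls and would skip matches on alternating rows.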
How do you persist the signal into a usable shape?
The query returns a 24-hour snapshot. To turn that into a content engine, you need to deduplicate across days, accumulate request counts, and timestamp the last-seen date per URL. A simple key on the request path upserts cleanly into any KV store, database, or, in our case, a Notion database.
We use Notion because the content team already lives there. Every captured 404 lands in an AI Bot 404 Patterns table with six columns: Path, URL, Total Requests, Last Seen Date, Last Bot Category, Last User-Agent. New URLs append. Existing URLs update their counters and last-seen. We run the sync nightly via a scheduled worker.
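The accumulate-and-timestamp step is independent of where the table lives. A minimal sketch using an in-memory `Map` keyed by path, assuming rows shaped as `{ path, count, category, userAgent }` and ISO date strings; `mergeDailyRows` and the record field names are illustrative:

```javascript
// Merges one day's rows into a running table keyed by path.
// `table` is a Map of path -> record; `seenDate` is an ISO date like "2026-05-24".
function mergeDailyRows(table, rows, seenDate) {
  for (const row of rows) {
    const existing = table.get(row.path);
    if (!existing) {
      // New URL: append a fresh record.
      table.set(row.path, {
        path: row.path,
        totalRequests: row.count,
        lastSeen: seenDate,
        lastCategory: row.category,
        lastUserAgent: row.userAgent,
      });
      continue;
    }
    // Known URL: accumulate the counter, and refresh the "last" fields only
    // when this batch is at least as recent (ISO dates sort as strings).
    existing.totalRequests += row.count;
    if (seenDate >= existing.lastSeen) {
      existing.lastSeen = seenDate;
      existing.lastCategory = row.category;
      existing.lastUserAgent = row.userAgent;
    }
  }
  return table;
}
```

One caveat with this shape: merging the same day's pull twice double-counts. If the nightly job can retry, record which dates have already been merged and skip repeats.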
The Notion side is a single POST to the API. Sketch:

    await notion.pages.create({
      parent: { database_id: AI_BOT_404_PATTERNS_DB_ID },
      properties: {
        "Path": { title: [{ text: { content: row.path } }] },
        "URL": { url: `https://${domain}${row.path}` },
        "Total Requests": { number: row.count },
        "Last Seen Date": { date: { start: row.lastSeen } },
        "Last Bot Category": { rich_text: [{ text: { content: row.category } }] },
        "Last User-Agent": { rich_text: [{ text: { content: row.userAgent } }] },
      },
    });

For an existing row, look up the page by its Path first, then call pages.update with that page's ID, since Notion updates are keyed on page ID rather than on a property value. The whole persistence layer is roughly 80 lines of code. The longest part is the upsert logic. The shortest part is the Cloudflare query.
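A minimal sketch of that upsert, assuming a version of `@notionhq/client` that exposes `databases.query` (newer API versions route queries through data sources instead) and the column names from the table above. The function name `upsert404Row` is illustrative, and the client is passed in so it can be stubbed:

```javascript
// Creates the row if the path is new, otherwise adds to its counter.
// `notion` is an initialized Notion client; `row` is { path, count, lastSeen, category, userAgent }.
async function upsert404Row(notion, databaseId, domain, row) {
  // Look for an existing page whose title property "Path" equals this path.
  const found = await notion.databases.query({
    database_id: databaseId,
    filter: { property: "Path", title: { equals: row.path } },
    page_size: 1,
  });

  // Properties that are overwritten on every sync.
  const latest = {
    "Last Seen Date": { date: { start: row.lastSeen } },
    "Last Bot Category": { rich_text: [{ text: { content: row.category } }] },
    "Last User-Agent": { rich_text: [{ text: { content: row.userAgent } }] },
  };

  if (found.results.length === 0) {
    return notion.pages.create({
      parent: { database_id: databaseId },
      properties: {
        "Path": { title: [{ text: { content: row.path } }] },
        "URL": { url: `https://${domain}${row.path}` },
        "Total Requests": { number: row.count },
        ...latest,
      },
    });
  }

  // Existing row: pages.update is addressed by page ID, not by URL.
  const page = found.results[0];
  const previous = page.properties["Total Requests"].number ?? 0;
  return notion.pages.update({
    page_id: page.id,
    properties: { "Total Requests": { number: previous + row.count }, ...latest },
  });
}
```

The read-then-write is not atomic, which is fine for a single nightly worker but would need a lock or a queue if two syncs could ever run at once.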
What does day-one data actually look like?
This is the part most playbooks skip. Real data is messier than tutorials suggest. Here's the literal first run from loudface.co on May 24, 2026:
| Path | Hits | Bot category | Implied query | Action taken |
|---|---|---|---|---|
/post/webflow-and-auth0-guide | 1 | Search Engine Crawler (Bingbot) | Old URL pattern indexed | Shipped catch-all 301 from /post/:slug to /blog/:slug |
/blog/cms-for-marketers-2026 | 1 | AI Crawler (PetalBot) | "Best CMS for marketers 2026" (slug guess) | Shipped specific 301 to canonical slug |
/blog/seo-traffic-not-converting-pipeline | 1 | Advertising (AdsBot-Google-Mobile) | Mobile-ad-eligibility probe on draft URL | Drafted full article, queued for ship |
/news-sitemap.xml | 1 | AI Crawler (Meta-ExternalAgent) | News content discovery | Deferred (we don't publish news) |
/security.txt | 2 | Search Engine Crawler (Dataprovider) | Standard security-contact probe | Ignored |
/.well-known/security.txt | 2 | Search Engine Crawler (Dataprovider) | Standard security-contact probe | Ignored |
/humans.txt | 1 | Search Engine Crawler (Dataprovider) | Standard humans-file probe | Ignored |
Seven rows. Three actionable. Three junk. One deferred.
The actionable rate on day one was 43%. That's higher than we expected, and we expect it to climb as the model corpus updates. A site that ranks well in AI engines gets more probes than a site that doesn't, so the absolute volume of useful 404 signal compounds with your retrieval position.
What we did with those three:
- /post/webflow-and-auth0-guide: shipped a catch-all 301 from /post/:slug to /blog/:slug within 4 hours. Closed every legacy URL still in any AI engine's index. Estimated time: 15 minutes including QA.
- /blog/cms-for-marketers-2026: shipped a specific 301 to the actual slug we published the piece under (/blog/webflow-best-cms-for-marketers). Estimated time: 5 minutes.
- /blog/seo-traffic-not-converting-pipeline: the article was already drafted. Google's ad bot probing the slug before publish was a strong signal that we'd left the URL exposed somewhere (likely a preview environment or sitemap leak). We finished the draft and queued it for ship.
Three signals captured. Three actions taken. Total operator time under an hour. The pipeline runs itself nightly from there.
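For reference, the two redirects look roughly like this in a Next.js config. A sketch assuming the site routes through Next.js `redirects()`; adapt the shape to whatever routing layer you run. One detail worth knowing: in Next.js, `permanent: true` answers with a 308, so a literal 301 needs `statusCode: 301` in place of `permanent`.

```javascript
// Sketch of the redirect rules as a Next.js config object.
const nextConfig = {
  async redirects() {
    return [
      // Specific rule: the slug a bot guessed, pointed at the published slug.
      {
        source: "/blog/cms-for-marketers-2026",
        destination: "/blog/webflow-best-cms-for-marketers",
        statusCode: 301,
      },
      // Catch-all: every legacy /post/ URL maps onto the same slug under /blog/.
      {
        source: "/post/:slug",
        destination: "/blog/:slug",
        statusCode: 301,
      },
    ];
  },
};

// Export it the way your project expects, for example `export default nextConfig`.
```

The catch-all only helps when old and new slugs are identical. Where a slug changed during a migration, add a specific rule for it.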
How do you turn every 404 into an action without a human in the loop?
You can't, fully. But you can structure the table so the human decision is trivial. We use this decision tree, applied weekly when reviewing the table:
1. Is the URL a junk probe (/humans.txt, /security.txt, random path attacks)? → Ignore. Optional: add a filter to suppress in the sync.
2. Is the URL an old slug that should resolve to a current page? → Ship a 301 redirect. Single-line change in next.config.ts or your routing layer. Done in minutes.
3. Is the URL one you could plausibly publish? → Add it as an Idea row in your content calendar. The bot is telling you what to write.
4. Is the URL something weird (XML sitemap variants, well-known files, vendor probes)? → Decide once whether to add it. Sites with news content add news-sitemap.xml. Most don't.
The decision tree fits on a sticky note. The point is to keep human attention on the only step that requires judgment: bucket 3. Everything else is mechanical.
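The mechanical buckets can be pre-sorted before a human opens the table. A minimal sketch, assuming a handful of path patterns drawn from the day-one rows; the pattern lists and the bucket labels are hypothetical and should be tuned to your own URL structure:

```javascript
// Hypothetical pattern lists; extend them as new probes show up.
const JUNK_PATTERNS = [/^\/humans\.txt$/i, /security\.txt$/i, /\.(php|env)$/i];
const LEGACY_PATTERNS = [/^\/post\/[^/]+$/];
const ODD_PATTERNS = [/sitemap[^/]*\.xml$/i, /^\/\.well-known\//];

// Returns a suggested bucket for a 404 path:
// "ignore", "redirect", "decide-once", or "review" (needs human judgment).
function triage(path) {
  if (JUNK_PATTERNS.some((re) => re.test(path))) return "ignore";
  if (LEGACY_PATTERNS.some((re) => re.test(path))) return "redirect";
  if (ODD_PATTERNS.some((re) => re.test(path))) return "decide-once";
  return "review";
}
```

Anything the patterns don't recognize falls through to "review", so the function can only save attention, never hide a candidate article.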
Why server logs beat probability-based AEO tools
The probability-based AEO tools (Peec, Profound, AirOps, BrandRank) work by querying LLMs from their servers, looking at the responses, and reverse-engineering what got cited. The output is a probability distribution: "we estimate ChatGPT cites loudface.co 14% of the time for this prompt."
That's useful for tracking trends. It's not useful for telling you what to write next, because the signal is downstream of retrieval. You see what the engine decided AFTER it decided. You don't see what the engine TRIED to retrieve and failed.
Server logs invert the angle. They capture what the engine actively tried to fetch from your domain, in real time, with the user-agent and timestamp as primary keys. You see the engine's intent before it produces an output. The 404s are the engine's intent meeting a missing page.
Both signals matter. We run both. But if you have to pick one to start with, server logs are higher-fidelity, lower-cost, and require no third-party subscription. The data is sitting in your hosting provider's logs right now.
Common mistakes when running this pipeline
Five failure modes we've watched competitors hit while shipping their own log-analysis tooling. Worth pre-empting.
Treating all 404s as crawl errors. This is the framing every traditional SEO log tool defaults to. A 404 is a redirect-or-fix problem in the old model. In the AI-bot context, a 404 from PerplexityBot is an unfilled query. Don't redirect to your homepage. Don't 410 it. Triage it.
Relying only on Cloudflare's bot category. The verified-bot bucket lumps GPTBot, CCBot, ClaudeBot, and PerplexityBot together as "AI Crawler" but you need finer-grained signal. Always keep the raw user-agent. The category is for prefiltering, not analysis.
Running this on too short a window. A 24-hour pull is fine for daily review, but the action surface depends on accumulated signal. Hold the data for 90 days minimum. We watched one team triage 404s as junk because volume was low on day three; by day forty those same paths had 30+ hits and were unambiguous content opportunities.
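Forgetting Google-Extended. Most posts on this topic mention GPTBot and ClaudeBot and stop. Google-Extended is the robots.txt token that governs whether Google can use your content for Gemini training and grounding. Per Google's crawler documentation it has no user-agent string of its own, so those fetches appear in your logs under Google's existing agents, and it does not control inclusion in Google Search. Set your Google-Extended policy deliberately in robots.txt, and watch the Googlebot and GoogleOther rows for the Google side of the signal.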
Building this and not connecting it to the calendar. The pipeline produces signals. The signals need a home in whatever tool your content team actually uses. We chose Notion because that's where our content calendar lives. A spreadsheet works. A Jira board works. What doesn't work is a Slack channel that nobody triages.
Why we're publishing this instead of selling it
This article itself is one of the things the pipeline surfaced. The AI Bot 404 Patterns table flagged that AdsBot-Google-Mobile was probing the URL of a piece we'd drafted but not yet shipped. That meant the URL was leaking somewhere in our infrastructure before the publish window. We tracked the leak (a sitemap publish that ran ahead of the CMS), finished the draft, and now ship both the article AND a playbook on how the article itself was prioritized by an AI bot's curiosity.
That's the meta-loop. The pipeline finds the gap. You write into the gap. The piece itself becomes proof the pipeline works. We document the system rather than gatekeep it because the clients we want are the ones who can read this and decide whether they want help shipping it. The ones who can't won't be the right fit.
For the broader context on how we think about AEO measurement, see our Toku case study (86% Peec visibility on stablecoin-payroll prompts) and TradeMomentum (7x total organic impressions in a vertical category over 6 months).
What it costs to run
Cloudflare API token: free.
Notion database: free tier handles the volume.
Compute: one nightly worker run, sub-second per query, free on most platforms (Cloudflare Workers, Vercel cron, Notion's hosted worker runtime).
Storage: trivial. We're at 7 rows after one day. A site at 100x our citation volume would be at 700 rows after a day, still trivial.
Engineering time to ship the initial version: one focused day. Refining the noise filters and the action-taking workflow took another two days. Total budget for a working system: one engineer-week.
That's the floor. The ceiling depends on how much actionable signal your site produces, which depends on your existing AI retrieval position. A site cited heavily by AI engines gets a lot of probe traffic. A site that doesn't get cited produces a quiet log file. Either way, you learn something.
What you should build first if you only have a weekend
The minimum viable version is three things:
- A cron job that pulls Cloudflare GraphQL data once a day. Bash + curl + a few flags is enough. No worker runtime needed.
- A flat file or Google Sheet you append to. No database needed. The deduplication can wait until you have more than ~50 rows.
- A weekly 30-minute review window where you walk the new rows and bucket them by the four-step decision tree above.
That's it. The Notion managed database, the persistence layer, the worker runtime, the verified-bot category filter, the multi-tenant routing, all of those are nice. None of them are required to start capturing the signal. The signal is already in your logs. The work is just turning it into a queue.
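If the flat file is a CSV, the one non-obvious piece is escaping, because user-agent strings often contain commas. A minimal sketch, assuming rows shaped as `{ path, count, category, userAgent }`; `csvField` and `toCsvLines` are illustrative names and the column order is our own choice:

```javascript
// Quotes a value when it contains a comma, double quote, or newline,
// doubling any embedded quotes as CSV requires.
function csvField(value) {
  const s = String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Turns one day's rows into CSV text, one line per row, prefixed with the pull date.
function toCsvLines(rows, pulledOn) {
  return rows
    .map((r) => [pulledOn, r.path, r.count, r.category, r.userAgent].map(csvField).join(",") + "\n")
    .join("");
}
```

Appending the result with Node's `fs.appendFileSync("ai-bot-404s.csv", toCsvLines(rows, "2026-05-24"))` gives you an append-only log that opens cleanly in any spreadsheet; the file name is a placeholder.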
If you build the weekend version and find no actionable signal, you have a useful answer: your AI retrieval position is too low for log-based signal to compound. Go fix that first. Our free AI audit maps your current retrieval position in 15 minutes, and the SEO + AEO service page covers what we do for clients in the same lane.
The bottom line
Your server logs already contain a list of articles AI engines want you to write. The technology to read them is free, the analysis takes minutes, and the action loop is mechanical. Most agencies aren't doing this because the playbook is new. The ones that do, get cited.
We started ours on May 23, 2026. Day one produced 7 rows, 3 actionable, 2 ready-to-ship redirects, and 1 article (this one). The thing we're most surprised by is how much signal came out of how little setup. The thing we're least surprised by is how few competitors are running an equivalent pipeline.
If you want the working version of this pipeline for your own site, book a 15-minute call below. We'll walk through your logs live, ship one redirect inside the call, and leave you with the GraphQL query and the Notion template. No follow-up sequence.





