Track AI Bot 404s With Cloudflare and Notion

GPTBot, ClaudeBot, PerplexityBot, and Google-Extended are crawling your site daily. The URLs they 404 on are the articles AI engines expect to find but cannot. Pipe Cloudflare bot logs into a Notion database, capture the 404s, and route each into a redirect or a draft. You now have a free AEO content engine running on first-party data, not the probability-based guessing Peec and Profound do from outside. One-day setup. Real signal from day one.

TL;DR

GPTBot, ClaudeBot, PerplexityBot, and Google-Extended are crawling your site right now. The URLs they 404 on are the articles AI engines expect to find but can't. Map those 404s daily, route each into a redirect or a draft, and you have a free AEO content engine. We built ours on Cloudflare and Notion in one day. May 2026.

What's an AI demand engine?

A pipeline that reads your server logs, filters for verified AI bot traffic from agents like GPTBot, ClaudeBot, and PerplexityBot, captures the 404s, and persists them into a queryable database. Every row is a URL an AI engine tried to fetch on your domain. Every 404 row is a URL the model expected to exist on your site but doesn't.

That gap between what the model expects and what you've shipped is the most actionable signal in AEO. Nobody else has it. Peec, Profound, and every other probability-based AEO tool guesses what models think about you by querying them from outside. Your server logs capture the models querying you directly. One is a sample. The other is the ground truth.

For a primer on AEO itself, see our complete guide to Answer Engine Optimization in 2026. Below: the operational layer for teams who already buy the premise.

Which AI bots actually matter in 2026?

The bot population shifted in early 2026. The current shortlist of agents you want to track on any B2B SaaS domain:

User agent	Engine	Signal type	What it tells you
`GPTBot`	OpenAI / ChatGPT	Training crawl	What ChatGPT is learning about your domain for future model snapshots
`OAI-SearchBot`	OpenAI / ChatGPT Search	Real-time retrieval	What ChatGPT pulled when a user asked about your topic
`ChatGPT-User`	OpenAI / ChatGPT	Real-time retrieval	User-triggered fetch during a specific conversation
`ClaudeBot`	Anthropic / Claude	Training crawl	Claude's view of your domain
`Claude-User`	Anthropic / Claude	Real-time retrieval	Triggered when a Claude user asked something your domain might answer
`PerplexityBot`	Perplexity	Real-time retrieval	What Perplexity surfaced for the user's prompt
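`Google-Extended`	Google / Gemini	robots.txt control token	Whether Google may use your content for Gemini training and grounding; it has no user-agent string of its own, so the fetches appear under Google's existing agents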
`CCBot`	Common Crawl	Training corpus	Bulk dataset that downstream LLMs train on
`Applebot-Extended`	Apple Intelligence	Training crawl	Apple's AI surface eligibility
`Bingbot`	Microsoft / Copilot	Search + AI	Bing index + Copilot retrieval
`Meta-ExternalAgent`	Meta AI	Real-time retrieval	What Meta AI pulled on a user query
`AdsBot-Google-Mobile`	Google Ads	Mobile-friendliness probe	Often probes URLs before they ship (useful leakage signal)

Two categories matter for the 404 demand-signal layer specifically. Real-time retrieval bots (OAI-SearchBot, PerplexityBot, Claude-User, Meta-ExternalAgent) 404 on URLs the model just decided to fetch in a live conversation. Training bots (GPTBot, ClaudeBot, CCBot, Google-Extended) 404 on URLs the model believes should exist based on prior training. Both signal content demand. The real-time ones convert faster into traffic; the training ones compound into your AI brand identity for years.
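The split described above can be sketched as a small lookup. This is a minimal illustration, not our production code: the function name `signalType` and the `"other"` fallback are assumptions, and the agent lists simply mirror the ones named in the paragraph.

```javascript
// Maps a raw user-agent string to the signal type described in the text.
// Real-time agents are checked first; anything unmatched falls through to "other".
const SIGNAL_TYPES = [
  { type: "real-time", pattern: /OAI-SearchBot|PerplexityBot|Claude-User|Meta-ExternalAgent/i },
  { type: "training", pattern: /GPTBot|ClaudeBot|CCBot|Google-Extended/i },
];

function signalType(userAgent) {
  const match = SIGNAL_TYPES.find(({ pattern }) => pattern.test(userAgent));
  return match ? match.type : "other";
}
```

Tagging each row this way lets you sort the table so real-time 404s, which convert faster, surface first.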

How does this differ from regular SEO log analysis?

SEO log analysis has been a thing since the 2000s. Botify, Screaming Frog, OnCrawl, JetOctopus all built tools for it. The job was finding crawl-budget waste and broken URLs that hurt Googlebot. The signal was indexation health. The audience was the technical SEO team.

AI log analysis is different in two ways. First, the bot population shifted. ChatGPT-User, PerplexityBot, GoogleOther, Meta-ExternalAgent, Anthropic's ClaudeBot, and Apple's Applebot now generate a real share of traffic on any site that ranks for buyer queries. Second, the signal isn't indexation, it's retrieval. When PerplexityBot fetches your /best-x-comparison URL after a user asked Perplexity "what's the best x," that fetch is the retrieval moment. Logging it gives you ground truth on what was actually retrieved when, for which prompt-shape, by which engine.

Most SEO log tools weren't built for that. Botify and Screaming Frog let you filter by bot user agent but don't reframe 404s as content opportunities. JetOctopus and OnCrawl do better at signaling AI bot patterns but still optimize for crawl health. The pipeline we describe below isn't a substitute for those tools, it's a layer on top: persist the right 404s into a calendar workflow so they become content decisions instead of yet another log-analyzer dashboard.

Why are AI bot 404s the highest-signal layer?

Three reasons.

The model has already done the categorization. When an AI engine fetches a URL, it has already decided this URL is the kind of thing that should exist on your domain for the query it's answering. You don't have to guess if a topic is relevant. The engine guessed for you.

The 404 specifically encodes a gap. A 200 response means you already have the page. A 404 means the engine wanted the page and you don't have it. That's literally latent content demand. The model just told you what to write.

The signal compounds with retrieval position. The more your domain gets cited by AI engines, the more bots probe URLs that don't yet exist. High-retrieval sites get more 404 telemetry because more agents are reading them. The signal scales with the thing you actually want, which is share of answer in AI surfaces.

How do you query Cloudflare for this data?

Cloudflare exposes a GraphQL Analytics API. The dataset you want is httpRequestsAdaptiveGroups. The filters that matter:

verifiedBotCategory_in: restrict to verified AI Crawler, Search Engine Crawler, Advertising bot categories
edgeResponseStatus = 404: only the demand-signal rows
clientRequestPath: group by URL
userAgent: keep for forensics on which bot is asking

A minimal query, scoped to the past 24 hours:

query AIBot404s($zone: string!, $since: string!, $until: string!) {
  viewer {
    zones(filter: { zoneTag: $zone }) {
      httpRequestsAdaptiveGroups(
        limit: 1000
        filter: {
          datetime_geq: $since
          datetime_leq: $until
          edgeResponseStatus: 404
          verifiedBotCategory_in: ["AI Crawler", "Search Engine Crawler", "Advertising & Marketing"]
        }
      ) {
        count
        dimensions {
          clientRequestPath
          userAgent
          verifiedBotCategory
        }
      }
    }
  }
}

Token scope: Zone Analytics Read on the zones you care about. The free tier is fine for a 24-hour window; longer windows need the paid analytics API. Most sites only need a rolling 24-hour window.

You get back a list of paths, the bot categories that hit them, the user agents, and the request counts. That's the input.
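One way to run that query from a script is a plain `fetch` against Cloudflare's GraphQL endpoint. Treat this as a sketch under assumptions: the helper names (`last24Hours`, `flattenGroups`, `fetchAiBot404s`) are ours for illustration, `query` is the GraphQL document shown earlier, and the variable names `zone`, `since`, and `until` match that document.

```javascript
const CLOUDFLARE_GRAPHQL_URL = "https://api.cloudflare.com/client/v4/graphql";

// Builds the rolling 24-hour window as ISO-8601 timestamps.
function last24Hours(now = new Date()) {
  const since = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  return { since: since.toISOString(), until: now.toISOString() };
}

// Flattens the nested GraphQL response into plain rows.
function flattenGroups(body) {
  const zones = body?.data?.viewer?.zones ?? [];
  return zones.flatMap((zone) =>
    (zone.httpRequestsAdaptiveGroups ?? []).map((group) => ({
      path: group.dimensions.clientRequestPath,
      userAgent: group.dimensions.userAgent,
      category: group.dimensions.verifiedBotCategory,
      count: group.count,
    }))
  );
}

// Posts the query with a bearer token and returns flattened rows.
async function fetchAiBot404s({ query, apiToken, zoneTag, now = new Date() }) {
  const { since, until } = last24Hours(now);
  const response = await fetch(CLOUDFLARE_GRAPHQL_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query, variables: { zone: zoneTag, since, until } }),
  });
  if (!response.ok) throw new Error(`Cloudflare API returned HTTP ${response.status}`);
  const body = await response.json();
  if (body.errors?.length) throw new Error(body.errors[0].message);
  return flattenGroups(body);
}
```

Keeping `flattenGroups` separate from the network call makes the parsing step easy to test against a saved response.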

How do you filter signal from noise at the user-agent level?

Verified-bot categories help, but they bucket too broadly. The cleanest approach is a positive-match regex on user agent. The shortlist we run:

const AI_BOT_PATTERNS = [
  /GPTBot/i,                  // OpenAI training
  /OAI-SearchBot/i,           // ChatGPT search retrieval
  /ChatGPT-User/i,            // ChatGPT conversation fetch
  /ClaudeBot/i,               // Anthropic training
  /Claude-User/i,             // Claude conversation fetch
  /Anthropic-AI/i,            // Anthropic alt UA
  /PerplexityBot/i,           // Perplexity
  /Perplexity-User/i,         // Perplexity conversation fetch
  /Google-Extended/i,         // Google robots.txt token; kept defensively, since Google fetches under its existing UAs
  /GoogleOther/i,             // Google misc AI crawl
  /CCBot/i,                   // Common Crawl
  /Applebot-Extended/i,       // Apple Intelligence training
  /Meta-ExternalAgent/i,      // Meta AI
  /Meta-ExternalFetcher/i,    // Meta retrieval
  /Bytespider/i,              // ByteDance / Doubao
];

Drop everything that isn't on this list when computing the demand-signal table. Keep raw user-agent strings in a separate column for forensics. Update the list quarterly; new agents appear and old ones get renamed.
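Applying the list is a one-pass filter. A minimal sketch, assuming rows shaped like `{ path, userAgent, count }` (that shape and the names `matchAiBot` and `keepAiBotRows` are illustrative); the pattern list is abbreviated here to keep the example short.

```javascript
// Abbreviated copy of the pattern list, for illustration only.
const AI_BOT_PATTERNS = [/GPTBot/i, /OAI-SearchBot/i, /ClaudeBot/i, /PerplexityBot/i];

// Returns the matched bot name, or null when the user agent is not on the list.
function matchAiBot(userAgent) {
  const hit = AI_BOT_PATTERNS.find((pattern) => pattern.test(userAgent));
  return hit ? hit.source : null;
}

// Keeps only AI-bot rows and adds a `bot` column; the raw user agent stays intact for forensics.
function keepAiBotRows(rows) {
  return rows.flatMap((row) => {
    const bot = matchAiBot(row.userAgent);
    return bot ? [{ ...row, bot }] : [];
  });
}
```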

How do you persist the signal into a usable shape?

The query returns a 24-hour snapshot. To turn that into a content engine, you need to deduplicate across days, accumulate request counts, and timestamp the last-seen date per URL. A simple composite key (for example, domain and path joined as domain:path) upserts cleanly into any KV store, database, or, in our case, a Notion database.
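The accumulate-and-dedupe step can be sketched with an in-memory `Map` before any real store is involved. The `${domain}:${path}` key format, the function name `mergeSnapshot`, and the row shape are assumptions for illustration.

```javascript
// Merges one daily snapshot into a store keyed by `${domain}:${path}`.
// Counts accumulate; the last-seen fields only move forward in time.
function mergeSnapshot(store, rows, { domain, seenOn }) {
  for (const row of rows) {
    const key = `${domain}:${row.path}`;
    const existing = store.get(key);
    // ISO date strings compare correctly as plain strings.
    const isNewer = !existing || seenOn >= existing.lastSeen;
    store.set(key, {
      path: row.path,
      totalRequests: (existing?.totalRequests ?? 0) + row.count,
      lastSeen: isNewer ? seenOn : existing.lastSeen,
      lastCategory: isNewer ? row.category : existing.lastCategory,
      lastUserAgent: isNewer ? row.userAgent : existing.lastUserAgent,
    });
  }
  return store;
}
```

Because the merge is a pure function over a `Map`, the same logic can sit in front of a KV store, a SQL table, or a Notion sync without changing.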

We use Notion because the content team already lives there. Every captured 404 lands in an AI Bot 404 Patterns table with six columns: Path, URL, Total Requests, Last Seen Date, Last Bot Category, Last User-Agent. New URLs append. Existing URLs update their counters and last-seen. We run the sync nightly via a scheduled worker.

The Notion side is a single POST to the API. Sketch:

await notion.pages.create({
  parent: { database_id: AI_BOT_404_PATTERNS_DB_ID },
  properties: {
    "Path": { title: [{ text: { content: row.path } }] },
    "URL": { url: `https://${domain}${row.path}` },
    "Total Requests": { number: row.count },
    "Last Seen Date": { date: { start: row.lastSeen } },
    "Last Bot Category": { rich_text: [{ text: { content: row.category } }] },
    "Last User-Agent": { rich_text: [{ text: { content: row.userAgent } }] },
  },
});

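For an existing row, first query the database for the page whose Path matches, then call pages.update with that page's ID and the new counters (pages.update is keyed on the page ID, not the URL). The whole persistence layer is roughly 80 lines of code. The longest part is the upsert logic. The shortest part is the Cloudflare query.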
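A find-then-write upsert might look like the sketch below. It assumes a Notion JavaScript client passed in as `notion` that exposes `databases.query`, `pages.create`, and `pages.update`; depending on your SDK and API version, the query call may live under a different namespace, so check before copying. The helper names `buildProperties` and `upsert404Row` are illustrative.

```javascript
// Builds the property payload for one 404 row, using the column names from the table above.
function buildProperties(row, domain, totalRequests) {
  return {
    "Path": { title: [{ text: { content: row.path } }] },
    "URL": { url: `https://${domain}${row.path}` },
    "Total Requests": { number: totalRequests },
    "Last Seen Date": { date: { start: row.lastSeen } },
    "Last Bot Category": { rich_text: [{ text: { content: row.category } }] },
    "Last User-Agent": { rich_text: [{ text: { content: row.userAgent } }] },
  };
}

// Looks the row up by Path, then creates it or adds the new count to the stored total.
async function upsert404Row(notion, { databaseId, domain }, row) {
  const found = await notion.databases.query({
    database_id: databaseId,
    filter: { property: "Path", title: { equals: row.path } },
    page_size: 1,
  });
  const existing = found.results[0];
  if (!existing) {
    await notion.pages.create({
      parent: { database_id: databaseId },
      properties: buildProperties(row, domain, row.count),
    });
    return "created";
  }
  const previous = existing.properties["Total Requests"]?.number ?? 0;
  await notion.pages.update({
    page_id: existing.id,
    properties: buildProperties(row, domain, previous + row.count),
  });
  return "updated";
}
```

Passing the client in as an argument keeps the function testable with a stub and avoids hard-coding credentials.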

What does day-one data actually look like?

This is the part most playbooks skip. Real data is messier than tutorials suggest. Here's the literal first run from loudface.co on May 24, 2026:

Path	Hits	Bot category	Implied query	Action taken
`/post/webflow-and-auth0-guide`	1	Search Engine Crawler (Bingbot)	Old URL pattern indexed	Shipped catch-all 301 from `/post/:slug` to `/blog/:slug`
`/blog/cms-for-marketers-2026`	1	AI Crawler (PetalBot)	"Best CMS for marketers 2026" (slug guess)	Shipped specific 301 to canonical slug
`/blog/seo-traffic-not-converting-pipeline`	1	Advertising (AdsBot-Google-Mobile)	Mobile-ad-eligibility probe on draft URL	Drafted full article, queued for ship
`/news-sitemap.xml`	1	AI Crawler (Meta-ExternalAgent)	News content discovery	Deferred (we don't publish news)
`/security.txt`	2	Search Engine Crawler (Dataprovider)	Standard security-contact probe	Ignored
`/.well-known/security.txt`	2	Search Engine Crawler (Dataprovider)	Standard security-contact probe	Ignored
`/humans.txt`	1	Search Engine Crawler (Dataprovider)	Standard humans-file probe	Ignored

Seven rows. Three actionable. Three junk. One deferred.

The actionable rate on day one was 43%. That's higher than we expected, and we expect it to climb as the model corpus updates. A site that ranks well in AI engines gets more probes than a site that doesn't, so the absolute volume of useful 404 signal compounds with your retrieval position.

What we did with those three:

/post/webflow-and-auth0-guide: shipped a catch-all 301 from /post/:slug to /blog/:slug within 4 hours. Closed every legacy URL still in any AI engine's index. Estimated time: 15 minutes including QA.
/blog/cms-for-marketers-2026: shipped a specific 301 to the actual slug we published the piece under (/blog/webflow-best-cms-for-marketers). Estimated time: 5 minutes.
/blog/seo-traffic-not-converting-pipeline: the article was already drafted. Google's ad bot probing the slug before publish was a strong signal that we'd left the URL exposed somewhere (likely a preview environment or sitemap leak). We finished the draft and queued it for ship.

Three signals captured. Three actions taken. Total operator time under an hour. The pipeline runs itself nightly from there.
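For reference, the two redirects above could be expressed in a Next.js config along these lines. This is a sketch, not our exact config: it assumes a Next.js site, and it uses `statusCode: 301` because Next.js maps `permanent: true` to a 308 rather than a 301.

```javascript
// Redirect rules in the shape Next.js expects from redirects().
const redirectRules = [
  // Specific fix: the slug a bot guessed, pointed at the slug we actually published.
  {
    source: "/blog/cms-for-marketers-2026",
    destination: "/blog/webflow-best-cms-for-marketers",
    statusCode: 301,
  },
  // Catch-all: every legacy /post/:slug URL resolves to /blog/:slug.
  { source: "/post/:slug", destination: "/blog/:slug", statusCode: 301 },
];

const nextConfig = {
  async redirects() {
    return redirectRules;
  },
};
// In next.config.js or next.config.ts, export nextConfig as the module's default export.
```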

How do you turn every 404 into an action without a human in the loop?

You can't, fully. But you can structure the table so the human decision is trivial. We use this decision tree, applied weekly when reviewing the table:

1. Is the URL a junk probe (/humans.txt, /security.txt, random path attacks)? → Ignore. Optional: add a filter to suppress in the sync.
2. Is the URL an old slug that should resolve to a current page? → Ship a 301 redirect. Single-line change in next.config.ts or your routing layer. Done in minutes.
3. Is the URL one you could plausibly publish? → Add it as an Idea row in your content calendar. The bot is telling you what to write.
4. Is the URL something weird (XML sitemap variants, well-known files, vendor probes)? → Decide once whether to add it. Sites with news content add news-sitemap.xml. Most don't.

The decision tree fits on a sticky note. The point is to keep human attention on the only step that requires judgment: bucket 3. Everything else is mechanical.
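The mechanical parts of the tree can be pre-sorted in code so the weekly review starts from a suggestion. A rough first pass, with assumptions stated up front: the pattern lists come from our day-one paths, the `/post/` prefix is our legacy slug pattern, and `suggestBucket` is an illustrative name. A human still confirms every row, especially the "idea" bucket.

```javascript
const JUNK_PROBES = [/^\/humans\.txt$/, /^\/security\.txt$/, /^\/\.well-known\/security\.txt$/];
const LEGACY_PREFIXES = ["/post/"]; // assumption: the old slug pattern on our site
const ODD_FILES = [/sitemap[^/]*\.xml$/i, /^\/\.well-known\//];

// Suggests a bucket for a 404 path. The "odd files" check runs before the
// fallthrough so that "idea" only receives paths nothing else claimed.
function suggestBucket(path) {
  if (JUNK_PROBES.some((pattern) => pattern.test(path))) return "ignore";
  if (LEGACY_PREFIXES.some((prefix) => path.startsWith(prefix))) return "redirect";
  if (ODD_FILES.some((pattern) => pattern.test(path))) return "decide-once";
  return "idea";
}
```

Rule-based sorting will sometimes suggest "idea" for a path that is really a mistyped slug needing a redirect, which is why the suggestion is only a starting point.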

Why server logs beat probability-based AEO tools

The probability-based AEO tools (Peec, Profound, AirOps, BrandRank) work by querying LLMs from their servers, looking at the responses, and reverse-engineering what got cited. The output is a probability distribution: "we estimate ChatGPT cites loudface.co 14% of the time for this prompt."

That's useful for tracking trends. It's not useful for telling you what to write next, because the signal is downstream of retrieval. You see what the engine decided AFTER it decided. You don't see what the engine TRIED to retrieve and failed.

Server logs invert the angle. They capture what the engine actively tried to fetch from your domain, in real time, with the user-agent and timestamp as primary keys. You see the engine's intent before it produces an output. The 404s are the engine's intent meeting a missing page.

Both signals matter. We run both. But if you have to pick one to start with, server logs are higher-fidelity, lower-cost, and require no third-party subscription. The data is sitting in your hosting provider's logs right now.

Common mistakes when running this pipeline

Five failure modes we've watched competitors hit while shipping their own log-analysis tooling. Worth pre-empting.

Treating all 404s as crawl errors. This is the framing every traditional SEO log tool defaults to. A 404 is a redirect-or-fix problem in the old model. In the AI-bot context, a 404 from PerplexityBot is an unfilled query. Don't redirect to your homepage. Don't 410 it. Triage it.

Relying only on Cloudflare's bot category. The verified-bot bucket lumps GPTBot, CCBot, ClaudeBot, and PerplexityBot together as "AI Crawler" but you need finer-grained signal. Always keep the raw user-agent. The category is for prefiltering, not analysis.

Running this on too short a window. A 24-hour pull is fine for daily review, but the action surface depends on accumulated signal. Hold the data for 90 days minimum. We watched one team triage 404s as junk because volume was low on day three; by day forty those same paths had 30+ hits and were unambiguous content opportunities.

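Misreading Google-Extended. Most posts on this topic mention GPTBot and ClaudeBot and stop. Google-Extended is a robots.txt control token, not a separate crawler: it governs whether Google may use your content for Gemini training and grounding, and Google states that it does not affect inclusion in Search. Blocking it therefore does not opt you out of AI Overviews, which are part of Search and follow Googlebot and the usual snippet controls. Decide on it deliberately, and look for Google's AI-related fetches under its existing user agents.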

Building this and not connecting it to the calendar. The pipeline produces signals. The signals need a home in whatever tool your content team actually uses. We chose Notion because that's where our content calendar lives. A spreadsheet works. A Jira board works. What doesn't work is a Slack channel that nobody triages.

Why we publish this rather than gatekeep it

One row in yesterday's AI Bot 404 Patterns table flagged that AdsBot-Google-Mobile was probing a URL nobody had shipped yet. The slug existed in our sitemap because a publish job had run ahead of the CMS, leaking the URL into Google's crawl before the page was live. We tracked the leak, fixed the sitemap, and finished the draft the bot had been hunting for.

That row was the proof. The pipeline finds the gap. You write into the gap. The page that filled the gap is the one you're reading. We document the system rather than gatekeep it because the clients we want are the ones who can read a playbook like this and decide whether they want help running it. The ones who can't won't be the right fit.

For the broader context on how we think about AEO measurement, see our Toku case study (86% Peec visibility on stablecoin-payroll prompts) and TradeMomentum (7x total organic impressions in a vertical category over 6 months).

What it costs to run

Cloudflare API token: free.

Notion database: free tier handles the volume.

Compute: one nightly worker run, sub-second per query, free on most platforms (Cloudflare Workers, Vercel cron, Notion's hosted worker runtime).

Storage: trivial. We're at 7 rows after one day. A site at 100x our citation volume would be at 700 rows after a day, still trivial.

Engineering time to ship the initial version: one focused day. Refining the noise filters and the action-taking workflow took another two days. Total budget for a working system: one engineer-week.

That's the floor. The ceiling depends on how much actionable signal your site produces, which depends on your existing AI retrieval position. A site cited heavily by AI engines gets a lot of probe traffic. A site that doesn't get cited produces a quiet log file. Either way, you learn something.

What you should build first if you only have a weekend

The minimum viable version is three things:

A cron job that pulls Cloudflare GraphQL data once a day. Bash + curl + a few flags is enough. No worker runtime needed.
A flat file or Google Sheet you append to. No database needed. The deduplication can wait until you have more than ~50 rows.
A weekly 30-minute review window where you walk the new rows and bucket them by the four-step decision tree above.

That's it. The Notion managed database, the persistence layer, the worker runtime, the verified-bot category filter, the multi-tenant routing, all of those are nice. None of them are required to start capturing the signal. The signal is already in your logs. The work is just turning it into a queue.
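If you would rather keep the weekend version in JavaScript than Bash, the flat-file step is a few lines. This sketch assumes Node.js and rows shaped like `{ path, count, category, userAgent }`; the function names and the column order are illustrative.

```javascript
// Quotes a CSV field only when it contains a comma, quote, or newline.
function csvField(value) {
  const text = String(value);
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// One CSV line per 404 row, prefixed with the date the data was pulled.
function toCsvLine(row, pulledOn) {
  return [pulledOn, row.path, row.count, row.category, row.userAgent].map(csvField).join(",");
}

// Appends the day's rows to a flat file. No deduplication; that can wait.
async function appendRows(filePath, rows, pulledOn) {
  const { appendFile } = await import("node:fs/promises");
  const lines = rows.map((row) => toCsvLine(row, pulledOn)).join("\n");
  if (lines) await appendFile(filePath, lines + "\n", "utf8");
}
```

The resulting file opens directly in Google Sheets for the weekly review.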

If you build the weekend version and find no actionable signal, you have a useful answer: your AI retrieval position is too low for log-based signal to compound. Go fix that first. Our free AI audit maps your current retrieval position in 15 minutes, and the SEO + AEO service page covers what we do for clients in the same lane.

The bottom line

Your server logs already contain a list of articles AI engines want you to write. The technology to read them is free, the analysis takes minutes, and the action loop is mechanical. Most agencies aren't doing this because the playbook is new. The ones that do, get cited.

We started ours on May 23, 2026. Day one produced 7 rows, 3 actionable, 2 ready-to-ship redirects, and 1 article (this one). The thing we're most surprised by is how much signal came out of how little setup. The thing we're least surprised by is how few competitors are running an equivalent pipeline.

If you want the working version of this pipeline for your own site, book a 15-minute call below. We'll walk through your logs live, ship one redirect inside the call, and leave you with the GraphQL query and the Notion template. No follow-up sequence.


Frequently Asked Questions

Are AI bot 404s a Google ranking factor?

No, not directly. Google's ranking systems don't penalize you for 404s served to unrelated AI bots. Note that Google-Extended is a robots.txt token with no user-agent string of its own, so you won't see 404s attributed to it in your logs; Google's fetches appear under its existing agents such as Googlebot. Track those separately from the third-party AI bots.

How is this different from Botify, Screaming Frog, or JetOctopus?

Those tools were built for crawl-budget analysis. They let you filter by user agent and surface 404s in dashboards, but they don't reframe the 404s as content opportunities or push them into a calendar workflow. The pipeline described here is a lightweight layer on top: keep using your log analyzer for technical SEO, add this for content prioritization. They don't conflict.

What tools do I need besides Cloudflare and Notion?

Nothing strictly required. Cloudflare gives you the GraphQL API. Notion gives you the calendar surface. If you're not on Cloudflare, swap in your hosting provider's log access (Vercel, Fastly, Netlify, AWS CloudWatch). If you're not on Notion, swap in any database or spreadsheet your team already uses. The pattern is the same: filter, persist, review.

Can I do this without a paid plan?

Yes. Cloudflare's free GraphQL API covers 24-hour windows on any zone (including free-tier zones). Notion's free workspace handles thousands of rows. The whole pipeline runs on free tiers for any small site. Paid plans only become necessary if you want longer log retention (Cloudflare's paid analytics API) or higher Notion limits (large teams).

How much traffic do AI bots actually generate?

A small fraction of total traffic on most sites. The point isn't the volume, it's the signal density. A single PerplexityBot 404 on a URL you don't have tells you more about content demand than a thousand human search impressions on a page you do have. Volume is a vanity number. Signal-per-event is what matters.

Does this work without Cloudflare?

Yes. Any reverse proxy or hosting platform that exposes logs works. Vercel has access logs. Fastly has real-time logs. Netlify exposes them. AWS has CloudWatch. The pattern is the same: filter by user agent and response status, persist, review weekly. Cloudflare just happens to have the cleanest verified-bot taxonomy and a free GraphQL API.
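On platforms that hand you raw access logs, the filter step is a line parser. The sketch below assumes the common "combined" log format; your provider's format may differ, so treat the regex as a starting point. The function names and the abbreviated bot pattern are illustrative.

```javascript
// Matches one line of the "combined" access log format.
const COMBINED_LOG = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"$/;

// Parses a log line into fields, or returns null when the line does not match.
function parseLogLine(line) {
  const m = COMBINED_LOG.exec(line.trim());
  if (!m) return null;
  return { ip: m[1], time: m[2], method: m[3], path: m[4], status: Number(m[5]), userAgent: m[6] };
}

// Keeps only 404s whose user agent matches the bot pattern (abbreviated here).
function aiBot404s(lines, botPattern = /GPTBot|ClaudeBot|PerplexityBot/i) {
  return lines
    .map(parseLogLine)
    .filter((entry) => entry && entry.status === 404 && botPattern.test(entry.userAgent));
}
```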

What's the next thing to build on top of this?

Two extensions worth considering. First, a similar log capture for 200 responses on AI bot traffic, so you can map which of your pages are being retrieved and by which engine. That's the "what's working" view. Second, a sentiment classifier that takes each retrieved page and asks ChatGPT how it would describe loudface based on that page. That's the "what is the model learning about us" view. Both are second-order. The 404 layer pays back first.
