The 38,000-to-1 gap between what AI crawls and what it cites.

By Ridho Putradi S'Gara · Jun 26, 2026 · 5 min read

// table_of_contents

1. Crawling is only the top of the funnel
2. A crawl and a citation come from different bots
3. Retrieval runs as a pipeline, and every stage sheds sources
4. The citation gate is mostly E-E-A-T
5. What gets cited is not the same set of pages that rank
6. Where the work actually is

Last year, Anthropic's crawlers pulled around 38,000 pages for every single visitor they sent back to a publisher, at least going by Cloudflare's numbers. The gap narrowed through early 2026, but ClaudeBot was still crawling close to 24,000 pages for every referral it returned. You can read that ratio as the whole economics of AI search compressed into one figure, where the bots take an enormous amount and hand back almost nothing you can actually see in your analytics.
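To make that arithmetic concrete, here is a minimal sketch. The two ratios are the ones quoted above; the monthly crawl volume is a made-up figure used only for illustration.

```python
def expected_referrals(pages_crawled: int, crawls_per_referral: int) -> float:
    """Referrals implied by a crawl volume at a given crawl-to-referral ratio."""
    return pages_crawled / crawls_per_referral


def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new (negative means the ratio narrowed)."""
    return (new - old) / old


# Ratios quoted in the text (pages crawled per referral returned).
RATIO_LAST_YEAR = 38_000
RATIO_EARLY_2026 = 24_000

# Hypothetical: a site whose pages were fetched 500,000 times in a month.
monthly_crawls = 500_000

print(round(expected_referrals(monthly_crawls, RATIO_LAST_YEAR), 1))   # 13.2
print(round(expected_referrals(monthly_crawls, RATIO_EARLY_2026), 1))  # 20.8
print(f"{relative_change(RATIO_LAST_YEAR, RATIO_EARLY_2026):.0%}")     # -37%
```

Even after the ratio narrows by roughly a third, half a million fetches would translate into about twenty visits.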

Most of the SEO world stops at that ratio and calls it theft, which is a fair enough gut reaction. The more interesting question is what happens to a page in the long stretch between the moment a bot fetches it and the rare moment an AI answer actually cites it. That stretch is a funnel with several stages, and almost nobody has bothered to map it.

Crawling is only the top of the funnel

Getting crawled matters, but it is nowhere near enough on its own. A page can be fetched, parsed, and stored in full and still never show up in a single answer, because the drop-off starts long before anyone decides what to cite.

The first losses happen before retrieval even kicks in. By one analysis, around 60% of ChatGPT queries get answered straight from the model's memory with no live search at all, which means there is no source to win and nothing for optimization to touch. So a large share of queries has already routed around every page on the open web before your content even enters the picture. What is left are the queries where the model does go looking. Those are the ones worth fighting for, and the fight runs through the stages below.

A crawl and a citation come from different bots

This is the part most teams miss. The bot that trains a model and the bot that fetches a page to cite are usually not the same agent at all. Anthropic, for example, splits its crawling across separate agents, so ClaudeBot is the training scraper while Claude-User (called Claude-Web in older guides) is the live-retrieval agent that pulls pages to answer a Claude.ai user in the moment. OpenAI runs the same split with GPTBot on the training side and OAI-SearchBot plus ChatGPT-User on retrieval, and PerplexityBot sits on the retrieval side too.
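One way to see the split in your own logs is to bucket user-agent strings by role. The sketch below assumes a simple substring match on the agent token. The role table follows the description above, the sample user-agent strings are illustrative rather than copied from real traffic, and token names should be checked against each vendor's current documentation.

```python
from collections import Counter

# Token -> role, following the split described in the text.
# Assumption: vendors rename agents over time, so verify these tokens.
AGENT_ROLES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",
    "PerplexityBot": "retrieval",
    "Claude-User": "retrieval",  # token in Anthropic's current docs (verify)
    "Claude-Web": "retrieval",   # older name that still appears in guides
}


def classify_user_agent(user_agent: str) -> str:
    """Return 'training', 'retrieval', or 'other' for a raw user-agent string."""
    ua = user_agent.lower()
    for token, role in AGENT_ROLES.items():
        if token.lower() in ua:
            return role
    return "other"


# Illustrative user-agent strings, not real log lines.
sample_log_agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
]

tally = Counter(classify_user_agent(ua) for ua in sample_log_agents)
print(dict(tally))
```

A tally like this shows quickly whether the retrieval agents are reaching you at all, which is the precondition for everything further down the funnel.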

That distinction has real consequences for your robots.txt. You can block the training scrapers if you want, but if you also block the retrieval agents you have removed yourself from the exact pipeline that produces citations. Cubitrek's advice is to let the live-retrieval agents through and pair the file with an llms.txt, so a well-behaved agent can grab what it needs in a single fetch rather than crawling 200 pages to find it. It is worth sorting out your bot rules before you touch anything else on this list.
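A robots.txt along those lines can be checked before it ships. The sketch below uses Python's standard-library `urllib.robotparser` to parse a sample file and confirm that training scrapers are blocked while retrieval agents get through. The file contents, the example.com URL, and the exact agent tokens are assumptions for illustration, not a recommended production config.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block training scrapers, allow retrieval agents.
# Agent tokens are assumptions; confirm them in each vendor's documentation.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: Claude-User
User-agent: Claude-Web
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

TEST_URL = "https://example.com/guide"  # placeholder URL

for agent in ["GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]:
    verdict = "allowed" if parser.can_fetch(agent, TEST_URL) else "blocked"
    print(f"{agent:15s} {verdict}")
```

Running a check like this in CI catches the common mistake of a blanket `Disallow` that quietly removes the retrieval agents along with the training ones.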

Retrieval runs as a pipeline, and every stage sheds sources

Once a query does trigger a search, getting from query to cited answer is not one lookup but a whole sequence of steps. Am I Cited breaks it into phases that begin with intent parsing and query fan-out, then move through evidence extraction and entity linking before weighting, synthesis, and the final response, while ZipTie describes a similar four-stage version covering fan-out, chunking and retrieval, passage selection, and attribution.
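As a rough mental model, the pipeline can be sketched as a chain of filters that records where each source falls out. The stage names loosely follow the four-stage description above; the `Source` fields and the pass/fail predicates are hypothetical stand-ins, since real engines do not expose these decisions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Source:
    url: str
    matches_subquery: bool    # hypothetical: answers at least one fan-out query
    chunk_stands_alone: bool  # hypothetical: passage reads well in isolation
    trusted: bool             # hypothetical: clears the attribution filter


# Ordered stages; each predicate says whether a source survives that stage.
STAGES: list[tuple[str, Callable[[Source], bool]]] = [
    ("fan-out", lambda s: s.matches_subquery),
    ("passage selection", lambda s: s.chunk_stands_alone),
    ("attribution", lambda s: s.trusted),
]


def run_funnel(sources: list[Source]) -> tuple[list[Source], dict[str, str]]:
    """Return surviving sources and a map of url -> stage where it was dropped."""
    survivors = list(sources)
    dropped_at: dict[str, str] = {}
    for name, keep in STAGES:
        next_round = []
        for source in survivors:
            if keep(source):
                next_round.append(source)
            else:
                dropped_at[source.url] = name
        survivors = next_round
    return survivors, dropped_at


crawled = [
    Source("example.com/a", True, True, True),
    Source("example.com/b", False, True, True),
    Source("example.com/c", True, False, True),
    Source("example.com/d", True, True, False),
]

cited, dropped = run_funnel(crawled)
print([s.url for s in cited])
print(dropped)
```

All four pages were crawled, yet only one is cited, and each of the other three was lost at a different stage.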

Two of those stages do most of the damage, and fan-out is the first. A single user query gets expanded into a dozen or so sub-queries, so a page that answers the literal question but none of the expansions never even makes the shortlist.

Chunking is the second, and it is a little counterintuitive: RAG systems do not retrieve your page so much as fragments of it, which some practitioners have started calling "fraggles." Your content has to make sense when it is pulled out as a standalone chunk of 50 to 150 words, and clean section boundaries reportedly improve semantic relevance by 9 to 15% in vector space. That is why a passage that only works alongside the three paragraphs above it tends to lose right here. Every one of these phases is somewhere to fall out: a page can clear the crawl, clear fan-out, and still get cut at passage selection simply because the chunk did not hold up on its own.
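The chunking point can be checked mechanically, at least roughly. The sketch below splits a draft on blank lines, counts words per chunk against the 50 to 150 word range quoted above, and flags chunks that open with a phrase leaning on earlier context. The paragraph-based splitting and the list of "dependent" openers are assumptions made for illustration; real retrieval systems chunk in their own ways.

```python
import re

MIN_WORDS, MAX_WORDS = 50, 150  # range quoted in the text

# Hypothetical heuristic: openers that usually point back at earlier paragraphs.
DEPENDENT_OPENERS = ("this ", "that ", "these ", "it ", "as mentioned", "as noted")


def audit_chunks(text: str) -> list[dict]:
    """Split text on blank lines and report size and self-containment per chunk."""
    chunks = [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]
    report = []
    for chunk in chunks:
        word_count = len(chunk.split())
        report.append({
            "preview": chunk[:40],
            "words": word_count,
            "in_range": MIN_WORDS <= word_count <= MAX_WORDS,
            "depends_on_context": chunk.lower().startswith(DEPENDENT_OPENERS),
        })
    return report


# Toy draft: one 60-word paragraph, then a short one that leans on it.
draft = ("Retrieval agents fetch pages on demand. " * 10).strip() + \
    "\n\nThis works because of the point above."

for row in audit_chunks(draft):
    print(row)
```

A check like this is crude, but it surfaces the passages most likely to fall apart once they are lifted out of the page.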

The citation gate is mostly E-E-A-T

Even if you survive retrieval, there is one more filter deciding who actually gets named in the answer. A Wellows analysis of 2,400 citations found that 96% of AI Overview citations come from sources with strong E-E-A-T signals, and the more telling part is how that filter behaves rather than the headline number itself. E-E-A-T here acts less like a gentle ranking nudge and more like a threshold you either clear or you do not, to the point where pages ranking sixth to tenth with strong trust signals get cited 2.3 times more often than a number-one page with weak ones. Whatever position you hold on the old search results page stops protecting you once you reach this stage.
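The difference between a threshold and a ranking nudge is easy to show in a few lines. In this sketch, candidates are first filtered by a trust cutoff and only then ordered by organic rank. The `trust` score, the 0.7 cutoff, and the sample pages are all hypothetical, since no engine publishes a numeric E-E-A-T score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    url: str
    organic_rank: int  # 1 = top organic position
    trust: float       # hypothetical 0..1 stand-in for E-E-A-T signals


TRUST_THRESHOLD = 0.7  # hypothetical cutoff


def cite_with_gate(candidates: list[Candidate], limit: int = 3) -> list[Candidate]:
    """Apply the trust gate first, then order the survivors by organic rank."""
    eligible = [c for c in candidates if c.trust >= TRUST_THRESHOLD]
    return sorted(eligible, key=lambda c: c.organic_rank)[:limit]


pool = [
    Candidate("top-ranked.example/page", organic_rank=1, trust=0.4),
    Candidate("mid-ranked.example/page", organic_rank=7, trust=0.9),
    Candidate("low-ranked.example/page", organic_rank=9, trust=0.8),
]

print([c.url for c in cite_with_gate(pool)])
```

Under a gate, the number-one page with weak signals never reaches the ordering step, so its rank has nothing left to protect.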

What gets cited is not the same set of pages that rank

The evidence that AI citation is its own game, separate from organic search, is now sitting in the academic literature. A large empirical study across 55,936 queries on six LLM search engines and two traditional ones found that 37% of the domains LLM engines cite are unique to them, never surfaced by Google or Bing for the same query. Practitioner data tells the same story: only about 12% of AI-cited URLs appear in Google's top 10 for the matching query. The strongest predictor of getting cited turned out to be brand search volume rather than any technical signal, while backlinks showed weak to neutral correlation. The takeaway is that you cannot assume your existing rankings carry over, because for the most part they simply do not.
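You can run a small version of that overlap measurement on your own query set. The sketch below takes the URLs an AI engine cited per query and the organic top 10 for the same query, then reports what share of cited URLs also rank. The queries and URLs are invented, and how you collect either list is left open.

```python
def top10_overlap(cited: dict[str, list[str]], top10: dict[str, list[str]]) -> float:
    """Share of AI-cited URLs that also appear in the organic top 10 for the same query."""
    total = 0
    hits = 0
    for query, urls in cited.items():
        ranked = set(top10.get(query, []))
        for url in urls:
            total += 1
            hits += url in ranked
    return hits / total if total else 0.0


# Hypothetical sample data for two queries.
cited_by_ai = {
    "best crm": ["a.example/x", "b.example/y", "c.example/z"],
    "crm pricing": ["a.example/p", "d.example/q"],
}
organic_top10 = {
    "best crm": ["a.example/x", "e.example/home"],
    "crm pricing": ["f.example/pricing"],
}

print(f"{top10_overlap(cited_by_ai, organic_top10):.0%}")  # 20%
```

A low number on your own queries is a direct sign that rank tracking alone is missing most of what gets cited.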

Where the work actually is

None of this means the ratio at the top of the page is a tax you pay for nothing. It is a funnel with named stages, and every stage is something you can act on. Start with your bot rules in robots.txt, letting the retrieval agents through and adding an llms.txt so the agents you want can fetch cleanly. Write so that individual passages stand on their own, since the chunk is what gets retrieved rather than the whole page. From there, build topical depth across a full cluster of related queries instead of betting everything on one hero page, because fan-out rewards breadth, and treat E-E-A-T as the real threshold it is rather than a nice-to-have. Then measure what now matters, which is citations and referral traffic set against crawl cost, rather than the rankings you no longer control.
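For the measurement step, a per-vendor scorecard is one possible starting point. This sketch assumes you can assemble a list of events, each tagged with a vendor and a kind (`crawl`, `referral`, or `citation`), from server logs, analytics, and whatever citation monitoring you use. The event format and the sample data are hypothetical.

```python
from collections import defaultdict


def scorecard(events: list[tuple[str, str]]) -> dict[str, dict]:
    """Tally crawls, referrals, and citations per vendor and derive crawls per referral.

    Each event is (vendor, kind) with kind in {'crawl', 'referral', 'citation'}.
    """
    counts = defaultdict(lambda: {"crawl": 0, "referral": 0, "citation": 0})
    for vendor, kind in events:
        counts[vendor][kind] += 1

    rows = {}
    for vendor, c in counts.items():
        # No referrals yet means the ratio is undefined, so report None.
        ratio = c["crawl"] / c["referral"] if c["referral"] else None
        rows[vendor] = {**c, "crawls_per_referral": ratio}
    return rows


# Hypothetical week of events.
events = (
    [("anthropic", "crawl")] * 4
    + [("anthropic", "referral")]
    + [("openai", "crawl")] * 3
    + [("openai", "citation")]
)

for vendor, row in scorecard(events).items():
    print(vendor, row)
```

Tracked week over week, the same table shows whether changes to bot rules or passage structure are moving citations and referrals, or only crawl volume.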

The 38,000-to-1 figure makes a clean headline, but the real story is the funnel underneath it. Most of that funnel sits unoptimized today because almost nobody has looked closely at the middle, which is exactly where the opportunity is.