How Crawl Budget Works on Large Sites in 2026

For most sites, crawl budget never becomes a problem worth thinking about. Search engines find new pages, refresh the ones they already know, and move on without any help from you. The picture changes once a site gets large. Deep catalogs, filtered navigation, and fast content churn all pull against how much a crawler is willing to fetch, and pages start slipping through the cracks. New URLs sit undiscovered for weeks, important edits take days to register, and whole sections quietly fall behind in the index while everything looks fine from the front end.

That lag is almost never a content problem. It is a crawl efficiency problem, with its own mechanics that are worth understanding before you touch a single robots.txt line. The topic tends to get flattened into one of two unhelpful takes, where crawl budget is either an enterprise-only concern that everyone else can ignore or a magic explanation for every ranking drop. The reality is more ordinary and more useful, and it starts with what Google actually documents.

There is one more reason to get this right in 2026. Google is no longer the only crawler that matters to your results. Bingbot feeds the Bing index that grounds Copilot and a large share of ChatGPT's web answers, and the AI crawlers from OpenAI, Anthropic, and Perplexity are now hitting your origin alongside it. Crawl efficiency has stopped being a Google-only concern.

What crawl budget actually is

Google describes crawl budget as the set of URLs that Googlebot can crawl and wants to crawl on a given site. Two forces set that number, and they work independently. One is a ceiling on how hard the crawler is willing to push your server. The other is how much the crawler wants your pages in the first place.

Crawl capacity

Capacity is a politeness limit. Google watches how many simultaneous connections your server can absorb before it slows down or starts throwing errors, and adjusts accordingly. Fast responses and low 5xx rates let the ceiling rise. When latency climbs or the origin starts returning 429s and 500s, Googlebot eases off to avoid knocking the site over. This is self-protective behavior on Google's side, not a penalty aimed at you.

The practical takeaway is that server health sits upstream of every other crawl decision. If your time to first byte is erratic, no amount of robots.txt tuning will buy back the crawl you are losing to slow responses.
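As a rough way to keep an eye on those two signals, the sketch below summarises the error rate and 95th-percentile time to first byte for crawler requests. The Hit record and its field names are assumptions made for illustration; map them onto whatever your own log format exposes.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Hit:
    status: int      # HTTP status code returned to the crawler
    ttfb_ms: float   # time to first byte, in milliseconds (hypothetical field)


def capacity_health(hits: list[Hit]) -> dict[str, float]:
    """Summarise the signals that let the crawl ceiling rise or fall."""
    total = len(hits)
    if total == 0:
        return {"error_rate": 0.0, "p95_ttfb_ms": 0.0}
    # 429s and 5xx responses are the ones that make a crawler ease off.
    throttling = sum(1 for h in hits if h.status == 429 or h.status >= 500)
    latencies = [h.ttfb_ms for h in hits]
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = quantiles(latencies, n=20, method="inclusive")[-1] if total > 1 else latencies[0]
    return {"error_rate": throttling / total, "p95_ttfb_ms": p95}
```

Tracking these two numbers per day, for crawler traffic only, makes it easier to tell a capacity problem apart from a demand problem before any URL-level work starts.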

Crawl demand

Demand is the half you can actually influence. Google shapes it from three signals it names openly: perceived inventory, popularity, and staleness. Perceived inventory is every URL Google has ever found on your site, whether or not you wanted it crawled. Popularity tracks the internal and external links pointing at a URL. Staleness is Google's running guess at how often a page changes in a way that matters.

Capacity is mostly an engineering job, so for a lot of SEO teams it stays out of reach. Demand is where architecture, linking, and content decisions do their work, and most large sites are leaking far more demand-side crawl than they would guess.

When crawl budget actually matters

Google's own guidance here is refreshingly blunt. You should think about crawl budget if your site runs to roughly a million or more unique pages whose content changes about once a week, or ten thousand or more pages whose content changes daily. Below that, a fresh sitemap and an eye on the Page indexing report (formerly Index Coverage) will usually cover you.

Page count is not the only tell, though. If a large share of your URLs are stuck in the "Discovered - currently not indexed" bucket in Search Console, you have a crawl efficiency problem no matter how many pages you run. The same goes if your logs show Googlebot burning most of its visits on parameter URLs while your revenue pages get fetched once in a blue moon. The million-page number is really a stand-in for the question that matters, which is whether the crawler is spending its time where you need it.

Where large sites waste crawl budget

The waste almost always clusters into a handful of familiar shapes. None of them are exotic, and once you have seen them on one site you start spotting them everywhere.

Faceted navigation and URL parameters

A category page with five filters, each holding ten options, generates more than 160,000 URL combinations on its own, since each filter can be unset or set to any one of ten values and eleven states to the fifth power is 161,051. Sort orders multiply that again. Very few of those URLs are unique content and fewer still have any commercial value, yet every one is crawlable until you step in. Parameters like ?color=red, ?sort=price_asc, and ?size=M are ordinary front-end features that multiply the inventory a crawler thinks it has to work through.
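The arithmetic behind that kind of count can be sketched in a few lines. This assumes each filter is either unset or set to exactly one option. Multi-select filters would push the total far higher, and the three sort orders in the second call are a made-up figure.

```python
def facet_url_count(options_per_filter: list[int], sort_orders: int = 1) -> int:
    """Count distinct URL states when each filter is unset or set to one option."""
    total = 1
    for options in options_per_filter:
        total *= options + 1  # +1 for the "filter not applied" state
    return total * sort_orders


print(facet_url_count([10] * 5))                 # 161051
print(facet_url_count([10] * 5, sort_orders=3))  # 483153
```

Running the same function over your real filter counts per category template gives a quick upper bound on the inventory a crawler could perceive.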

There is no single switch that fixes this. What works is a layered policy that blocks the filter combinations with no search value in robots.txt, sets canonical tags on the duplicate filter states you keep, and points parameter-aware sitemaps only at the variants you genuinely want indexed.
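One way to keep a layered policy like this consistent is to write it down as a single function that every layer reads from. The sketch below is only an illustration: the parameter names, the choice of which filter counts as indexable, and the three return labels are all assumptions, not rules from any search engine.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical parameter sets; replace with your own site's facets.
FILTER_PARAMS = {"color", "size", "brand"}
SORT_PARAMS = {"sort"}
INDEXABLE_FILTERS = {"brand"}  # single-filter variants assumed to have search demand


def crawl_policy(url: str) -> str:
    """Decide which layer handles a URL: robots.txt, canonical tag, or sitemap."""
    keys = [k for k, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)]
    filters = [k for k in keys if k in FILTER_PARAMS]
    sorted_view = any(k in SORT_PARAMS for k in keys)
    if len(filters) > 1:
        return "robots_block"        # stacked filters: no search value
    if sorted_view or (filters and filters[0] not in INDEXABLE_FILTERS):
        return "canonical_to_clean"  # duplicate state kept for users
    return "sitemap"                 # clean URL or a variant worth indexing
```

For example, crawl_policy("https://example.com/shoes?color=red&size=M") returns "robots_block", while the same path with only ?brand=acme returns "sitemap". Generating the robots.txt rules, canonical tags, and sitemap entries from one function avoids the three layers drifting apart.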

Soft 404s and infinite crawl spaces

A soft 404 is a page that returns HTTP 200 while showing nothing of value, like a no-results screen, an out-of-stock product with no replacement, or an expired event that still resolves. Google keeps crawling these because nothing in the response tells it the page is dead. Infinite spaces are the nastier cousin. Calendar widgets that page forward forever, on-site search results that accept any input, and tag pages that combine into endless permutations all create holes a crawler can pour requests into and never climb back out of.

The catch with both is that they look harmless in a browser. You find them by pattern in the logs, not by clicking around, which is why log analysis tends to surface far more of them than any manual review does.
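Finding them by pattern can start as simply as collapsing every crawled URL into a template and counting how many distinct URLs share it. The sketch below assumes you already have a list of crawled URL paths from your logs. The masking rules and the threshold are arbitrary starting points, not established cut-offs.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def url_template(url: str) -> str:
    """Collapse a URL into a pattern by masking digits and query values."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    query = "?" + "&".join(f"{k}=*" for k in keys) if keys else ""
    return path + query


def suspicious_patterns(urls, min_distinct: int = 1000):
    """Return (template, distinct URL count) pairs above the threshold, largest first."""
    distinct = defaultdict(set)
    for url in urls:
        distinct[url_template(url)].add(url)
    flagged = [(t, len(u)) for t, u in distinct.items() if len(u) >= min_distinct]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A template such as /calendar/{n}/{n} or /search?q=* with tens of thousands of distinct URLs behind it is a strong hint of an infinite space. Whether a given template is a soft 404 still needs a look at what the pages actually render.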

Redirect chains and noindex traps

Every hop in a redirect chain is a separate request, and every URL in that chain stays in the crawler's queue waiting its turn. The bigger trap is using noindex as if it saved crawl. Google has to fetch a noindex page before it can read the directive, so noindex cuts a page out of the index without cutting the crawl that reaches it. When the goal is to free crawl for other URLs, robots.txt is the tool that does it, since it stops the request before it happens.

This distinction trips up a lot of teams who apply noindex broadly and then wonder why crawl allocation never improves. If a section genuinely should not be fetched, block it. If it should be fetched but kept out of the index, noindex is right. Mixing up the two is where a lot of wasted effort goes.
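On the redirect side, the cost of a chain is easy to make visible. The sketch below assumes you can export your redirect rules as a simple source-to-target mapping. It walks each chain, and the flatten helper shows the usual remedy of pointing every source straight at its final destination.

```python
def redirect_chain(start: str, redirects: dict[str, str], max_hops: int = 10) -> list[str]:
    """Follow a redirect map from start and return every URL visited, in order."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) <= max_hops:
        nxt = redirects[chain[-1]]
        chain.append(nxt)
        if nxt in seen:  # loop detected, stop following
            break
        seen.add(nxt)
    return chain


def flatten(redirects: dict[str, str]) -> dict[str, str]:
    """Point every source directly at the end of its chain."""
    return {src: redirect_chain(src, redirects)[-1] for src in redirects}
```

With redirects = {"/a": "/b", "/b": "/c"}, redirect_chain("/a", redirects) returns ["/a", "/b", "/c"], which is three requests to reach one page. After flatten, both /a and /b lead to /c in a single hop. Loops are only detected here, not resolved, so they need fixing by hand.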

A framework for diagnosing the waste

When a large site lands on the workbench, the sequence stays roughly the same every time.

Start with server logs, ideally thirty days of them, filtered down to verified Googlebot. The Crawl Stats report in Search Console gives a decent aggregate, but the raw logs are the ground truth and they are where the surprises live.
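Verification matters because the Googlebot user-agent string is trivially spoofed. Google documents a reverse DNS lookup followed by a forward lookup as one way to confirm a crawler IP, and a minimal sketch of that check is below. The injectable lookup arguments are an addition for testing, and gethostbyname_ex resolves IPv4 only, so treat this as a starting point and not a complete verifier.

```python
import socket

# Domains Google lists for its crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(ip: str, reverse_lookup=None, forward_lookup=None) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, or the PTR record is suspect.
        return ip in forward_lookup(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

At thirty days of log volume, caching the result per IP keeps the DNS traffic manageable. Google also publishes its crawler IP ranges, which can be matched offline instead.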

From there, bucket the requests by URL pattern, so product pages, category pages, faceted variants, blog content, static assets, redirects, 4xxs, and 5xxs each get their own tally. Set those buckets against the URLs you actually care about ranking. The ratio of crawl spent on important pages versus everything else is usually the first thing that raises eyebrows.
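A bucketing pass like this can be sketched with a short ordered list of patterns. The URL prefixes and parameter names below are hypothetical and stand in for your own site's structure. The input is assumed to be (url, status) pairs already filtered to verified Googlebot.

```python
import re
from collections import Counter

# Hypothetical URL patterns; order matters because the first match wins.
BUCKETS = [
    ("faceted", re.compile(r"\?.*\b(color|size|sort)=")),
    ("product", re.compile(r"^/p/")),
    ("category", re.compile(r"^/c/")),
    ("blog", re.compile(r"^/blog/")),
    ("static", re.compile(r"\.(css|js|png|jpg|svg|woff2)(\?|$)")),
]


def bucket(url: str, status: int) -> str:
    """Assign one request to a bucket, with status classes taking priority."""
    if status >= 500:
        return "5xx"
    if status >= 400:
        return "4xx"
    if status >= 300:
        return "redirect"
    for name, pattern in BUCKETS:
        if pattern.search(url):
            return name
    return "other"


def tally(hits) -> Counter:
    """Count requests per bucket from an iterable of (url, status) pairs."""
    return Counter(bucket(url, status) for url, status in hits)
```

Putting the faceted pattern first means a filtered category URL is counted as waste and not as a category page, which is usually the split you want when comparing against the pages you care about.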

Then rank the waste patterns by how much crawl share they eat. Soft 404s, parameter explosions, and redirect chains almost always sit near the top. Fix them in order of impact rather than order of convenience, since server-level returns and robots.txt rules tend to move more crawl faster than canonical or sitemap changes do.
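Ranking by crawl share is a small step once the buckets exist. The sketch below assumes a dictionary of request counts per bucket and a set naming the buckets you consider important. Both are inputs you would supply, and the numbers in the example are invented.

```python
def rank_waste(counts: dict[str, int], important: set[str]) -> list[tuple[str, float]]:
    """Return non-important buckets with their share of total crawl, largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    waste = [(name, n / total) for name, n in counts.items() if name not in important]
    return sorted(waste, key=lambda item: item[1], reverse=True)


# Invented counts, for illustration only.
example = {"product": 200, "category": 100, "faceted": 500, "redirect": 150, "4xx": 50}
for name, share in rank_waste(example, important={"product", "category"}):
    print(f"{name}: {share:.0%}")
```

Re-running the same ranking after each fix is what turns the list into a measurement, since it shows whether the crawl freed from one bucket actually moved to the important ones.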

The step teams skip is the quantifying. Jumping to fixes without measuring which ones reclaim the most crawl is how you end up six months in with a pile of changes and barely any indexation to show for it.

A worked example

Picture a mid-sized fashion retailer with 60,000 unique products. Layer on five filter dimensions per category, plus pagination and sort orders, and the crawlable URL space runs past two million. In a case shaped like this, logs will often show Googlebot spending the bulk of its requests on filtered or sorted variants while only a thin slice ever reaches an actual product page, and new products taking a week or two to get discovered at all.

The fix is a short list of moves. Block every filter combination beyond a single active filter in robots.txt. Point the canonical on the remaining filtered pages back at the clean category. Return 410 for products that are gone for good instead of leaving soft 404 placeholders behind. Once the crawler re-settles, the share of crawl landing on product pages climbs substantially and discovery time for new products drops to a couple of days.
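Before a rule like the first one ships, it helps to test it against real URLs from the logs. The sketch below translates a robots.txt path pattern that uses the * wildcard and a trailing $ anchor into a regular expression. The rule /*?*& is a hypothetical way to express "two or more query parameters", and it is blunt, since it would also catch a single filter combined with a pagination parameter.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern using * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts, then let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


# Hypothetical rule: any URL whose query string carries two or more parameters.
MULTI_PARAM = robots_pattern_to_regex("/*?*&")


def is_blocked(path_and_query: str) -> bool:
    """Check a path plus query string against the hypothetical Disallow pattern."""
    return MULTI_PARAM.match(path_and_query) is not None
```

Here is_blocked("/c/dresses?color=red&size=M") is true, while is_blocked("/c/dresses?color=red") and is_blocked("/c/dresses") are false. This only mirrors the pattern syntax. It does not model Allow rules or longest-match precedence, so confirm the final file with a robots.txt tester.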

The tactics matter less than the sequence that produced them, because without the log data that same team would have spent the whole quarter arguing about which filters to block on gut feel.

What does not actually help

A few popular fixes do less than their reputation suggests.

Small content edits will not speed up crawling, because Google can tell the difference between a meaningful change and a cosmetic one, and nudging a date or reshuffling whitespace reads as the latter. Stripping third-party scripts does little for crawl budget either, since those files are served from someone else's host and fetching them does not draw on your origin's crawl capacity. Reaching for noindex to save crawl actively backfires, because the page still has to be crawled before the directive is even seen. And the crawl-delay directive in robots.txt is ignored by Googlebot outright. The crawl rate setting that used to live in Search Console is not a fallback either, since Google retired it in early 2024. If you truly need to slow Googlebot down for a short period, the documented lever is to return 500, 503, or 429 responses until the origin recovers.

The pattern underneath all of these is that surface-level tactics do not reclaim crawl. What reclaims crawl is reshaping the URL space the crawler sees, and that is mostly a matter of information architecture and how you handle faceted navigation, with redirect cleanup and server performance sitting close behind.

How Bing and AI crawlers fit in

Googlebot gets the attention, but it is not the only crawler with an opinion about your URL space anymore. Bingbot runs on the same underlying logic of capacity and demand, and it matters more than its search share suggests, because the Bing index is what grounds Copilot and a meaningful portion of ChatGPT's web answers. If Bingbot is wasting its budget on your parameter URLs, that shows up as your brand being thinner in AI answers, not just lower in Bing's own results. Bing Webmaster Tools exposes its crawl information the same way Search Console does for Google, and it is worth watching for exactly this reason.

The AI-specific crawlers from OpenAI, Anthropic, and Perplexity behave along the same lines. They lean toward popular, well-linked, frequently updated pages, they respect the capacity limits your origin sets, and they burn cycles on the same soft 404s, parameter sprawl, and infinite spaces that drain Googlebot. A site with healthy traditional crawl efficiency almost always has a healthy AI crawl posture too, because the waste that hurts one hurts all of them at once. Fixing crawl efficiency is the groundwork that pays off across every engine now reading your site, which is why it belongs ahead of any AI-specific optimization.
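To see how that crawl is split across engines, the same logs can be tallied by crawler family. The sketch below matches on user-agent tokens the operators have published. The list is not exhaustive, tokens change over time, and a user-agent match alone does not prove the request is genuine, so treat the result as an estimate.

```python
from collections import Counter

# User-agent tokens mapped to a family label; check each operator's current docs.
CRAWLER_TOKENS = {
    "Googlebot": "google",
    "bingbot": "bing",
    "GPTBot": "openai",
    "OAI-SearchBot": "openai",
    "ClaudeBot": "anthropic",
    "PerplexityBot": "perplexity",
}


def crawler_family(user_agent: str) -> str:
    """Map a raw user-agent string to a crawler family, or 'other'."""
    ua = user_agent.lower()
    for token, family in CRAWLER_TOKENS.items():
        if token.lower() in ua:
            return family
    return "other"


def tally_by_crawler(user_agents) -> Counter:
    """Count requests per crawler family from an iterable of user-agent strings."""
    return Counter(crawler_family(ua) for ua in user_agents)
```

Crossing this tally with URL-pattern buckets shows whether each crawler is losing requests to the same parameter URLs and dead pages, which is the claim worth checking on your own site.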

Crawl budget behaves like any other budget. Whatever a crawler wastes on dead URLs comes straight out of what is left for the pages that earn you revenue. For a large site the real question is not whether to manage it but whether your team has the log data, the authority over architecture, and the discipline to keep spending it on the URLs that matter. The sites that hold their crawl gains are the ones that build the habit into how engineering and SEO work day to day, rather than treating it as a cleanup they run once and forget.
