Bots Now Outnumber Humans on the Web. What Your Server Logs Should Tell You
Automated systems crossed 57.5% of all HTTP requests this week, the first time bots have outnumbered humans online, and Cloudflare Radar attributes the tip-over mainly to agentic AI (TechTimes, June 5, 2026). For most sites the practical meaning is blunt. More than half of the requests your servers answer now come from machines, and a large and growing slice of those machines are AI crawlers that may take thousands of pages from you for every visitor they send back.
That ratio is the number worth managing. Not every bot is a cost and not every bot is a channel, and your own logs are the only place the difference shows up for your specific site. Industry averages will point you in roughly the right direction and mislead you on the details, because the economics swing hard by vendor and by vertical.
Three kinds of AI bot, three different jobs
Lumping all AI traffic into one "AI bots" bucket is the first mistake, because the three purposes behave differently and deserve different policies. Cloudflare's own crawl-purpose breakdown found training traffic responsible for nearly 80% of AI crawling, with search-indexing and live user-action fetches together under 5%.
| Purpose | What the bot is doing | Example agents | What it returns to you |
|---|---|---|---|
| Training | Scraping content to train a model | ClaudeBot, GPTBot, Meta-ExternalAgent | Nothing measurable |
| Search indexing | Building an index for an AI search product | OAI-SearchBot, PerplexityBot | Citations, some referral clicks |
| Live retrieval | Fetching a page in response to a user prompt | ChatGPT-User, Perplexity-User | Direct referral, a real visit |
The split matters because a training crawl and a retrieval fetch can come from the same vendor under different user agents. OpenAI alone runs GPTBot for training, OAI-SearchBot for its search index, and ChatGPT-User for live prompts. Treating them as one entity in your logs throws away the only distinction that affects revenue.
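A minimal sketch of that segmentation in Python, assuming your logs expose the raw User-Agent header. The agent tokens come from the table above; the function name, the "other" fallback, and the sample header strings are illustrative assumptions, and substring matching says nothing about whether the request is genuine.

```python
# Agent tokens from the table above, mapped to crawl purpose.
AI_AGENT_PURPOSE = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Meta-ExternalAgent": "training",
    "OAI-SearchBot": "search_indexing",
    "PerplexityBot": "search_indexing",
    "ChatGPT-User": "live_retrieval",
    "Perplexity-User": "live_retrieval",
}


def classify_agent(user_agent: str) -> str:
    """Return the crawl purpose for a User-Agent header, or 'other'."""
    ua = user_agent.lower()
    for token, purpose in AI_AGENT_PURPOSE.items():
        # Case-insensitive substring match on the token the bot advertises.
        if token.lower() in ua:
            return purpose
    return "other"


# Illustrative header strings, not copied from any vendor's documentation.
print(classify_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

Keeping the purpose as its own column in whatever table you load logs into means the same vendor can appear on three rows with three different policies.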
Compute your own crawl-to-referral ratio
One metric turns log noise into a decision: crawl-to-referral, the number of pages a vendor crawled divided by the number of visits its product sent back. SEOmator's analysis of Cloudflare Radar data for January through March 2026 shows how wide the spread runs.
| Operator | Crawl-to-referral ratio | Read |
|---|---|---|
| Anthropic (ClaudeBot) | 23,951 to 1 | 23,951 pages taken per visit returned |
| OpenAI (GPTBot) | 1,276 to 1 | Heavy take, thin return |
| Perplexity | 111 to 1 | Moderate, and it cites sources |
| Microsoft (Copilot) | 33 to 1 | Reasonable exchange |
| Google (Gemini, AI Overviews) | 5 to 1 | Search economics still hold |
| DuckDuckGo | 1.5 to 1 | Near parity |
Anthropic's number reflects a business model, not inefficiency. ClaudeBot trains a model. Anthropic runs no search product that links back, so the referral side is close to zero by design. Read the table that way. A bot near the top takes content for training; a bot near the bottom participates in the old crawl-for-traffic bargain.
Your own ratios will differ from these averages, sometimes by an order of magnitude, because your vertical changes everything. In the same dataset, Perplexity returned a 42 to 1 ratio for finance sites and 182 to 1 for shopping, a more than four-fold swing driven by which queries send users looking for an authoritative source to click. Compute yours from your logs, not from a chart built on someone else's traffic.
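The calculation itself is a single division per vendor. The sketch below assumes you have already counted crawled pages per vendor from your logs and referral visits per vendor from your analytics; the function name, the vendor keys, and the counts are hypothetical.

```python
def crawl_to_referral(crawled_pages: dict, referral_visits: dict) -> dict:
    """Pages crawled divided by visits returned, per vendor.

    A vendor with crawls but no referrals gets float('inf'),
    which sorts it to the top of any 'most extractive' list.
    """
    ratios = {}
    for vendor, pages in crawled_pages.items():
        visits = referral_visits.get(vendor, 0)
        ratios[vendor] = pages / visits if visits else float("inf")
    return ratios


# Hypothetical monthly counts for two vendors.
crawled = {"vendor_a": 12000, "vendor_b": 900}
referred = {"vendor_a": 40}

for vendor, ratio in crawl_to_referral(crawled, referred).items():
    print(vendor, ratio)
```

Treating a zero-referral vendor as infinite rather than skipping it keeps training-only crawlers visible in the report instead of silently dropping them.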
Why a training crawler is safe to block and a retrieval bot isn't
Once you can separate the three purposes and score each vendor, the policy almost writes itself. Training-only crawlers that send no referrals can be blocked with little downside to your traffic today, since blocking them costs you only visits that weren't coming anyway. Meta-ExternalAgent is the cleanest case, the single largest AI crawler by volume with no referral mechanism attached.
Live-retrieval bots are the opposite. ChatGPT-User and Perplexity-User fetch your page because a real person asked a question your page can answer, and blocking them removes you from the answer at the exact moment a buyer is deciding. Small in volume, outsized in value. That gap is the whole reason to segment before you block.
The genuine tension sits with the training crawlers of vendors who also run a growing search product. Block GPTBot today and you save bandwidth. You also keep your content out of the model that increasingly fronts ChatGPT's answers. ChatGPT referrals are tiny now at roughly 0.2% of all referral traffic, and they're growing fast off that base (Cloudflare Radar, March 2026). The call comes down to your horizon. This quarter's server bill, or next year's citation share.
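The reasoning in this section can be written down as a small decision function. This is a sketch of the argument, not a rule set from any vendor or CDN: the function name, the three return labels, and the choice to allow search-indexing bots by default are assumptions for illustration.

```python
def suggest_policy(purpose: str, referral_visits: int,
                   vendor_has_search_product: bool) -> str:
    """Map a bot's purpose and referral record to 'allow', 'block', or 'review'."""
    if purpose == "live_retrieval":
        # A real person asked a question; blocking removes you from the answer.
        return "allow"
    if purpose == "search_indexing":
        # Assumption: indexing bots earn citations and some clicks, so allow.
        return "allow"
    if purpose == "training":
        if vendor_has_search_product:
            # Bandwidth today versus citation share later: a human decision.
            return "review"
        if referral_visits == 0:
            # Training-only and nothing coming back.
            return "block"
    return "review"


print(suggest_policy("training", 0, vendor_has_search_product=False))
print(suggest_policy("training", 0, vendor_has_search_product=True))
print(suggest_policy("live_retrieval", 3, vendor_has_search_product=True))
```

The "review" label is deliberate: the function only automates the cases the text treats as clear-cut and hands the horizon question back to a person.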
robots.txt is a request, and some crawlers ignore it
A blocking strategy built only on robots.txt has a hole in it. The file is a polite request, honored by the major Western vendors and ignored by a class of aggressive crawlers, with ByteDance's Bytespider the name that comes up most. Cloudflare has documented AI crawlers that fetch content despite robots.txt directives telling them not to.
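To see what a compliant crawler would conclude from your file, Python's standard `urllib.robotparser` can evaluate it locally. The robots.txt body, the `example.com` URL, and the helper name below are illustrative assumptions. The result describes only what a well-behaved bot should do, not what a non-compliant one will do.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: refuse one training crawler, welcome one retrieval bot.
ROBOTS_TXT = """\
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def compliant_bot_may_fetch(agent: str, url: str) -> bool:
    """What a bot that honors robots.txt would decide for this URL."""
    return parser.can_fetch(agent, url)


print(compliant_bot_may_fetch("Meta-ExternalAgent", "https://example.com/pricing"))
print(compliant_bot_may_fetch("ChatGPT-User", "https://example.com/pricing"))
```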
So the enforcement ladder has three rungs, and most teams only build the first.
1. robots.txt disallow, which the compliant bots respect and the rest read as a suggestion.
2. Edge or WAF rules that block by verified user agent and IP, which stop the non-compliant ones at the door before they cost you bandwidth.
3. Verification on the bots you do allow, since the user agent is a text field and scrapers impersonate Googlebot and GPTBot to slip past filters.
A robots.txt line and a Cloudflare or Fastly rule are not interchangeable. One states a preference, the other enforces it.
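One way to sketch the verification rung is to check the connecting IP against address ranges the vendor publishes for its crawler, assuming the vendor publishes them at all. The CIDR blocks below are reserved documentation ranges used as placeholders, not real crawler addresses, and the dictionary and function names are hypothetical. In practice you would load the ranges from the vendor's own published list.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real crawler IPs.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "Googlebot": ["198.51.100.0/24"],
}


def is_verified(claimed_agent: str, remote_ip: str) -> bool:
    """True only if the IP falls inside a range listed for the claimed agent."""
    ip = ipaddress.ip_address(remote_ip)
    networks = PUBLISHED_RANGES.get(claimed_agent, [])
    return any(ip in ipaddress.ip_network(cidr) for cidr in networks)


# A request claiming to be GPTBot from inside and outside the listed range.
print(is_verified("GPTBot", "192.0.2.10"))
print(is_verified("GPTBot", "203.0.113.5"))
```

An agent with no listed ranges fails verification by default, which is the safer direction for a filter whose job is to catch impersonators.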
What 57.5% does to your analytics and your bill
Two quieter costs ride along with the bot majority. Your analytics baseline is now built on a minority of your actual traffic, and any server-side metric that doesn't filter bots (raw request counts, bandwidth dashboards, log-based "visits") is measuring machines as if they were customers. Capacity planning off unfiltered logs overprovisions for an audience that will never convert.
The infrastructure cost is the one a CFO notices. When nearly 80% of AI crawling is training traffic that returns nothing, the bandwidth and compute spent serving it are a direct subsidy to model vendors. For a content-heavy site at scale, that line is real money, and it's the easiest part of this to quantify in a single afternoon with your logs and your CDN bill.
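The afternoon's arithmetic is short. The sketch below assumes each log record has already been reduced to a purpose label and a response size in bytes, and that your CDN bills egress at a flat per-gigabyte rate. The record values and the 0.08 rate are made-up inputs, not figures from any provider, and compute cost is left out.

```python
def training_bandwidth_cost(records, price_per_gb: float) -> float:
    """Egress cost of bytes served to training crawlers.

    records: iterable of (purpose, response_bytes) tuples.
    price_per_gb: your own CDN rate per decimal gigabyte (assumed flat).
    """
    training_bytes = sum(size for purpose, size in records
                         if purpose == "training")
    return training_bytes / 1e9 * price_per_gb


# Hypothetical month: 500 GB to training crawlers, 1 GB to live retrieval.
month = [
    ("training", 300_000_000_000),
    ("live_retrieval", 1_000_000_000),
    ("training", 200_000_000_000),
]
print(round(training_bandwidth_cost(month, price_per_gb=0.08), 2))
```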
A monthly review one engineer can own
None of this needs a standing project. It needs thirty minutes a month and an owner.
1. Pull the last 30 days of logs, filter to verified bot traffic, and group AI agents by the three purposes above.
2. Cross-reference each vendor's crawl volume against referrals from its product in GA4 (chatgpt.com, perplexity.ai, and the rest), and recompute your per-vendor ratios.
3. Move any training-only, zero-referral crawler that's grown since last month onto the edge block list.
4. Confirm your allowed retrieval bots are still verified and still being served cleanly.
Run it on the first of the month and the trend line does the analysis for you. A vendor whose crawl volume climbs while its referrals stay flat is shifting from channel to cost. You'll see the turn a quarter early.
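The month-over-month comparison can be sketched as a filter over two snapshots. Each snapshot maps a vendor to a (crawled pages, referral visits) pair. The function name, the vendor keys, and the 20% growth threshold are assumptions you would tune to your own traffic.

```python
def shifting_to_cost(previous: dict, current: dict,
                     crawl_growth: float = 1.2) -> list:
    """Vendors whose crawl volume grew while referrals did not.

    crawl_growth is an assumed threshold: 1.2 means 20% more pages
    crawled than in the previous snapshot.
    """
    flagged = []
    for vendor, (crawls_now, refs_now) in current.items():
        if vendor not in previous:
            continue  # no baseline yet, nothing to compare
        crawls_prev, refs_prev = previous[vendor]
        grew = crawls_prev > 0 and crawls_now >= crawls_prev * crawl_growth
        if grew and refs_now <= refs_prev:
            flagged.append(vendor)
    return sorted(flagged)


# Hypothetical snapshots: (crawled pages, referral visits) per vendor.
last_month = {"vendor_a": (1000, 10), "vendor_b": (1000, 10)}
this_month = {"vendor_a": (1500, 10), "vendor_b": (1500, 20)}
print(shifting_to_cost(last_month, this_month))
```

Only the first vendor is flagged here, because the second one's referrals doubled alongside its crawl volume.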
This sits one layer beneath the robots.txt and sitemap decisions most teams have already made. Those decide what to allow. The log review measures what those decisions actually cost and earn, on your traffic rather than an industry average.
If you want a second read on which AI crawlers are worth your bandwidth in your specific vertical, that is exactly what we measure. Our free AI Visibility Audit scores your brand and three competitors across ChatGPT, Gemini, Perplexity, and Google AI Overviews, and the walkthrough covers the crawl-economics breakdown for your own logs. Request your audit and we'll tell you which bots to feed and which to block.