Blocking every AI bot with one line costs you the citations you actually want.

By Ridho Putradi S'Gara · Jun 27, 2026 · 6 min read

Table of contents

1. The three jobs, and which bot does which
2. The pattern OpenAI and Anthropic built for you
3. Google-Extended is not what most people think
4. Whether to allow training at all is a real decision
5. robots.txt only works on bots that obey it
6. robots.txt EXAMPLE
7. Watch the logs, because the labels can lie

The reflex, when AI crawlers show up in your logs, is to write one line and be done: a Disallow: / under User-agent: *, or a blanket block of every agent with "GPT" or "AI" in the name. It feels decisive, but it is the wrong move, because it treats crawlers that do completely different jobs as one thing.

An AI crawler is doing one of three jobs. It is collecting content to train a model, or it is fetching your page right now to answer a user's question and cite you, or it is acting on a single user's instruction to read a page. Those are not the same decision. Blocking the training crawler keeps your content out of a foundation model. Blocking the retrieval crawler removes you from AI answers, often within hours. A blanket disallow does both at once, which means you can delete yourself from ChatGPT and Perplexity results while thinking you only opted out of training.

The three jobs, and which bot does which

Training crawlers fetch content to build or refine model training sets. Once your page is in the training data, the model carries a fuzzy memory of it without fetching again. These are GPTBot from OpenAI, ClaudeBot from Anthropic, CCBot from Common Crawl, whose open dataset feeds many model builders, and Google-Extended, which is a control token rather than a crawler. Blocking these governs whether your words become model weights.

Retrieval crawlers fetch in real time to answer a live query and link back to you. These are OAI-SearchBot, which surfaces and links sites inside ChatGPT search, Anthropic's Claude-SearchBot, and PerplexityBot. They are the crawlers that earn you citations in AI answers. Block one and your pages stop being eligible to appear in that engine's responses, usually fast.

User-triggered agents fetch a specific page because a person asked the assistant to read it. These are ChatGPT-User, Claude-User, and Perplexity-User. They are not bulk crawlers. Each request is one person opening a link through an assistant, and blocking them mostly just breaks that experience for someone who was already trying to reach you.

User-agent	Operator	Job	What blocking it does
`GPTBot`	OpenAI	Training	Keeps content out of OpenAI model training
`OAI-SearchBot`	OpenAI	Retrieval for citation	Removes you from ChatGPT search results
`ChatGPT-User`	OpenAI	User-triggered	Breaks user-initiated page reads in ChatGPT
`ClaudeBot`	Anthropic	Training	Keeps content out of Claude training
`Claude-SearchBot`	Anthropic	Retrieval for citation	Removes you from Claude's cited results
`PerplexityBot`	Perplexity	Retrieval for citation	Removes you from Perplexity answers
`Google-Extended`	Google	Training and grounding control	Opts you out of Gemini training, does not touch Search
`Applebot-Extended`	Apple	Training control	Opts you out of Apple model training
`CCBot`	Common Crawl	Training dataset	Keeps you out of a dataset many trainers use

The pattern OpenAI and Anthropic built for you

The vendors deliberately split these so you can make different calls. OpenAI's documentation spells out the canonical move directly: allow OAI-SearchBot while disallowing GPTBot. That gives you presence in ChatGPT search without contributing your content to model training. Anthropic mirrors it: Claude-SearchBot is controllable independently of ClaudeBot, so you can be cited in Claude's answers while staying out of its training data. If you only ever learn one thing about AI crawler control, learn that retrieval and training are separate switches and the vendors expect you to set them separately.
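
To make the two-switch idea concrete, here is a minimal Python sketch that turns a per-job policy into robots.txt groups. The CRAWLER_JOBS mapping and the build_robots_txt helper are hypothetical names chosen for illustration. The job labels simply follow the table above, and nothing here is a vendor-supplied tool.

```python
# Hypothetical mapping of user-agent tokens to the job each one does,
# following the table in this article. Extend or trim it to suit your site.
CRAWLER_JOBS = {
    "OAI-SearchBot": "retrieval",
    "Claude-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Google-Extended": "training",
}


def build_robots_txt(policy: dict[str, bool]) -> str:
    """Return robots.txt text with one group per crawler.

    `policy` maps a job name ("retrieval", "training") to True (allow)
    or False (disallow). Every job used in CRAWLER_JOBS must be present.
    """
    groups = []
    for agent, job in CRAWLER_JOBS.items():
        rule = "Allow: /" if policy[job] else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Be citable in AI answers, stay out of training sets.
    print(build_robots_txt({"retrieval": True, "training": False}))
```

Keeping the policy keyed by job rather than by bot means the training decision can be flipped in one place without touching the retrieval rules.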

Google-Extended is not what most people think

Google needs its own paragraph because it behaves unlike the others. Google-Extended is not a crawler that fetches pages. It is a permission token that controls whether content Googlebot already crawled can be used to train Gemini and to ground Vertex AI. Blocking Google-Extended does not reduce how Google Search crawls or ranks you. It only removes you from Gemini's training and grounding.

The consequence people miss is the reverse. You cannot opt out of Google's AI Overviews through robots.txt without also leaving Search, because AI Overviews are part of Search and run on the same Googlebot crawl that powers your blue links. Within robots.txt, the only way to keep your content out of AI Overviews is to block Googlebot, which also deletes you from Search. Google's documented alternative sits outside robots.txt, in the snippet controls (nosnippet, data-nosnippet, max-snippet), and those limit your ordinary Search snippets too. There is no clean separation there, and anyone promising you one is selling something. Google-Extended controls Gemini. It does not control AI Overviews.

Whether to allow training at all is a real decision

The retrieval call is easy. If you want to appear in AI answers, allow the retrieval crawlers. The training call is genuinely contested, and reasonable operators land on opposite sides.

The case for allowing training is that being in the model's baseline knowledge builds brand recall that does not depend on a live fetch. When someone asks an assistant about your category and your brand surfaces from training memory rather than a retrieval lookup, that is durable presence you did not have to win query by query. For brands trying to become a default answer in their space, training inclusion is a long game worth playing.

The case against is that you are handing your content to a commercial model for free, with no attribution at the point of training, no traffic back, and no compensation, while that model may go on to answer questions your content would otherwise have earned the click for. Publishers with a paywall or a licensing strategy have every reason to block training and negotiate terms instead. Both positions are defensible. The point is that it is a decision to make deliberately per crawler, not a default to inherit from a blanket rule.

robots.txt only works on bots that obey it

Here is the limit you have to design around. robots.txt is a request, not a fence. It works because reputable operators choose to honor it, and it does nothing to a crawler that decides not to. In August 2025 Cloudflare published evidence that Perplexity was using stealth, undeclared crawlers to evade no-crawl directives, rotating user agents and source networks to fetch pages on brand-new domains whose robots.txt blocked all bots. Cloudflare de-listed Perplexity as a verified bot and added rules to block the behavior. Perplexity disputed the characterization.

Take the lesson regardless of how that specific dispute resolves. If your content genuinely must not be fetched, robots.txt is necessary but not sufficient. You need enforcement at the edge, a WAF or bot-management layer that verifies a crawler is who it claims to be by checking its published IP ranges, and blocks the ones that lie. robots.txt expresses your preference. The firewall enforces it.
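
The verification step can be sketched in a few lines of Python. This is only an outline of the idea, not how any particular WAF implements it. The CIDR blocks below are placeholders from the reserved documentation ranges, not real operator ranges, and the names PUBLISHED_RANGES, claimed_bot, and is_verified are hypothetical. In practice you would load each operator's currently published list instead.

```python
import ipaddress
from typing import Optional

# PLACEHOLDER ranges (RFC 5737 documentation blocks), NOT real crawler IPs.
# Replace with the lists each operator actually publishes.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent: str) -> Optional[str]:
    """Return the known crawler token found in the user-agent string, if any."""
    ua = user_agent.lower()
    for token in PUBLISHED_RANGES:
        if token.lower() in ua:
            return token
    return None


def is_verified(user_agent: str, remote_ip: str) -> Optional[bool]:
    """True: claims a known bot and the IP is in its ranges.
    False: claims a known bot from an unlisted IP (likely impersonation).
    None: does not claim to be a known bot at all.
    """
    token = claimed_bot(user_agent)
    if token is None:
        return None
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[token])


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"
    print(is_verified(ua, "192.0.2.44"))   # inside the placeholder range
    print(is_verified(ua, "203.0.113.9"))  # outside it
```

A False result is the case worth blocking at the edge: the request uses a trusted label from a network the operator never listed.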

robots.txt EXAMPLE

This configuration keeps you eligible for AI citations across the major assistants while opting out of foundation-model training. Adjust the training block to taste once you have made that call.

# Allow retrieval crawlers, you want these citations
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Two things to remember about the syntax. Each crawler follows only the group that names its user agent, or the wildcard group if none matches, so any bot you want to treat differently needs a group of its own. And these directives never touch Googlebot or Bingbot, so your traditional search crawling is untouched by everything above.
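
One way to sanity-check a file like this before deploying it is Python's standard-library urllib.robotparser. The sketch below parses the same rules and asks what each agent may fetch. The is_allowed helper and the /any-page path are illustrative, and the standard-library parser uses its own simplified matching, so treat the result as a smoke test rather than proof of how every real crawler will read the file.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above, embedded so the check is self-contained.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def is_allowed(agent: str, path: str = "/any-page") -> bool:
    """Ask the stdlib parser whether `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("OAI-SearchBot", "GPTBot", "Google-Extended", "Googlebot"):
        print(f"{agent}: {'allowed' if is_allowed(agent) else 'blocked'}")
```

Under this parser the retrieval bots come back allowed, the training tokens come back blocked, and Googlebot, which no group names and for which there is no wildcard group, is allowed by default.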

Watch the logs, because the labels can lie

Once the rules are live, confirm them against reality. Your server logs show which AI user agents are actually hitting you and how often, which is the only way to know whether your rules are being respected and whether anything is impersonating a known crawler from an IP range that does not belong to its operator. Verify the big retrieval bots against their operators' published IP lists, and treat a crawler claiming to be OAI-SearchBot from an unlisted network as exactly the kind of thing your edge rules exist to catch.
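
As a starting point for that log review, here is a small Python sketch that counts hits per AI user agent and collects the source IPs behind each one. It assumes the common "combined" access log layout, where the user agent is the last quoted field. The sample lines, the AI_TOKENS list, and the count_ai_hits helper are all invented for illustration.

```python
import re
from collections import Counter

# Tokens to look for in the user-agent field (illustrative, not exhaustive).
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "PerplexityBot", "CCBot",
]

# Combined log format: client IP first, user agent as the last quoted field.
LINE_RE = re.compile(r'^(\S+) .*"([^"]*)"$')

# Invented sample lines using documentation IP ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [27/Jun/2026:10:00:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.10 - - [27/Jun/2026:10:00:05 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [27/Jun/2026:10:01:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.5 - - [27/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/127.0"',
]


def count_ai_hits(lines):
    """Return (hits per token, set of source IPs per token) for AI user agents."""
    hits = Counter()
    ips = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        ip, user_agent = match.groups()
        ua = user_agent.lower()
        for token in AI_TOKENS:
            if token.lower() in ua:
                hits[token] += 1
                ips.setdefault(token, set()).add(ip)
                break
    return hits, ips


if __name__ == "__main__":
    hits, ips = count_ai_hits(SAMPLE_LOG)
    for token, n in hits.most_common():
        print(token, n, sorted(ips[token]))
```

The per-token IP sets are the useful part: they are what you compare against the operators' published lists to spot a label that does not match its network.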

Segment first, enforce second, monitor always. One line of robots.txt was never going to carry a decision this layered.