
Which AI Bots to Block in robots.txt (and Which to Keep)

Not all AI crawlers are the same. Some scrape your content to train models. Others power AI search results that send you traffic. Blocking the wrong ones makes your site invisible to ChatGPT, Perplexity, and Gemini. Here is how to tell them apart.

Two types of AI crawlers

AI companies use separate bots for separate purposes. The distinction matters because blocking a training bot protects your intellectual property, while blocking a retrieval bot removes you from AI-powered search results entirely.

Training bots download your content to build datasets for training future models. Your content becomes part of the model weights. You get no attribution and no traffic. Retrieval bots fetch your content in real time to answer a specific user query. They cite your page and often link back to it.

Training bots (safe to block)

These crawlers collect content for model training. Blocking them has no effect on your search visibility. If you do not want your content in training datasets, block all of these.

| User Agent | Operator | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Trains future GPT models |
| ClaudeBot | Anthropic | Trains future Claude models |
| CCBot | Common Crawl | Open dataset used by many AI labs |
| Google-Extended | Google | Trains Gemini models |
| Bytespider | ByteDance | Trains TikTok/Doubao models |
| Amazonbot | Amazon | Trains Alexa and internal models |

Retrieval bots (think twice before blocking)

These crawlers fetch your content live when a user asks a question. They power the AI search results in ChatGPT, Perplexity, Google AI Overviews, and Apple Intelligence. Blocking them means your pages will never appear in those results.

| User Agent | Operator | Purpose |
| --- | --- | --- |
| ChatGPT-User | OpenAI | Powers ChatGPT search (live results) |
| PerplexityBot | Perplexity | Powers Perplexity search answers |
| GoogleOther | Google | Used for AI Overviews and other AI features |
| Applebot-Extended | Apple | Used for Apple Intelligence search features |

Recommended: block training, allow retrieval

This is the configuration most sites should use. It prevents your content from being used to train models while keeping you visible in AI search results.

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

# Allow AI search/retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

# Allow regular search engines
User-agent: *
Allow: /
Disallow: /dashboard/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml

Alternative: block all AI crawlers

If you want to block all AI access, both training and retrieval, use this configuration. Be aware that your content will not appear in ChatGPT search, Perplexity answers, or Google AI Overviews.

# Block all known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GoogleOther
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow regular search engines
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Implementing this in Next.js

If your site runs on Next.js App Router, you can generate robots.txt programmatically using the app/robots.ts file:

// app/robots.ts
import type { MetadataRoute } from "next"

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Block AI training crawlers
      { userAgent: "GPTBot", disallow: ["/"] },
      { userAgent: "ClaudeBot", disallow: ["/"] },
      { userAgent: "CCBot", disallow: ["/"] },
      { userAgent: "Google-Extended", disallow: ["/"] },
      { userAgent: "Bytespider", disallow: ["/"] },
      { userAgent: "Amazonbot", disallow: ["/"] },
      // Allow everything else (including retrieval bots)
      {
        userAgent: "*",
        allow: "/",
        disallow: ["/dashboard/", "/api/"],
      },
    ],
    sitemap: "https://yoursite.com/sitemap.xml",
  }
}

This approach keeps your robots.txt in version control and type-checked. No static file to forget about.
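If you want a sanity check that a given agent actually ends up blocked, a small script can parse the generated file. The sketch below is illustrative only, not a full RFC 9309 parser: it only answers whether an agent's matching group disallows the root path, which is all the configurations in this article use.

```typescript
// Minimal robots.txt check (illustrative sketch, not a full RFC 9309 parser):
// returns true if the given user agent is disallowed from "/".
function isRootDisallowed(robotsTxt: string, userAgent: string): boolean {
  const groups: { agents: string[]; disallowRoot: boolean }[] = [];
  let agents: string[] = [];
  let disallowRoot = false;
  let inRules = false;

  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments and whitespace
    if (!line) continue;
    const [field, ...rest] = line.split(":");
    const key = field.trim().toLowerCase();
    const value = rest.join(":").trim();
    if (key === "user-agent") {
      // A User-agent line after rules starts a new group
      if (inRules) {
        groups.push({ agents, disallowRoot });
        agents = [];
        disallowRoot = false;
        inRules = false;
      }
      agents.push(value.toLowerCase());
    } else if (key === "disallow" || key === "allow") {
      inRules = true;
      if (key === "disallow" && value === "/") disallowRoot = true;
    }
  }
  if (agents.length) groups.push({ agents, disallowRoot });

  const ua = userAgent.toLowerCase();
  const exact = groups.find((g) => g.agents.includes(ua));
  if (exact) return exact.disallowRoot;
  const wildcard = groups.find((g) => g.agents.includes("*"));
  return wildcard ? wildcard.disallowRoot : false; // no rules = full access
}

// Example using the shape of the recommended config above:
const robots = `
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
`;

console.log(isRootDisallowed(robots, "GPTBot"));       // training bot
console.log(isRootDisallowed(robots, "ChatGPT-User")); // retrieval bot
```

Running this against your deployed /robots.txt (fetched however you like) catches the classic mistake of blocking a retrieval bot by accident.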

Common mistakes

Blocking GPTBot and thinking you blocked ChatGPT search

GPTBot and ChatGPT-User are separate user agents. GPTBot is for training. ChatGPT-User is for live search queries. Blocking GPTBot does not remove you from ChatGPT search results.

Using a blanket wildcard block

Adding a User-agent: * group with Disallow: / blocks every crawler, including Googlebot. Never do this unless you genuinely want zero search traffic.
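To make the anti-pattern concrete, this is the group to avoid shipping:

```
# Do NOT use this: it blocks every crawler, including Googlebot
User-agent: *
Disallow: /
```

If you see this in a robots.txt that is supposed to allow search traffic, it is almost always a staging-environment rule that leaked into production.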

Not having a robots.txt at all

If there is no robots.txt, all bots, training and retrieval alike, assume they have full access. That is fine for retrieval bots, but it leaves your entire site open to training scrapers.

Pair robots.txt with llms.txt

While robots.txt controls who can crawl, llms.txt tells AI models what your site is about in plain language. It is a simple text file at your root that helps ChatGPT, Perplexity, and Gemini understand your product without parsing your entire HTML. Think of robots.txt as the bouncer and llms.txt as the welcome mat.
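As a rough illustration of the convention (the product name, URLs, and descriptions below are placeholders, not a prescribed format), an llms.txt is typically a short markdown-style file at your root:

```
# ExampleApp

> ExampleApp audits marketing sites for technical SEO issues. (Placeholder: describe your product in one or two sentences.)

## Docs

- [Getting started](https://yoursite.com/docs/getting-started): how to run your first scan
- [Pricing](https://yoursite.com/pricing): plans and limits

## Optional

- [Changelog](https://yoursite.com/changelog): recent releases
```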

FAQ

Should I block GPTBot in robots.txt?

It depends on your goals. GPTBot is used by OpenAI to train future models. If you do not want your content used for training, block it. But note that GPTBot is separate from ChatGPT-User, which powers ChatGPT search. You can block one without blocking the other.

What happens if I block all AI bots in robots.txt?

You prevent your content from appearing in AI-powered search results (ChatGPT, Perplexity, Gemini). This means fewer citations, less referral traffic, and reduced visibility in the fastest-growing search surfaces. Only block training bots, not retrieval bots.

Does blocking AI crawlers affect my Google rankings?

No. Googlebot is separate from AI training crawlers. Blocking GPTBot, CCBot, or ClaudeBot has zero effect on your Google search rankings. However, blocking GoogleOther may affect your visibility in Google AI Overviews.

How do I check if AI bots are crawling my site?

Check your server access logs for user agent strings like GPTBot, ClaudeBot, CCBot, PerplexityBot, and ChatGPT-User. You can also use SEOLint to scan your robots.txt and flag any misconfigured AI bot rules.
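If you want to quantify the crawling rather than eyeball it, a few lines of code can tally bot hits from your logs. This sketch assumes nothing about your log format beyond the user agent string appearing somewhere in each line; the matching is naive substring search, so a bot name embedded in another string would overcount.

```typescript
// Count AI-bot hits in access log lines (illustrative sketch; the log
// format below is fabricated, adapt the matching to your server's layout).
const AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "ChatGPT-User"];

function countBotHits(logLines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bot of AI_BOTS) counts[bot] = 0;
  for (const line of logLines) {
    for (const bot of AI_BOTS) {
      // Naive substring match on the whole line, including the UA field
      if (line.includes(bot)) counts[bot] += 1;
    }
  }
  return counts;
}

// Example with two fabricated log lines:
const sample = [
  '1.2.3.4 - - [01/Jan/2025] "GET /blog HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
  '5.6.7.8 - - [01/Jan/2025] "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
];
console.log(countBotHits(sample));
```

A sudden spike in training-bot hits after you publish new content is a good prompt to revisit your Disallow rules.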

Check your robots.txt automatically

SEOLint scans your site and flags misconfigured AI bot rules, missing sitemaps, and more.