AI Bot Rules
Rules in robots.txt and HTTP headers to control AI crawler behavior.
AI bot rules are instructions that let website owners control AI crawler behavior. This is done primarily via robots.txt, but also via HTTP headers and meta tags. With these rules you determine which AI bots may index, scrape, or use your content for training.
Known AI bots
The most important AI bots are: GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Google-Extended (Google AI training), Applebot-Extended (Apple Intelligence), CCBot (Common Crawl), and Amazonbot. Each bot has its own user-agent string.
Strategic choices
Do you block AI bots entirely, or allow them selectively? Many businesses choose a middle ground: they allow crawling for AI visibility but block training-specific bots. The right strategy depends on your goals: do you want to be cited by AI, or do you want to protect your content?
Reference table: known AI bots
| User-agent | Owner | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Web crawling for ChatGPT and AI products | Yes |
| OAI-SearchBot | OpenAI | ChatGPT Search (real-time search results) | Yes |
| ChatGPT-User | OpenAI | Fetching pages when a user shares a URL in ChatGPT | Yes |
| ClaudeBot | Anthropic | Training and improvement of Claude models | Yes |
| PerplexityBot | Perplexity AI | Real-time search results in Perplexity | Yes |
| Google-Extended | Google | AI training (Gemini), not for regular Google Search | Yes |
| Googlebot | Google | Regular search index (incl. AI Overviews) | Yes |
| Applebot-Extended | Apple | Apple Intelligence and Siri training | Yes |
| Applebot | Apple | Siri and Spotlight suggestions | Yes |
| CCBot | Common Crawl | Open dataset, used by many AI models for training | Yes |
| Amazonbot | Amazon | Alexa answers and Amazon AI products | Yes |
| Bytespider | ByteDance | TikTok search and AI training | Partially |
| FacebookBot | Meta | Content preview and AI training | Yes |
| Diffbot | Diffbot | Structured data extraction for AI Knowledge Graphs | Yes |
| cohere-ai | Cohere | Training of Cohere's language models | Yes |
| anthropic-ai | Anthropic | Web research for Claude | Yes |
Robots.txt templates for AI bots
Strategy 1: Allow all (maximum AI visibility)
# Allow all AI bots for maximum visibility
# in AI answers and search results
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: CCBot
Allow: /
User-agent: Amazonbot
Allow: /
Strategy 2: Selective access (balance visibility/protection)
# Allow AI search engines, block training bots
# Balance between visibility in AI answers
# and protection against unauthorized training
# Allow: bots that cite your content with source attribution
User-agent: GPTBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Amazonbot
Allow: /
# Block: bots that primarily train without citation
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
Strategy 3: Block all (maximum content protection)
# Block all known AI bots
# Note: this significantly reduces your visibility
# in AI answers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
What does our scanner check?
The scanner analyzes your robots.txt for AI bot rules. We check which AI crawlers you allow and which you block, and whether you have a deliberate strategy (rather than no rules at all). This is part of both the AEO score and the Agent Readiness score.
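A check like this can be approximated with Python's standard-library robots.txt parser. The sketch below is my own simplification, not the scanner's actual logic: it classifies a handful of well-known AI bots as allowed or blocked for the site root. Note that a bot with no rules at all comes out as "allowed", since robots.txt permits by default:

```python
from urllib.robotparser import RobotFileParser

# A small sample of the AI bots discussed above.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]

def audit_robots_txt(robots_txt: str) -> dict[str, str]:
    """Label each known AI bot as 'allowed' or 'blocked' for the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Bots without a matching group (and no restrictive * group)
    # default to allowed, per the robots.txt convention.
    return {
        bot: "allowed" if parser.can_fetch(bot, "/") else "blocked"
        for bot in AI_BOTS
    }

example = """\
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /
"""
report = audit_robots_txt(example)
```

Running the audit on the example yields `GPTBot: allowed`, `CCBot: blocked`, and `allowed` for the three bots the file never mentions.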
Frequently asked questions
Should I allow or block AI bots?
This depends on your goals. If you want to be cited in AI answers (AEO), allow at minimum GPTBot, PerplexityBot, and ClaudeBot. If you want to protect your content from AI training without citation, block Google-Extended, CCBot, and Bytespider. Most businesses choose a middle ground: allow search-related bots, block training-only bots.
Do AI bots actually respect robots.txt?
Major AI companies (OpenAI, Anthropic, Google, Perplexity) respect robots.txt. This is in their own interest: if their bots ignored robots.txt, site owners would respond by blocking them outright, harming the ecosystem they depend on. Smaller or lesser-known bots are less reliable. Robots.txt is a convention, not legal protection.
Can I block AI bots with HTTP headers instead of robots.txt?
Yes. You can use the X-Robots-Tag HTTP header for specific pages or file types, which provides more granular control than robots.txt (robots.txt only works at the path level). Be aware that "noai" and "noimageai" are non-standard, community-proposed directives: some crawlers honor them, but unlike "noindex" they are not universally supported. The meta tag <meta name="robots" content="noai"> works similarly at the page level, with the same caveat.
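As a sketch, a per-path header policy might look like this in Python (the path prefixes and the `build_headers` helper are hypothetical examples; again, "noai"/"noimageai" are non-standard directives that only some crawlers recognize):

```python
def build_headers(path: str) -> dict[str, str]:
    """Choose an X-Robots-Tag value per path prefix (illustrative policy)."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if path.startswith("/research/"):
        # Non-standard directives: signal "do not use for AI training"
        # to the crawlers that recognize them.
        headers["X-Robots-Tag"] = "noai, noimageai"
    elif path.startswith("/private/"):
        # Standard directives, widely respected by search engines.
        headers["X-Robots-Tag"] = "noindex, nofollow"
    return headers
```

In practice you would set such headers in your web server or framework configuration rather than in application code, but the decision logic is the same.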
What if I have no AI bot rules in my robots.txt?
If you have no specific rules for AI bots, they follow the default User-agent: * rules. If you have no restrictions there either, all bots (including AI bots) may crawl your entire site. It's wise to make a deliberate choice and document it in your robots.txt.
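This fallback behavior can be verified with Python's standard-library parser: a bot with no group of its own inherits the User-agent: * rules (the robots.txt body below is a made-up example):

```python
from urllib.robotparser import RobotFileParser

# No AI-specific groups: only a default group with one restriction.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot has no group of its own, so it falls back to the * rules:
blocked = parser.can_fetch("GPTBot", "/private/report.html")  # False
allowed = parser.can_fetch("GPTBot", "/blog/post.html")       # True
```

Adding an explicit `User-agent: GPTBot` group would override the default entirely for that bot, which is why a deliberate, documented choice beats relying on the fallback.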
Do AI bot user agents change regularly?
Major AI companies document their user agents and announce changes in advance. However, new bots are regularly added as more companies launch AI products. It's advisable to review your robots.txt at least quarterly and add new AI bots to your policy.