Robots.txt for AI: more than just crawl instructions
Robots.txt in the AI era
The robots.txt file has existed since 1994 and was originally designed as a simple mechanism to tell web crawlers which parts of your website they may and may not visit. More than thirty years later, this modest text file has gained an entirely new dimension: it is both the first line of defense and the gateway for AI bots that want to read and process your content.
The explosive growth of AI models has led to a new generation of crawlers. Alongside the well-known Googlebot and Bingbot, GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot and Google-Extended (for Gemini training) now visit millions of websites daily. How you configure your robots.txt directly determines whether these AI systems can index your content and use it as a source.
Robots.txt is one of the fundamental building blocks of Answer Engine Optimization (AEO). Without correct configuration, even the best-written content can remain invisible to AI models. In combination with an llms.txt file, it forms the foundation of your technical AI strategy.
Key AI bots and their user agents
To correctly configure your robots.txt, you need to know which AI bots exist and how they identify themselves. Here are the most important AI crawlers you should know in 2026.
- GPTBot (OpenAI): crawls for ChatGPT's browsing function and for training data. User agent: GPTBot.
- ChatGPT-User (OpenAI): the browsing agent that fetches pages in real time during a conversation. User agent: ChatGPT-User.
- ClaudeBot (Anthropic): crawls for Claude's knowledge base. User agent: ClaudeBot.
- PerplexityBot: crawls for Perplexity's real-time search answers. User agent: PerplexityBot.
- Google-Extended (Google): the separate user agent Google uses for AI training (Gemini). User agent: Google-Extended.
- Bytespider (ByteDance): crawls for TikTok's AI services. User agent: Bytespider.
- CCBot (Common Crawl): open dataset widely used for AI training. User agent: CCBot.
- Applebot-Extended (Apple): crawls for Apple Intelligence features. User agent: Applebot-Extended.
- Meta-ExternalAgent (Meta): crawls for Meta AI products. User agent: Meta-ExternalAgent.
Training versus browsing: a crucial distinction
Not all AI crawlers are equal: there are two fundamentally different categories. Training crawlers collect data to train AI models; they read your content once and use it to improve the model. Browsing agents fetch your content in real time when a user asks a question; these agents are directly responsible for citations in AI answers.
This distinction is crucial for your strategy. If you want to appear in AI answers, you must allow browsing agents. Blocking training crawlers has less direct impact on your visibility, although it may influence how well AI models know your domain in the longer term.
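Expressed in robots.txt terms, the distinction looks like this (an illustrative fragment; adjust the user agents to your own policy):

```
# Browsing agent: allowed, so your pages can be cited in real-time answers
User-agent: ChatGPT-User
Allow: /

# Training crawler: blocked if you do not want your content used for model training
User-agent: CCBot
Disallow: /
```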
Dive deeper: llms.txt: the robots.txt for AI models | MCP Servers for AI agents | How AI models use your content
A complete robots.txt configuration
Below you will find an example of a robots.txt configuration that deliberately manages AI bots. This configuration allows the most important AI crawlers to index your public content, while privacy-sensitive sections are shielded.
# Standard crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI crawlers: access to public content
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /products/
Disallow: /account/
Disallow: /admin/
Disallow: /api/internal/
User-agent: ChatGPT-User
Allow: /
Disallow: /account/
Disallow: /admin/
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Allow: /products/
Disallow: /account/
Disallow: /admin/
Disallow: /api/internal/
User-agent: Google-Extended
Allow: /blog/
Allow: /docs/
Disallow: /account/
Disallow: /admin/
User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Allow: /products/
Disallow: /account/
Disallow: /admin/
# Restrict AI training crawlers
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
# All other bots
User-agent: *
Allow: /
Disallow: /account/
Disallow: /admin/
# Sitemap and llms.txt
Sitemap: https://www.example.com/sitemap.xml
Llms-Txt: https://www.example.com/llms.txt
The logic behind this configuration
In this example, we make a deliberate choice per bot type. Browsing agents (ChatGPT-User, PerplexityBot) that fetch content during a user conversation get broader access, because citation in real-time answers delivers direct value. Training crawlers (Bytespider, CCBot) are blocked because they use content for model training without returning direct visibility.
Crawl-delay and rate limiting
Some AI crawlers are more aggressive than traditional bots. If you notice an AI crawler overloading your server, you can use the Crawl-delay directive to throttle the pace. Note: Crawl-delay is a non-standard extension that not all bots respect (Googlebot, for example, ignores it entirely), but many well-behaved crawlers do honor it.
# Rate limiting for AI crawlers
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Crawl-delay: 2
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Crawl-delay: 2
Common mistakes in AI bot configuration
When configuring robots.txt for AI bots, we regularly see the same mistakes that unintentionally make businesses invisible to AI systems.
- Blocking all AI bots with a wildcard: some businesses reflexively block all AI bots, but this makes you invisible in ChatGPT, Claude, Perplexity and other AI search results.
- Not distinguishing between training and browsing: block training crawlers if you wish, but allow browsing agents if you want to appear in AI answers.
- Not updating robots.txt: new AI bots appear regularly. Check at least every quarter whether there are new user agents you need to configure.
- Not including an Llms-Txt directive: the Llms-Txt line in robots.txt helps AI crawlers discover your llms.txt file.
- Inconsistency between robots.txt and meta robots tags: ensure your robots.txt rules and your HTML meta robots tags do not conflict with each other.
The "block everything" trap
A reaction we often see from businesses hearing about AI crawlers for the first time is panic: "Block everything!" This is understandable from a privacy perspective, but strategically often a poor choice. By blocking all AI crawlers, you make yourself invisible to ChatGPT, Perplexity and other AI search engines. Your competitor who doesn't do this takes your place as a cited source. The better approach is a nuanced configuration: allow browsing, restrict training, protect sensitive data.
Robots.txt is based on trust, not enforcement. Not all AI crawlers respect robots.txt. It is nevertheless essential to configure it correctly, because the major, trustworthy AI platforms (OpenAI, Anthropic, Google) do respect the rules.
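For bots that ignore robots.txt, the only real option is server-side enforcement based on the user-agent header. A minimal sketch of that idea as WSGI middleware, where the blocked list is purely illustrative and should match your own policy:

```python
# Sketch: server-side enforcement for bots that ignore robots.txt.
# The user-agent substrings below are illustrative; adjust to your policy.
BLOCKED_UA_SUBSTRINGS = ("Bytespider", "CCBot")

def block_unwanted_bots(app):
    """WSGI middleware that returns 403 Forbidden for blocked user agents."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in user_agent for bot in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Note that user-agent strings can be spoofed, so for stricter enforcement you would combine this with IP-range checks against the ranges the major AI providers publish.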
Testing and monitoring robots.txt
After modifying your robots.txt, it is important to verify that the configuration works correctly. Use Google Search Console to check whether Googlebot and Google-Extended can reach the right pages. Monitor your server log files to see which AI bots visit your website and whether they respect the robots.txt rules.
Also consider a monitoring tool that alerts you if your robots.txt is accidentally modified. An inadvertently removed Allow rule can cause you to become invisible in AI answers overnight.
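A lightweight way to catch accidental changes is to fingerprint the file's contents and compare against the last known value on a schedule. A sketch using only the standard library (the URL is a placeholder; how you store the previous fingerprint and send the alert is up to you):

```python
import hashlib
import urllib.request

def robots_fingerprint(content: bytes) -> str:
    """Return a stable SHA-256 fingerprint of a robots.txt body."""
    return hashlib.sha256(content).hexdigest()

def has_changed(current: bytes, previous_fingerprint: str) -> bool:
    """True if the robots.txt body no longer matches the stored fingerprint."""
    return robots_fingerprint(current) != previous_fingerprint

def fetch_robots(url: str = "https://www.example.com/robots.txt") -> bytes:
    # Placeholder URL; point this at your own domain.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()
```

Run this from a cron job and alert whenever `has_changed` returns True; reviewing the diff takes seconds, while an unnoticed removed Allow rule can cost weeks of AI visibility.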
Useful testing methods
- Google Search Console: use the URL inspection tool to check whether pages are accessible to Googlebot and Google-Extended.
- Server log analysis: search your access logs for AI crawler user agents and verify they are accessing the correct pages.
- Robots.txt testers: online validators and the robots.txt report in Google Search Console can detect syntax errors.
- Manual verification: open your robots.txt in the browser (yourdomain.com/robots.txt) and check whether the rules are correct.
- Our AEO scanner: automatically tests whether the right AI bots are configured and provides improvement recommendations.
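You can also run these checks programmatically with Python's standard library, which is handy for regression-testing a robots.txt before deploying it. A small sketch (the rules and URLs in the usage below are illustrative):

```python
import urllib.robotparser

def check_access(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt rules."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with rules that allow GPTBot only under /blog/, `check_access(rules, "GPTBot", "https://example.com/blog/post")` should return True while the same check for an /admin/ URL returns False. `RobotFileParser` can also load a live file via `set_url()` and `read()`.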
Robots.txt and its connection to other AI standards
Robots.txt does not stand alone. It works together with other standards to give AI systems a complete picture. Your llms.txt file tells AI models which content is most important. Your MCP server provides structured access to functionality. And your OAuth discovery endpoints handle secure authentication. Together, these elements form the complete agent-ready stack.
The future: robots.txt and the agent standard
The robots.txt standard is evolving. There are proposals for extensions specifically aimed at AI agents, such as specifying rate limits per bot, indicating data usage rights and defining authorization for automated actions. By already being deliberate about your robots.txt configuration for AI, you are prepared for these developments.
Key takeaways
- Robots.txt is the first gateway in the AI era that determines whether AI models can read and cite your content.
- Deliberately distinguish between browsing agents (which determine your visibility in AI answers) and training crawlers (which collect data for model training).
- Configure each important AI bot separately with targeted Allow and Disallow rules instead of blocking or allowing everything.
- Always add an Llms-Txt directive to your robots.txt so AI crawlers can discover your llms.txt file.
- Monitor and test your robots.txt regularly, as new AI bots appear constantly and a mistake can make you invisible.
Frequently asked questions
Should I allow all AI bots on my website?
Not necessarily. The best approach is a nuanced strategy. Allow browsing agents (ChatGPT-User, PerplexityBot, ClaudeBot) so you can appear in AI answers. Consider restricting training crawlers if you want to maintain control over how your content is used for model training. Block crawlers that you know provide no value, such as Bytespider.
How do I know which AI bots visit my website?
The most reliable method is analyzing your server log files. Search for user agents that match known AI crawlers. Tools like GoAccess or AWStats can visualize this. You can also look at traffic from known AI referrers in Google Analytics 4, although not all AI traffic is visible there.
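A simple substring scan over your access log already gives a useful picture. A sketch in Python (the user-agent list mirrors the bots discussed above; extend it as new crawlers appear):

```python
import collections

# User-agent substrings for known AI crawlers (extend as new bots appear).
AI_BOTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "Bytespider", "CCBot", "Applebot-Extended")

def count_ai_hits(log_lines):
    """Count access-log lines per AI crawler, matched by user-agent substring."""
    counts = collections.Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                counts[bot] += 1
                break  # attribute each line to at most one bot
    return counts
```

Feed it the lines of your access log (e.g. `count_ai_hits(open("/var/log/nginx/access.log"))`) and you immediately see which AI crawlers visit you and how often.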
What happens if I don't have a robots.txt?
Without a robots.txt, all crawlers (including AI bots) may access your entire website. This is not necessarily bad if you want AI models to use your content. The downside is that you have no control over which sections are or are not indexed. Sensitive sections such as admin pages and internal API endpoints are then also visible.
Can I use robots.txt to protect specific pages?
Robots.txt is a request, not enforcement. Trustworthy bots respect it, but malicious bots do not. Never use robots.txt as a security measure for truly sensitive data. Combine it with authentication, IP restrictions and server-side access control for real security.
How often should I update my robots.txt for AI bots?
Check your robots.txt at least every quarter. The AI landscape changes rapidly: new bots appear, existing bots change names and new standards are introduced. Set a reminder or include it in your regular website maintenance cycle.
Robots.txt is the gateway to your website. In the AI era, it determines not only who gets in, but also who may cite your content in answers to millions of users.
How does your website score on AI readiness?
Get your AEO score within 30 seconds and discover what you can improve.