TECHNICAL SEO

Robots.txt

A file in a website's root that instructs crawlers which pages they may and may not visit.

Bas Vermeer
SEO/AEO Specialist

Robots.txt is a text file located in a website's root (example.com/robots.txt) that instructs web crawlers which parts of the site they may crawl. It follows the Robots Exclusion Protocol, a standard from 1994 that still forms the basis for communication between websites and bots.

How does robots.txt work?

The file contains rules per user-agent (bot). You can block specific paths with Disallow or allow them with Allow. Additionally, you can reference your sitemap. Bots are not required to respect robots.txt, but all major search engines do.
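A minimal file with a single rule group might look like this (the paths and domain are purely illustrative):

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml

A bot looks for the group matching its own user-agent and falls back to the * group if none does.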

Robots.txt and AI bots

With the rise of AI bots like GPTBot and ClaudeBot, robots.txt is more relevant than ever. You can determine per bot which content is accessible for AI training and scraping. This is a core part of your AI bot strategy.

Complete robots.txt template

Below is a comprehensive robots.txt example for a business website, including rules for AI bots:

# ==============================================
# Robots.txt for business website
# ==============================================

# Default rule: all regular bots welcome
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=

# Google
User-agent: Googlebot
Allow: /

# Bing
User-agent: Bingbot
Allow: /

# --- AI bot rules ---

# OpenAI GPTBot (crawler used to gather content for AI training)
User-agent: GPTBot
Allow: /blog/
Allow: /knowledge-base/
Disallow: /

# OpenAI ChatGPT-User (live browsing in conversations)
User-agent: ChatGPT-User
Allow: /

# Anthropic ClaudeBot
User-agent: ClaudeBot
Allow: /blog/
Allow: /knowledge-base/
Disallow: /

# Google Extended (AI training, not search engine)
User-agent: Google-Extended
Disallow: /

# Common Crawl (dataset for AI training)
User-agent: CCBot
Disallow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Meta AI crawler
User-agent: FacebookBot
Disallow: /

# Bytedance AI crawler
User-agent: Bytespider
Disallow: /

# Apple AI crawler
User-agent: Applebot-Extended
Disallow: /

# Sitemap reference
Sitemap: https://example.com/sitemap.xml

Common mistakes

  • Accidentally blocking the entire site: Disallow: / under User-agent: * blocks all crawlers and can remove your entire site from the index.
  • No robots.txt present: without a robots.txt, the server returns a 404, which means bots may crawl everything. Always create an explicit file, even if you want to allow everything.
  • Typos or spacing errors: a misspelled directive like Dissallow is silently ignored, so the rule simply never applies. Directive names are case-insensitive, but path values are case-sensitive: /Blog/ does not match /blog/.
  • "Hiding" sensitive URLs via robots.txt: robots.txt is publicly readable. Never rely on robots.txt alone to block sensitive pages; use authentication or noindex.
  • Forgetting the sitemap path: always reference your sitemap in robots.txt. This is one of the primary ways search engines discover your sitemap.
  • Conflicting rules: if you have Disallow: / and Allow: /blog/, the most specific (longest-matching) rule wins, but order and specificity can be confusing; see the sketch after this list for a quick way to test how a parser resolves them.
  • Ignoring AI bots: many sites still have no rules for GPTBot, ClaudeBot, and other AI crawlers. Make an explicit decision about which AI bots you want to allow.
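One way to sanity-check rules before deploying them is to run them through a parser. Below is a rough sketch using Python's standard-library urllib.robotparser with illustrative rules and URLs; note that this parser ignores wildcard patterns (such as /*?sort=) and resolves conflicts by rule order rather than by longest matching path, so it approximates rather than exactly reproduces how Googlebot evaluates the file.

from urllib import robotparser

# Illustrative rules; in practice you would call set_url() and read()
# to fetch the live file from https://example.com/robots.txt.
rules = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Check which URLs GPTBot may fetch under these rules.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))    # False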

Frequently asked questions

Is robots.txt binding for bots?

No, robots.txt is a directive, not an enforceable security mechanism. All reputable search engines and AI bots respect robots.txt, but malicious bots can ignore it. Use it for crawl management, not security.

Does robots.txt also block indexing?

Not necessarily. If a blocked URL receives links from other sites, Google can still index the URL (without knowing its content). Use a noindex meta tag if you want complete de-indexation.
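For example, a page you want kept out of the index entirely would carry this tag in its <head> (a generic HTML snippet, not tied to any particular CMS):

<meta name="robots" content="noindex">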

How quickly do bots pick up robots.txt changes?

Google typically caches robots.txt for 24 hours. After a change, it may take a day for the new rules to become active. You can request a re-check via Google Search Console.

Should I block AI bots?

That depends on your strategy. If you want AI chatbots to cite your content, allow them. If you want to protect your content from AI training, block bots like GPTBot and Google-Extended. A common middle ground is allowing search/browse bots (ChatGPT-User, PerplexityBot) while blocking training bots.

What is the difference between Disallow and noindex?

Disallow in robots.txt prevents a bot from crawling the page, but does not prevent the URL from being indexed if other sites link to it. noindex (as a meta tag or X-Robots-Tag header) instructs a bot not to index the page, but the bot can only see that instruction if it is allowed to crawl the page. So don't combine the two on the same URL: a Disallow would hide the noindex from the crawler. Use Disallow to manage what gets crawled and noindex to keep crawlable pages out of the index.
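As an illustration, the header variant of noindex looks like this in a raw HTTP response; how you set it depends on your server (in nginx, for instance, via an add_header directive):

HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noindex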

What does our scanner check?

Our scanner checks whether your website has a valid robots.txt file, whether it can be correctly parsed, and whether there are specific rules for AI bots (GPTBot, ClaudeBot, Google-Extended, and more). We analyze whether you've made a deliberate decision about AI bot access. Test your robots.txt configuration.
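As an illustration of this kind of check (not our scanner's actual implementation), the sketch below fetches a robots.txt file and reports which well-known AI crawlers have an explicit User-agent group; the domain and bot list are placeholders to adapt.

from urllib.request import urlopen

AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot"]

# Fetch the live robots.txt (replace the domain with your own).
with urlopen("https://example.com/robots.txt", timeout=10) as resp:
    body = resp.read().decode("utf-8", errors="replace")

# Collect every user-agent that has an explicit rule group.
agents = {
    line.split(":", 1)[1].strip().lower()
    for line in body.splitlines()
    if line.lower().startswith("user-agent:")
}

# Bots without an explicit group fall back to the * rules.
for bot in AI_BOTS:
    verdict = "explicit rule group" if bot.lower() in agents else "no explicit rules (falls back to *)"
    print(f"{bot}: {verdict}")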

RELATED TERMS

Crawling

The automated scanning of websites by search engines and AI bots to discover content.

Reinier Sierag

Indexing

The storing and cataloging of web content by search engines so it becomes findable.

Bas Vermeer

RELATED SCANNER CHECKS

robots.txt present


Bas Vermeer

SEO/AEO Specialist

My career started by manually combing through server log files. I wanted to understand how Googlebot crawls websites. That fascination with the technical side of discoverability? Never faded. At Koba...