AEO for content publishers: protecting your business model

Marieke van Dale, Content & AI Specialist

The content publisher's dilemma

Content publishers, from news media and trade publications to niche blogs and knowledge platforms, find themselves in a paradoxical situation. AI models use their content to generate answers, which provides visibility. But those same models can also reduce visits to the original source, putting the business model (advertising, subscriptions, lead generation) under pressure.

This is the central tension of AEO for publishers: how do you ensure AI models know and cite your content, without giving away so much information that the reader has no reason to visit your website? Finding this balance requires a combination of technical measures, smart content structuring and strategic choices about what you do and do not share with AI crawlers. We discussed the basics of this trade-off in our article on robots.txt for AI.

In this article we cover the practical tools and strategies available to content publishers to protect their business model while simultaneously benefiting from AI visibility.

Robots.txt strategy for publishers

The first line of defense is your robots.txt file. This determines which AI crawlers have access to your content and which do not. For publishers this is not an all-or-nothing choice: you can differentiate per crawler and even per content type.

# Strategy: differentiate per AI crawler and content type

# Google AI (Gemini) - allow for AI Overviews visibility
User-agent: Google-Extended
Allow: /articles/
Allow: /opinion/
Disallow: /premium/
Disallow: /archive/

# OpenAI (ChatGPT) - selectively allow
User-agent: GPTBot
Allow: /articles/
Allow: /opinion/
Disallow: /premium/
Disallow: /research/
Disallow: /archive/

# Perplexity - allow (best traffic return)
User-agent: PerplexityBot
Allow: /articles/
Allow: /opinion/
Allow: /research/
Disallow: /premium/

# Anthropic (Claude) - limited access
User-agent: ClaudeBot
Allow: /articles/
Disallow: /premium/
Disallow: /research/
Disallow: /archive/

# Block all training-specific bots
User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

IMPORTANT

Distinguish between crawlers that cite content (and thus send traffic back) and crawlers that only use content for training. Perplexity sends the most measurable traffic back through citations. Crawlers like CCBot are primarily used for training data and do not return traffic.

Segmenting your content for AI access

Not all content on your platform has the same value and purpose. An effective strategy is segmenting your content into layers with different AI access levels.

  • Freely available: opinion articles, how-to content, definition articles. This content builds authority and generates AI citations that drive traffic back.
  • Selectively available: news articles, analyses. Available for citation crawlers (Perplexity, Google-Extended) but not for training crawlers.
  • Restricted: premium content, in-depth research, exclusive interviews. Blocked for all AI crawlers via robots.txt.
  • Teaser content: summaries of premium content that give AI models enough context to reference, but not enough to provide the complete answer.
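The tiering above can be defined once and compiled into per-crawler robots.txt blocks. A minimal sketch in Python; the tier names, section paths and crawler-to-tier mapping are illustrative assumptions to adapt to your own site:

```python
# Sketch: generate differentiated robots.txt blocks from a content-tier map.
# Crawler names match the examples above; paths and tiers are illustrative.

# Which crawlers may access which tiers
CRAWLER_ACCESS = {
    "Google-Extended": {"free"},
    "PerplexityBot":   {"free", "citation"},
    "GPTBot":          {"free"},
    "CCBot":           set(),  # pure training crawler: no access at all
}

# Which site sections belong to which tier
TIER_PATHS = {
    "free":       ["/articles/", "/opinion/"],
    "citation":   ["/research/"],
    "restricted": ["/premium/", "/archive/"],
}

def build_robots_txt(access=CRAWLER_ACCESS, tiers=TIER_PATHS):
    blocks = []
    for agent, allowed in access.items():
        lines = [f"User-agent: {agent}"]
        if not allowed:
            lines.append("Disallow: /")
        else:
            for tier, paths in tiers.items():
                directive = "Allow" if tier in allowed else "Disallow"
                lines += [f"{directive}: {p}" for p in paths]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

print(build_robots_txt())
```

Keeping the mapping in one place means a new crawler or a reclassified section is a one-line change instead of a hand-edited robots.txt.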

TDM headers: control over text and data mining

The EU has created a specific legal framework for text and data mining (TDM) through the DSM Directive (Directive (EU) 2019/790 on copyright in the Digital Single Market). Through TDM Reservation Protocol headers, publishers can indicate which rights they reserve regarding the use of their content for AI training.

# HTTP response headers for TDM control

# Option 1: All TDM rights reserved
TDM-Reservation: 1

# In combination with a TDM policy page
TDM-Policy: https://example.com/tdm-policy.json

The TDM-Reservation header with value 1 indicates that you as a publisher reserve your text and data mining rights. This is particularly relevant for AI companies that use content to train their models. Combined with a TDM-Policy page you can specify in detail under which conditions TDM is permitted.
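On the server side, these headers can be attached per response. A minimal Python sketch, assuming a simple rule where premium sections carry only the blanket reservation; the path rule and policy URL are placeholders:

```python
# Sketch: choose TDM response headers per request path.
# Header names follow the TDM Reservation Protocol; the premium-path rule
# and policy URL are illustrative placeholders.

TDM_POLICY_URL = "https://example.com/tdm-policy.json"

def tdm_headers(path: str) -> dict:
    """Return the TDM headers to attach to the response for `path`."""
    headers = {"TDM-Reservation": "1"}  # reserve TDM rights everywhere
    if not path.startswith("/premium/"):
        # Public sections also point to the policy that spells out the
        # conditions under which mining is permitted
        headers["TDM-Policy"] = TDM_POLICY_URL
    return headers
```

Hooked into middleware or a reverse-proxy rule, this gives crawlers a machine-readable answer before they parse a single page.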

Setting up the TDM policy

// tdm-policy.json
{
  "@context": "https://www.w3.org/2022/tdm/",
  "@type": "TDMPolicy",
  "tdm-reservation": true,
  "tdm-policy": [
    {
      "assigner": "https://example.com",
      "assignee": "https://perplexity.ai",
      "permission": [
        {
          "action": "tdm-mine",
          "target": "https://example.com/articles/",
          "constraint": {
            "attribution": true,
            "link-back": true
          }
        }
      ]
    },
    {
      "assigner": "https://example.com",
      "assignee": "*",
      "prohibition": [
        {
          "action": "tdm-reproduce",
          "target": "https://example.com/premium/"
        }
      ]
    }
  ]
}

The legal enforceability of TDM headers is still evolving, but they are increasingly respected by responsible AI companies. By implementing them now, you establish a clear position. This fits into the broader story of protocols for communicating with AI systems.

Content strategies that combine visibility and protection

The most effective protection is not technical but strategic: structure your content so that AI citation leads to more website visits rather than fewer.

The teaser-and-depth method

For each in-depth article, publish a publicly available summary of 300 to 500 words that introduces the topic and answers the core question at a high level. The in-depth analysis, data, expert interviews and practical tools reside in the full article that is only available to subscribers or after registration.

AI models cite the summary and refer to the full article for more detail. The reader gets enough context to recognize the value, but must visit your site for the complete picture. This also strengthens your E-E-A-T signals: you demonstrate expertise in the summary and prove depth by referencing the full piece.
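The summary step can be partly automated. A rough sketch that caps the public teaser at the word range mentioned above; the naive whitespace split is a simplification, not real sentence handling:

```python
# Sketch: derive a public teaser from a full article by capping word count.
# The 500-word cap mirrors the guideline above; splitting on whitespace is
# a simplification (real summaries should end on a sentence boundary).

def build_teaser(full_text: str, max_words: int = 500) -> str:
    words = full_text.split()
    if len(words) <= max_words:
        return full_text
    return " ".join(words[:max_words]) + " …"

article = "word " * 1200          # stand-in for a long premium article
teaser = build_teaser(article)
print(len(teaser.split()))        # 501: 500 words plus the ellipsis marker
```

In practice you would pair this with an editorial pass: the teaser must answer the core question at a high level, not simply truncate mid-argument.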

Proprietary data as a differentiator

Content based on proprietary research data, exclusive interviews or original analyses is the hardest to replace with AI-generated content. Invest in this type of content as the core of your publishing strategy.

  • Proprietary research and surveys among your audience that are not available elsewhere.
  • Exclusive interviews with industry experts sharing unique insights.
  • Data analyses and visualizations based on your own datasets.
  • Expert panels and roundtable reports with original viewpoints.
  • Annual benchmark reports that become the standard in your niche.

Schema markup for publisher content

Despite protective measures, you want the content you do share to be maximally recognized and cited. Implement comprehensive Schema.org markup on all your public articles.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Supply Chain Trends 2026: five developments that will transform the industry",
  "description": "Our annual analysis of the five most important supply chain trends, based on interviews with 50 industry experts.",
  "datePublished": "2026-04-15",
  "dateModified": "2026-04-20",
  "author": {
    "@type": "Person",
    "name": "Jan Bakker",
    "url": "https://example.com/authors/jan-bakker"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Supply Chain Magazine",
    "logo": {
      "@type": "ImageObject",
      "url": "https://example.com/logo.png"
    }
  },
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": true,
    "cssSelector": ".article-summary"
  },
  "copyrightHolder": {
    "@type": "Organization",
    "name": "Supply Chain Magazine BV"
  },
  "copyrightYear": 2026
}
</script>

TIP

The isAccessibleForFree field combined with hasPart is particularly powerful for publishers with a paywall. It explicitly tells AI models which part of the article is freely available (the summary) and that the full article is not free. Responsible AI models respect this signal.

Monitoring and enforcement

Implementing robots.txt rules and TDM headers is step one. Step two is monitoring whether AI crawlers respect your rules and taking action when they do not.

  1. Regularly analyze your server access logs for AI crawler activity. Verify that blocked bots are actually being denied.
  2. Monitor AI answers for your brand name and content. Ask the same questions your audience would ask and check if citations are correct.
  3. Document violations of your robots.txt or TDM policy for potential legal action.
  4. Maintain a contact page for AI companies where they can request licensing agreements.
  5. Consider participating in collective initiatives by publisher organizations that negotiate with AI companies.
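Step 1 can be scripted against your access logs. A minimal sketch over the common combined log format; the bot list, blocked set and regex are assumptions to adapt to your own logging setup:

```python
# Sketch: scan access-log lines for AI crawler hits and flag hits by bots
# that robots.txt fully disallows. Bot names and the combined-log-format
# regex are assumptions; adapt them to your own configuration.
import re

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Google-Extended"]
BLOCKED_BOTS = {"CCBot"}  # bots your robots.txt disallows entirely

# Captures the status code and user agent from a combined-log-format line
LINE_RE = re.compile(r'" (\d{3}) \d+ "[^"]*" "([^"]*)"')

def audit_line(line: str):
    """Return (bot, status, violation?) if the line is an AI crawler hit."""
    m = LINE_RE.search(line)
    if not m:
        return None
    status, agent = int(m.group(1)), m.group(2)
    for bot in AI_BOTS:
        if bot in agent:
            # A fully disallowed bot successfully fetching content pages
            # is ignoring your robots.txt
            return (bot, status, bot in BLOCKED_BOTS and status == 200)
    return None

sample = ('1.2.3.4 - - [15/Apr/2026:10:00:00 +0000] "GET /articles/x HTTP/1.1"'
          ' 200 512 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"')
print(audit_line(sample))  # ('CCBot', 200, True)
```

Run over a day of logs, the flagged lines become exactly the documentation of violations that step 3 asks for.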

Key takeaways

  • Differentiate your robots.txt per AI crawler: allow citation bots (Perplexity, Google-Extended) and block pure training crawlers (CCBot).
  • Implement TDM-Reservation headers and a TDM-Policy to establish your legal position on text and data mining.
  • Use the teaser-and-depth method: public summaries for AI citation, premium content behind registration.
  • Invest in proprietary data and exclusive content that cannot be replaced by AI.
  • Use isAccessibleForFree and hasPart in your Article schema to tell AI models which content is freely available and which is not.

Frequently asked questions

Should I as a publisher block all AI crawlers?

No, completely blocking is counterproductive in most cases. You become invisible in AI answers while competitors who are visible attract the audience. The better strategy is differentiation: allow crawlers that return citation traffic (especially Perplexity) and block crawlers that only use content for training without attribution.

Are TDM headers legally enforceable?

In the EU, the DSM Directive (Article 4) provides a legal framework for reserving TDM rights. The exact enforceability varies by country and is still being developed through case law. Nevertheless, implementing TDM headers is wise: it establishes a clear declaration of intent and responsible AI companies increasingly respect these headers. It is comparable to how robots.txt is not technically enforceable but is widely respected.

How do I measure how much traffic AI citations generate?

Configure your analytics to recognize referral traffic from AI platforms. Perplexity appears as a recognizable referrer. ChatGPT browsing is harder to track but sometimes appears as "chatgpt.com" in your referrer data. Set up a custom dashboard that separates AI referral traffic from regular search traffic. Also measure indirectly: does your brand awareness or newsletter signups increase after AI citations?
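The referrer classification behind such a dashboard can be a small helper. A sketch in Python; the domain lists are assumptions and will need updating as platforms change:

```python
# Sketch: classify referrer URLs into AI platforms, search, or other.
# The domain lists are assumptions; AI referrer domains change over time.
from urllib.parse import urlparse

AI_REFERRERS = {
    "perplexity.ai": "Perplexity",
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "gemini.google.com": "Gemini",
}
SEARCH_REFERRERS = {"www.google.com", "www.bing.com", "duckduckgo.com"}

def classify_referrer(referrer_url: str) -> str:
    host = urlparse(referrer_url).netloc.lower()
    # Strip a leading "www." so bare and www hosts match the same entry
    bare = host.removeprefix("www.")
    for domain, platform in AI_REFERRERS.items():
        if host == domain or bare == domain:
            return f"ai:{platform}"
    if host in SEARCH_REFERRERS:
        return "search"
    return "other"

print(classify_referrer("https://www.perplexity.ai/search?q=aeo"))  # ai:Perplexity
```

Tagging sessions with this label at ingestion time is what lets you chart AI referral traffic separately from organic search.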

Can I negotiate with AI companies about content licenses?

Yes, and this is happening more frequently. Major publishers like the New York Times, Associated Press and News Corp have signed licensing agreements with OpenAI and other AI companies. Smaller publishers can negotiate collectively through industry organizations. Ensure you have TDM headers and a clear policy as a basis for negotiations. A contact page specifically for licensing inquiries lowers the barrier.

How do I protect my content against unauthorized use?

A combination of measures provides the best protection: robots.txt for crawl control, TDM headers for legal positioning, paywall for premium content and monitoring of AI output for enforcement. No single measure is watertight, but together they form a robust protection framework. Document everything carefully in case legal action becomes necessary.

The publishers that perform best in 2026 are not those that fully embrace AI or fully block it. They are those that employ a nuanced strategy: visible where it drives traffic, protected where the business model is at risk.
