If you manage a website for an international organization, NGO, or government agency, there's a question you should be asking right now: is your content being used to train large language models?
The honest answer for most organizations is: you don't know. And that's a problem.
The invisible harvest
Since the generative AI boom took off in late 2022, AI companies have been systematically crawling the web to feed their models. Common Crawl, the open web archive that underpins many LLM training sets, contains over 250 billion pages. Your annual report, your policy briefs, your carefully crafted programme descriptions: chances are they're already in there.
But the scraping goes beyond Common Crawl. OpenAI (GPTBot), Anthropic (ClaudeBot), ByteDance (Bytespider), and others operate their own crawlers, each with its own user-agent string, crawling behavior, and opt-out mechanism. Google is a partial exception: Google-Extended isn't a separate crawler but a robots.txt token that tells Google whether the pages Googlebot already crawls may be used to train its AI models.
For most organizations, nobody is watching.
What you can check right now
Step 1: Audit your server logs. Look for user-agent strings associated with known AI crawlers. The major ones include GPTBot, ClaudeBot, CCBot (Common Crawl), Bytespider (ByteDance), and FacebookBot (Meta). Don't look for Google-Extended here: as a robots.txt-only token, it never appears in access logs. If your IT team can pull access logs from the past 90 days, you'll have a baseline picture.
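If your logs use the common combined format (one request per line, user agent at the end), a few lines of Python are enough for a first count. A minimal sketch; the log path is a placeholder for your own server's location, and the user-agent list mirrors the one above:

```python
from collections import Counter

# Placeholder path: adjust to your server's access log location.
LOG_PATH = "/var/log/nginx/access.log"

# User-agent substrings for the crawlers listed above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "FacebookBot"]

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        for bot in AI_CRAWLERS:
            # Combined log format puts the user agent at the end of the line,
            # so a simple substring match is enough for a baseline count.
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
if not hits:
    print("No known AI crawlers found in this log.")
```

Even rough counts like these tell you whether you're looking at occasional visits or sustained harvesting.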
Step 2: Check your robots.txt. Navigate to yoursite.org/robots.txt. Does it mention any AI-specific crawlers? If your robots.txt hasn't been updated since 2023, the answer is almost certainly no. Most legacy configurations only account for traditional search engines.
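You can also script this check with Python's standard-library robots.txt parser, which reports whether each token is currently allowed to fetch your homepage. A minimal sketch, with yoursite.org standing in for your own domain as above; note that Google-Extended does belong in this list, since robots.txt is exactly where it operates:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.org"  # placeholder: your own domain
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended",
               "CCBot", "Bytespider", "FacebookBot"]

rp = RobotFileParser(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live robots.txt

for agent in AI_CRAWLERS:
    status = "allowed" if rp.can_fetch(agent, SITE + "/") else "blocked"
    print(f"{agent}: {status}")
```

If every crawler comes back "allowed", your robots.txt is silent on AI scraping, which is the default state for most legacy configurations.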
Step 3: Test what AI models know about you. Ask ChatGPT, Claude, and Gemini specific questions about your organization. If they can accurately describe your programmes, quote your reports, or summarize your policy positions, your content has likely been ingested, either directly or via third-party coverage of it. Where the interface allows, turn off web browsing first, so you're testing the model's training data rather than a live search.
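If you want to repeat these probes across many questions, or track answers over time, the same tests can be scripted against each provider's API. A hedged sketch using OpenAI's Python client; the organization name, questions, and model name are placeholders, and equivalent calls exist for Anthropic and Google:

```python
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY env var

client = OpenAI()

# Placeholder questions: substitute your own organization and documents.
QUESTIONS = [
    "What are the flagship programmes of Example Org?",
    "Summarize Example Org's most recent annual report.",
]

for q in QUESTIONS:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": q}],
    )
    print(f"Q: {q}\nA: {resp.choices[0].message.content}\n")
```

Keep the transcripts: dated records of what models say about your organization are useful evidence for the governance conversation below.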
Why this matters more than you think
This isn't just a technical curiosity. For organizations that operate in regulated environments, handle sensitive information, or maintain strict brand guidelines, uncontrolled AI ingestion creates real risks.
Your content may be reproduced out of context. Your data could surface in AI outputs attributed to your organization—or worse, misattributed. And if your content includes information about vulnerable populations, the ethical implications are serious.
The governance gap
The reason most organizations haven't addressed this is simple: nobody owns it. IT thinks it's a communications issue. Communications thinks it's an IT issue. Legal hasn't been asked. And leadership doesn't know it's happening.
This is a content governance problem, and it needs a governance solution—not just a technical patch to robots.txt.
What to do next
Start with visibility. You can't govern what you can't see. Run the three checks above and document what you find. Then bring the results to a conversation that includes IT, communications, legal, and leadership.
The organizations that act now will have a strategic advantage. Those that wait will be playing catch-up—or dealing with consequences they didn't anticipate.