If you manage a website for an international organization, NGO, or government agency, there's a question you should be asking right now: is your content being used to train large language models?
The honest answer for most organizations is: you don't know. And that's a problem.
The invisible harvest
Since the generative AI boom took off in late 2022, AI companies have been systematically crawling the web to feed their models. Common Crawl, the open web archive that underpins many LLM training sets, contains over 250 billion pages. Your annual report, your policy briefs, your carefully crafted programme descriptions: chances are they're already in there.
But the scraping goes beyond Common Crawl. OpenAI (GPTBot), Anthropic (ClaudeBot), ByteDance (Bytespider), and others operate their own crawlers, each with its own user-agent string, crawling behavior, and opt-out mechanism. Google is a partial exception: Google-Extended isn't a separate crawler but a robots.txt token that tells Google whether the pages Googlebot already crawls may be used to train its AI models.
For most organizations, nobody is watching.
What you can check right now
Step 1: Audit your server logs. Look for user-agent strings associated with known AI crawlers. The major ones include GPTBot, ClaudeBot, CCBot (Common Crawl), Bytespider (ByteDance), and FacebookBot (Meta). Don't look for Google-Extended here: as a robots.txt-only token, it never appears in access logs. If your IT team can pull access logs from the past 90 days, you'll have a baseline picture.
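If your logs use the common combined format (one request per line, user agent at the end), a few lines of Python are enough for a first count. A minimal sketch; the log path is a placeholder for your own server's location, and the user-agent list mirrors the one above:

```python
from collections import Counter

# Placeholder path: adjust to your server's access log location.
LOG_PATH = "/var/log/nginx/access.log"

# User-agent substrings for the crawlers listed above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "FacebookBot"]

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        for bot in AI_CRAWLERS:
            # Combined log format puts the user agent at the end of the line,
            # so a simple substring match is enough for a baseline count.
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
if not hits:
    print("No known AI crawlers found in this log.")
```

Even rough counts like these tell you whether you're looking at occasional visits or sustained harvesting.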
Step 2: Check your robots.txt. Navigate to yoursite.org/robots.txt. Does it mention any AI-specific crawlers? If your robots.txt hasn't been updated since 2023, the answer is almost certainly no. Most legacy configurations only account for traditional search engines.
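You can also script this check with Python's standard-library robots.txt parser, which reports whether each token is currently allowed to fetch your homepage. A minimal sketch, with yoursite.org standing in for your own domain as above; note that Google-Extended does belong in this list, since robots.txt is exactly where it operates:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.org"  # placeholder: your own domain
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended",
               "CCBot", "Bytespider", "FacebookBot"]

rp = RobotFileParser(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live robots.txt

for agent in AI_CRAWLERS:
    status = "allowed" if rp.can_fetch(agent, SITE + "/") else "blocked"
    print(f"{agent}: {status}")
```

If every crawler comes back "allowed", your robots.txt is silent on AI scraping, which is the default state for most legacy configurations.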
Step 3: Test what AI models know about you. Ask ChatGPT, Claude, and Gemini specific questions about your organization. If they can accurately describe your programmes, quote your reports, or summarize your policy positions, your content has likely been ingested, either directly or via third-party coverage of it. Where the interface allows, turn off web browsing first, so you're testing the model's training data rather than a live search.
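If you want to repeat these probes across many questions, or track answers over time, the same tests can be scripted against each provider's API. A hedged sketch using OpenAI's Python client; the organization name, questions, and model name are placeholders, and equivalent calls exist for Anthropic and Google:

```python
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY env var

client = OpenAI()

# Placeholder questions: substitute your own organization and documents.
QUESTIONS = [
    "What are the flagship programmes of Example Org?",
    "Summarize Example Org's most recent annual report.",
]

for q in QUESTIONS:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": q}],
    )
    print(f"Q: {q}\nA: {resp.choices[0].message.content}\n")
```

Keep the transcripts: dated records of what models say about your organization are useful evidence for the governance conversation below.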
Why this matters more than you think
This isn't just a technical curiosity. For organizations that operate in regulated environments, handle sensitive information, or maintain strict brand guidelines, uncontrolled AI ingestion creates real risks.
Your content may be reproduced out of context. Your data could surface in AI outputs attributed to your organization—or worse, misattributed. And if your content includes information about vulnerable populations, the ethical implications are serious.
The governance gap
The reason most organizations haven't addressed this is simple: nobody owns it. IT thinks it's a communications issue. Communications thinks it's an IT issue. Legal hasn't been asked. And leadership doesn't know it's happening.
This is a content governance problem, and it needs a governance solution—not just a technical patch to robots.txt.
What to do next
Start with visibility. You can't govern what you can't see. Run the three checks above and document what you find. Then bring the results to a conversation that includes IT, communications, legal, and leadership.
The organizations that act now will have a strategic advantage. Those that wait will be playing catch-up—or dealing with consequences they didn't anticipate.