Saturday, September 13, 2025

How to seed content into AI crawlers?


- llms.txt is a proposed way to point LLMs and AI crawlers at the content on your site that you most want them to read.

- It is a curated guide, not an opt-in/opt-out mechanism; controlling crawler access is still the job of robots.txt.


“llms.txt” is a proposed standard, introduced in September 2024 by Jeremy Howard, that lets site owners publish a machine-readable, curated guide to their most important content (docs, APIs, canonical pages, etc.) so that LLMs and AI crawlers can better understand what to ingest.
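To make the idea of a "curated guide" concrete, here is a minimal sketch of generating such a file. It assumes the Markdown layout described in the public proposal as I understand it: an H1 title, a blockquote summary, then H2 sections containing link lists. The site name, URLs, and the helper names (Link, Section, render_llms_txt) are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    title: str
    url: str
    note: str = ""  # optional short description shown after the link


@dataclass
class Section:
    heading: str
    links: list[Link] = field(default_factory=list)


def render_llms_txt(site_name: str, summary: str, sections: list[Section]) -> str:
    """Render an llms.txt-style Markdown document (assumed layout, see lead-in)."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for section in sections:
        lines.append(f"## {section.heading}")
        lines.append("")
        for link in section.links:
            entry = f"- [{link.title}]({link.url})"
            if link.note:
                entry += f": {link.note}"
            lines.append(entry)
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content, for illustration only.
example = render_llms_txt(
    "Example Docs",
    "Hypothetical product documentation, used here only for illustration.",
    [
        Section("Docs", [Link("Quickstart", "https://example.com/docs/quickstart.md",
                              "Install and first request")]),
        Section("Optional", [Link("Changelog", "https://example.com/changelog.md")]),
    ],
)
print(example)
```

The output would be saved as llms.txt at the site root. Keeping the list short and hand-picked is the point: the file is meant to highlight canonical pages, not to mirror the sitemap.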


Here are the findings so far:

Very Low Adoption Among Top Sites

Scans of the top 1,000 websites show that only about 0.3% of them (roughly three sites) have an llms.txt file.

Some community directories list hundreds of domains using it, but many of these are smaller docs sites, startups, or developer platforms.

Major LLM Providers Do Not Officially Support It Yet

A key point repeated in many sources: OpenAI, Anthropic, Google, Meta, and others have not publicly committed to parsing or respecting llms.txt in their crawling or ingestion pipelines.

For example, John Mueller (from Google) has said he is not aware of any AI services using llms.txt.  

Some Early Adopters / Use Cases

A number of documentation sites, developer platforms, and SaaS/digital product companies have published llms.txt (and sometimes llms-full.txt) on their docs or marketing domains. Examples include Cloudflare, Anthropic (for its docs), and Mintlify.

Also, tools and plugins are emerging (for WordPress, SEO tools, GitBook) to help create llms.txt files.  

Unclear Real-World Impact So Far

There is little evidence that having llms.txt helps an LLM pick up content more accurately, or that it improves traffic, retrieval, or citation by LLMs, because the major LLMs do not appear to check it. Server logs from sites that publish llms.txt also suggest that AI services are not requesting the file.
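If you want to check your own logs rather than rely on other people's reports, a rough sketch like the one below can count requests for llms.txt and llms-full.txt by AI crawlers. It assumes the common "combined" access log format, and the list of user-agent substrings is an assumption to adjust for the bots you care about. The sample lines are made up for illustration and are not real traffic.

```python
import re
from collections import Counter

# Assumed user-agent substrings for AI crawlers; extend or trim as needed.
AI_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

# Pulls the request path and user agent out of a combined-log-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

TARGET_PATHS = ("/llms.txt", "/llms-full.txt")


def count_llms_txt_hits(log_lines):
    """Count (crawler token, path) pairs for requests to the llms files."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a GET/HEAD line in the expected format
        path = match.group("path").split("?", 1)[0]  # ignore query strings
        if path not in TARGET_PATHS:
            continue
        agent = match.group("agent")
        token = next((t for t in AI_AGENT_TOKENS if t in agent), None)
        if token:
            hits[(token, path)] += 1
    return hits


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [13/Sep/2025:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [13/Sep/2025:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [13/Sep/2025:10:02:00 +0000] "GET /llms-full.txt HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
hits = count_llms_txt_hits(sample)
for (token, path), count in hits.items():
    print(token, path, count)
```

On a real server you would pass the open log file instead of the sample list. User-agent strings can be spoofed, so treat the counts as a rough signal rather than proof of what a given provider does.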

Emerging Tools & Community Momentum

Although official adoption is lacking, community interest is growing: directories of implementations, write-ups, generators, documentation, and discussion.  

Some sites also publish llms-full.txt, a more exhaustive content dump. In some documentation contexts it appears to get more crawler traffic (or at least more visits) than llms.txt alone.

