Large Language Model (LLM)-powered search engines are rapidly gaining popularity. Millions of users rely on tools such as Perplexity.ai and ChatGPT web search. Gartner predicts search engine traffic will drop by 25% by next year. Many users are abandoning traditional keyword-based search engines in favor of a ChatGPT-like interface that can respond accurately to their questions. This has put a dent in traditional practices such as Search Engine Optimization (SEO), as global search volumes are expected to drop significantly.
To help LLM-powered search engines make better use of website content, a proposal has been made to publish all content in a single markdown file called llms.txt. Given the larger context windows of newer LLMs, LLM-powered search engines can ingest and process these llms.txt files at runtime rather than parsing website content. The llms.txt file is placed at the root of your website, alongside robots.txt and sitemap.xml.
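As a minimal sketch of that root-level convention, the snippet below derives where a client might look for these files given any page URL on a site. The helper name root_file_urls and the example.com domain are illustrative assumptions; only the three file names come from the text above.

```python
from urllib.parse import urlsplit, urlunsplit

# File names mentioned in the article; all are assumed to live at the site root.
ROOT_FILES = ("robots.txt", "sitemap.xml", "llms.txt")


def root_file_urls(page_url):
    """Map each root-level file name to its absolute URL on the site of page_url."""
    parts = urlsplit(page_url)
    return {
        # Keep scheme and host, replace the path, drop query string and fragment.
        name: urlunsplit((parts.scheme, parts.netloc, "/" + name, "", ""))
        for name in ROOT_FILES
    }


urls = root_file_urls("https://example.com/docs/getting-started?lang=en")
print(urls["llms.txt"])  # https://example.com/llms.txt
```

The query string and fragment of the page URL are deliberately dropped, since they belong to the page rather than to the site root.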
Purpose of LLMs.txt
The main purpose of the llms.txt file is to provide LLM-friendly content to LLM-powered search engine providers. Today these providers have to use web crawlers or bots to scan your website content periodically, parse it, format it, and store it for retrieval, which creates a lot of waste, such as:
- Storage cost
- Higher latency when serving customers, because parsing content takes time
- Content might not be up to date, thus requiring constant polling of the site
This also puts pressure on CMS vendors and website administrators to scale their infrastructure to handle web crawlers and bots.
To help LLM-powered search engine providers use your content effectively, the llms.txt file provides all your content in LLM-friendly markdown format along with other metadata. This makes it easier for your content to be used in the generated response, which earns a citation link back to your website.
How to Produce LLMs.txt
A VitePress plugin offers an out-of-the-box toolkit that generates an llms.txt file from your website or documentation site content and adheres to the llms.txt specification. A few commercial tools can generate llms.txt once you supply the URL of your website. Some documentation and Content Management System (CMS) providers also offer the llms.txt file in addition to sitemap.xml.
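For sites without such tooling, the file can also be assembled by hand or by a small script. The sketch below is a hedged illustration, not the output of any of the tools mentioned above: it assumes the layout described in the public llms.txt proposal (an H1 title, a blockquote summary, and H2 sections listing links), and the Page type, the build_llms_txt function, and all example pages are hypothetical. A site that wants to inline full page bodies, as described earlier in this article, would append each page's markdown instead of a link line.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    description: str
    section: str  # H2 heading under which the link is grouped


def build_llms_txt(site_name, summary, pages):
    """Render a minimal llms.txt body: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {site_name}", "", f"> {summary}"]
    sections = {}
    for page in pages:
        # dicts preserve insertion order, so sections appear in first-seen order.
        sections.setdefault(page.section, []).append(page)
    for heading, items in sections.items():
        lines += ["", f"## {heading}", ""]
        for page in items:
            lines.append(f"- [{page.title}]({page.url}): {page.description}")
    return "\n".join(lines) + "\n"


# Made-up pages for illustration only.
pages = [
    Page("Quickstart", "https://example.com/docs/quickstart.md", "Install and run", "Docs"),
    Page("API reference", "https://example.com/docs/api.md", "Endpoints and types", "Docs"),
    Page("Changelog", "https://example.com/changelog.md", "Release history", "Optional"),
]
print(build_llms_txt("Example Project", "Hypothetical site used for illustration.", pages))
```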
Value of LLMs.txt
The real value of llms.txt is delivered when an LLM-powered search engine uses the content at inference time. This means that llms.txt is queried once the customer enters a prompt, and a valid response can be generated using the content from your website or documentation site. The LLM-powered search engine loads your entire llms.txt file into the context window, which is feasible because many LLMs now support very large context windows, in some cases a million tokens or more. The content of the llms.txt file is used to generate a response. Citations pointing to the right source article on the website or documentation site can also be produced, which helps customers cross-validate answers if required. When the customer clicks a citation link, the LLM-powered search engine appends UTM parameters, such as source, to the URL (shown in the figure below). Google Analytics picks this up and shows it as AI traffic.
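To make the UTM step concrete, here is a small sketch of how a source parameter could be appended to a citation link and later read back on the analytics side. It assumes the standard utm_source query parameter; the function names and the value "example-ai-search" are placeholders, not what any particular search engine actually sends.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm_source(url: str, source: str) -> str:
    """Return url with utm_source set to source, keeping other query parameters."""
    parts = urlsplit(url)
    # Drop any existing utm_source so the parameter is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "utm_source"
    ]
    query.append(("utm_source", source))
    return urlunsplit(parts._replace(query=urlencode(query)))


def utm_source_of(url: str) -> Optional[str]:
    """Read utm_source back from a landing URL, or None if it is absent."""
    return dict(parse_qsl(urlsplit(url).query)).get("utm_source")


cited = add_utm_source("https://example.com/docs/install?lang=en#step-2", "example-ai-search")
print(cited)
print(utm_source_of(cited))
```

Because only the source is carried in the URL, a site owner can tell which engine sent the visit but not which prompt or answer produced the citation.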
As modern customers flock to AI-powered search engines, brands must increase their visibility by providing trusted information from their sites and using it to drive traffic to their websites or documentation sites.
The llms.txt file is updated as soon as new content is created, old content is updated, or content is deleted. This helps AI-powered search engines to get more value and offer high-accuracy responses with minimal latency to their customers.
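One way to keep the file in step with content changes is to fingerprint the published pages and regenerate llms.txt only when the fingerprint changes. The sketch below is one possible approach under stated assumptions: pages are held as a mapping from path to markdown body, and the function names are hypothetical. A real CMS would more likely trigger regeneration from its own publish, update, and delete events.

```python
import hashlib
from typing import Dict, Optional


def content_fingerprint(pages: Dict[str, str]) -> str:
    """Hash page paths and bodies in sorted order, so dict ordering does not matter."""
    digest = hashlib.sha256()
    for path in sorted(pages):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and body cannot run together
        digest.update(pages[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def needs_regeneration(pages: Dict[str, str], last_fingerprint: Optional[str]) -> bool:
    """True when content was created, updated, or deleted since the last build."""
    return content_fingerprint(pages) != last_fingerprint


# Made-up content for illustration.
published = {"/docs/install.md": "# Install\nRun the installer.", "/docs/faq.md": "# FAQ"}
last = content_fingerprint(published)

print(needs_regeneration(published, last))  # False: nothing changed

updated = dict(published, **{"/docs/install.md": "# Install\nRun the new installer."})
print(needs_regeneration(updated, last))  # True: a page was updated

deleted = {"/docs/install.md": published["/docs/install.md"]}
print(needs_regeneration(deleted, last))  # True: a page was deleted
```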
Uptake of LLMs.txt
The uptake of llms.txt is slow. A few documentation platform vendors and CMS providers offer llms.txt as part of their product offering to their customers. The llms.txt proposal has not been endorsed by the W3C or any other web standards body. It is also not clear whether AI-powered search engines are using llms.txt at inference time. The lack of an analytics toolkit from LLM-powered search engine providers prevents many website administrators and documentation teams from measuring the impact of providing the llms.txt file. Attribution is also harder to quantify with llms.txt, as only the source is appended as a URL parameter. LLM-powered search engine providers need to provide more information and incentives to encourage website owners to publish their whole content in markdown format.
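Until providers ship such analytics, one rough proxy is your own web server's access log. The sketch below counts requests for /llms.txt by user agent, assuming logs in the common "combined" format. The agent tokens listed are examples only and should be checked against each provider's published crawler documentation, and the sample log lines are made up. A fetch recorded here shows that a bot retrieved the file, not that the content was used at inference time.

```python
import re
from collections import Counter

# Combined log format: host ident user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

# Example user-agent tokens (an assumption; verify against provider documentation).
AI_AGENT_TOKENS = ("GPTBot", "PerplexityBot", "ClaudeBot")


def count_llms_txt_fetches(log_lines):
    """Count requests for /llms.txt, grouped by known AI agent token or 'other'."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if not match or match["path"].split("?")[0] != "/llms.txt":
            continue  # skip unparseable lines and requests for other paths
        agent = match["agent"]
        label = next((token for token in AI_AGENT_TOKENS if token in agent), "other")
        counts[label] += 1
    return counts


# Fabricated sample lines for illustration.
sample = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:10:05:00 +0000] "GET /llms.txt HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:10:06:00 +0000] "GET /index.html HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:07:00 +0000] "GET /llms.txt HTTP/1.1" 200 5120 "-" "curl/8.0"',
]
print(dict(count_llms_txt_fetches(sample)))
```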
Even without a Google Search Console-like product from LLM-powered search engine providers, investing in optimizing content for the GenAI era is essential. The llms.txt file is a way forward for providing accurate and up-to-date content to LLMs, as customers value finding accurate responses quickly.
The Future of LLMs.txt in the GenAI Era
LLM-powered search engine providers are innovating rapidly and offering more services. The lack of analytics and attribution will be addressed soon, allowing llms.txt to become the norm in the GenAI world. As more customers adopt LLM-powered search engines and agentic workflows scale, llms.txt will play an indispensable role in the modern web.