By the Legal Policy Generator team · Published 2026-02-09

What is a Robots.txt File and Why Does Your Website Need One?

Every website has invisible visitors: search engine bots, or "crawlers," that constantly scan the internet to index content for search results. The robots.txt file is how you communicate with these bots, telling them which parts of your site they can and cannot access. It is also, increasingly, a file with legal significance — in the European Union it is one of the recognised ways to tell artificial intelligence companies that they may not use your content to train their models.

What is Robots.txt?

A robots.txt file is a plain text file placed in the root directory of your website (e.g., yoursite.com/robots.txt). It implements the Robots Exclusion Protocol, the convention crawlers use to learn which pages or sections of your site they should or shouldn't access.

The protocol is not just an informal habit. Martijn Koster first defined it in 1994, and in September 2022 the Internet Engineering Task Force (IETF) published it as a formal standards-track specification, RFC 9309. That standard specifies, among other things, that the rules "MUST be accessible in a file named '/robots.txt' (all lowercase) in the top-level path of the service", which is why the file only works at the root of a host and not in a subfolder.
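As a small illustration of the "top-level path" rule, the sketch below derives the robots.txt location that governs any given page. The function name robots_url is made up for this example, and the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://yoursite.com/blog/post?id=7"))
# https://yoursite.com/robots.txt
print(robots_url("https://shop.yoursite.com/cart"))
# https://shop.yoursite.com/robots.txt
```

Note that the second call returns a different file: each host, including each subdomain, is served by its own robots.txt.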

Important: robots.txt is a directive, not a security measure. RFC 9309 is explicit that these "rules are not a form of access authorization" (RFC 9309, §2). Well-behaved bots (like Googlebot) honour it, but a crawler that chooses to ignore it can still reach every URL listed. Never use robots.txt to hide sensitive information.

Why Do You Need One?

Control crawl budget: Search engines allocate a limited amount of time to crawl your site. By blocking unimportant pages, you ensure they spend that time on your most valuable content.
Manage how bots access private-but-not-secret areas: Keep crawlers away from staging environments, internal search results, faceted-navigation URLs, or low-value account pages so they don't waste crawl budget on them.
Avoid wasted crawling of duplicate URLs: Stop crawlers from repeatedly fetching parameterized URLs, print-friendly versions, or other near-duplicate pages.
Point to your sitemap: The robots.txt file is the standard place to declare the location of your XML sitemap.
Signal an AI training opt-out: As covered below, blocking AI crawlers here is one of the machine-readable ways to reserve your rights against text-and-data-mining under EU law.

What Robots.txt Is Not: It Won't Keep a Page Out of Google

This is the single most common misunderstanding, so it is worth stating plainly. Blocking a page in robots.txt stops it from being crawled — it does not reliably stop it from being indexed. Google's own documentation says robots.txt "is not a mechanism for keeping a web page out of Google," and warns that "a page that's disallowed in robots.txt can still be indexed if linked to from other sites" (Google Search Central).

If your goal is to keep a page out of search results, Google's guidance is to "block indexing with noindex or password-protect the page" instead. Use robots.txt to manage crawling; use a noindex meta tag or HTTP header (on a page that is not blocked, so the tag can be seen) or genuine access control to manage indexing and confidentiality.
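To see which indexing signal a page actually sends, a rough check like the one below may help. It assumes you already have the response headers and HTML in hand; has_noindex and the helper class are hypothetical names, and the simple substring match is a simplification that ignores crawler-specific values.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True

def has_noindex(headers: dict, html: str) -> bool:
    """Return True if the X-Robots-Tag header or a robots meta tag says noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex

page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex({}, page))                                   # True
print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))  # True
print(has_noindex({}, "<html><head></head></html>"))           # False
```

A crawler can only see either signal if it is allowed to fetch the page, so the URL must not also be disallowed in robots.txt.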

Basic Syntax

The robots.txt file uses a simple syntax with just a few directives, defined in the standard as "rules" grouped under one or more user-agent lines:

User-agent: Specifies which crawler the rules apply to. Use * for all crawlers.
Allow: Indicates that a matching URL path may be crawled.
Disallow: Indicates that a matching URL path should not be crawled.
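Sitemap: Declares the location of your XML sitemap. It is not one of the rules defined in RFC 9309 but a widely supported extension, and it applies to the whole file rather than to a single user-agent group.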
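To make the grouping concrete, here is a minimal sketch of how a file's lines might be sorted into user-agent groups and sitemap entries. It is an illustration only, not a full RFC 9309 parser: parse_groups is a hypothetical helper, and the sample file and URLs are placeholders.

```python
def parse_groups(text: str):
    """Group Allow/Disallow rules under the user-agent lines that precede them."""
    groups = []    # each item: {"agents": [...], "rules": [(field, path), ...]}
    sitemaps = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # A user-agent line that follows rules starts a new group;
            # consecutive user-agent lines share one group.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value)
        elif field in ("allow", "disallow") and current is not None:
            current["rules"].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return groups, sitemaps

SAMPLE = """\
User-agent: *
Disallow: /admin/
Allow: /admin/help/

Sitemap: https://yoursite.com/sitemap.xml
"""

groups, sitemaps = parse_groups(SAMPLE)
print(groups)    # [{'agents': ['*'], 'rules': [('disallow', '/admin/'), ('allow', '/admin/help/')]}]
print(sitemaps)  # ['https://yoursite.com/sitemap.xml']
```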

Two practical limits are worth knowing. First, file size: RFC 9309 says a crawler's parsing limit "MUST be at least 500 kibibytes," and Google applies exactly that figure: it "enforces a robots.txt file size limit of 500 kibibytes (KiB)," and content after that point is ignored (Google robots.txt specification). Second, caching: under the standard, crawlers "SHOULD NOT use the cached version for more than 24 hours" (RFC 9309, §2.4), so a change you make today may take up to a day to take effect.

Common Examples

Allow everything: If you want all crawlers to access your entire site (the most common setup for small sites):

User-agent: *
Allow: /

Block a specific directory: Prevent crawlers from accessing your admin panel:

User-agent: *
Disallow: /admin/

Block specific bots: You can target specific crawlers by name. For example, to ask OpenAI's training crawler to stay off your entire site:

User-agent: GPTBot
Disallow: /
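Before publishing rules like these, it can be worth checking how a parser reads them. The sketch below uses Python's standard urllib.robotparser module on a file that combines the bot-specific and directory examples. The allowed helper and the URLs are illustrative, and this module applies rules in file order, so it may differ from RFC 9309 matching in edge cases.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether agent may fetch path on a placeholder host."""
    return parser.can_fetch(agent, "https://yoursite.com" + path)

print(allowed("GPTBot", "/blog/"))           # False: GPTBot has its own group
print(allowed("Googlebot", "/blog/"))        # True: falls back to the * group
print(allowed("Googlebot", "/admin/users"))  # False: /admin/ is disallowed for everyone else
```

A crawler that matches a named group follows that group only, so GPTBot here never consults the rules under the wildcard.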

Robots.txt, AI Crawlers, and the Law

This is where robots.txt stops being a purely technical file and starts touching legal compliance, which is why it matters for site owners thinking about their policies. Data for training generative AI is increasingly collected by dedicated crawlers such as GPTBot and CCBot, and some companies publish separate robots.txt tokens for AI use, such as Google-Extended, which Google describes as a control token rather than a separate crawler. Whether you can stop them, and whether that "stop" has legal weight, depends on where you are.

In the European Union, the relevant law is the Directive on Copyright in the Digital Single Market. Article 4 creates a general text-and-data-mining exception that any person or organisation with lawful access to the content can rely on for any purpose, including commercial AI training. Article 4(3), however, makes that exception conditional on rightholders not having opted out. By its terms, the exception "shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online" (Directive (EU) 2019/790, Article 4(3)). In plain terms: in the EU, your publicly available content can be mined for AI training by default unless you opt out, and a robots.txt rule blocking AI crawlers is one of the machine-readable ways to express that reservation.

Two important caveats. First, robots.txt is a request, not a lock: as RFC 9309 itself states, it is "not a form of access authorization," so a crawler can ignore it, and a court may treat it more like a "no trespassing" sign than a technical barrier. Second, an opt-out generally only affects future crawling — it does not pull your content back out of models that were already trained. Whether a natural-language reservation in your Terms of Use is enough on its own, or whether it must be machine-readable, is being actively litigated and varies by jurisdiction, so a belt-and-braces approach (a robots.txt signal and a clear statement in your legal pages) is what many site owners adopt.
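For the robots.txt half of that belt-and-braces approach, the opt-out block can be generated from a list of tokens. The sketch below is one way to do it: build_ai_optout is a hypothetical helper, and the token list simply reuses the names mentioned in this article, so check each vendor's current documentation before relying on it.

```python
AI_TOKENS = ["GPTBot", "Google-Extended", "CCBot"]  # names taken from this article, not exhaustive

def build_ai_optout(tokens: list[str]) -> str:
    """Return robots.txt groups asking each listed crawler to stay off the whole site."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    # A blank line between groups keeps the file readable.
    return "\n\n".join(groups) + "\n"

print(build_ai_optout(AI_TOKENS))
```

The output can be appended to an existing robots.txt. Crawlers that are not named keep following whatever rules the rest of the file gives them.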

Common Mistakes to Avoid

Blocking your entire site: Under User-agent: *, a Disallow: / rule tells every compliant crawler to stay away from every page. Make sure this is intentional.
Blocking CSS and JavaScript: Google's crawler needs these to render your pages — its own example robots.txt notes that "Google needs them for rendering" — so blocking them can harm how your pages are indexed (Google robots.txt specification).
Expecting it to hide a page from search: As noted above, a disallowed page can still be indexed if other sites link to it — use noindex or password protection to keep a page out of results (Google Search Central).
Using robots.txt for security: It won't hide content from determined visitors — use authentication instead.
Forgetting the sitemap: Always include a Sitemap: directive pointing to your XML sitemap.
Placing the file in the wrong location: It must be at the root domain level (e.g., yoursite.com/robots.txt), not in a subfolder.
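Some of these mistakes can be caught automatically. The following is a rough sketch of a checker for three of them (a blanket block, blocked CSS or JavaScript, and a missing sitemap). The function name lint_robots and the warning texts are invented for this example, and the pattern checks are deliberately simple.

```python
def lint_robots(text: str) -> list[str]:
    """Return warnings for a few common robots.txt mistakes (illustrative only)."""
    warnings = []
    agents = []         # user-agents of the group currently being read
    seen_rule = False   # True once the current group has an Allow/Disallow line
    has_sitemap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:  # a user-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow":
                if value == "/" and "*" in agents:
                    warnings.append("Disallow: / blocks the entire site for all crawlers")
                if value.rstrip("$").lower().endswith((".css", ".js")):
                    warnings.append(f"Disallow: {value} may block files needed for rendering")
        elif field == "sitemap":
            has_sitemap = True
    if not has_sitemap:
        warnings.append("No Sitemap line found")
    return warnings

print(lint_robots("User-agent: *\nDisallow: /\n"))
# ['Disallow: / blocks the entire site for all crawlers', 'No Sitemap line found']
```

A file with a scoped rule and a Sitemap line, such as Disallow: /admin/ plus Sitemap: https://yoursite.com/sitemap.xml, produces no warnings.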

Generate Yours Instantly

Creating a robots.txt file is simple, but getting it right matters. Use our Free Robots.txt Generator to create a properly formatted file tailored to your needs.

This article is general information about how robots.txt works and how it relates to web crawling and copyright rules. It is not legal advice. Laws on text and data mining and AI training differ by country and are evolving quickly; consult a qualified lawyer about your specific situation before relying on robots.txt as a legal opt-out.