gora.
note · May 14, 2026 · 11 min read

robots.txt: what to block, what to allow

One of my clients last winter complained: "The site can't be found on Google at all. Search for our own brand name and we're not in the results. What did you do?" The site had migrated to a new server a month earlier, everything was working, the design was new, the content was in place. But in the index — zero pages. A clean zero.

I opened https://site.com/robots.txt. There were two lines:

User-agent: *
Disallow: /

That's it. The site was completely closed off to every search engine. This setting had been left over from the staging environment — back when the site was still in development and they didn't want it indexed. During the production deploy the file was simply copied as-is. Nobody checked.

Fixing it took ten seconds: remove the / after Disallow. For Google to re-index the site — three weeks. For organic traffic to come back — six months. A month without indexing means lost signals, lost positions, lost users.

robots.txt is one of the most dangerous files on a site. Not because it's complex. The opposite — because it's too simple. It's plain text, four directives, no validator in the IDE, no automated tests. Misplace a slash, drop a wildcard in the wrong spot, and half the site disappears from search results. Nothing warns you. Nothing highlights the error in red. Google quietly stops crawling where you told it not to, and you find out two weeks later when the rankings have already collapsed.

In this post — what to do so you don't repeat my story.


What robots.txt is

A text file that sits at the root of the domain: https://example.com/robots.txt. Not in a subfolder: at the root, otherwise bots won't find it. One file per host. If you have subdomains, each one needs its own robots.txt.

It's part of the REP standard — Robots Exclusion Protocol. The standard is ancient, from 1994, born before search engines became what they are today. In 2022 the IETF finally turned it into RFC 9309 — a formally documented standard. Before that, every bot interpreted robots.txt however it wanted.

In essence it's an instruction for crawlers: which paths on the site they should not scan. Google, Bing, Yandex, DuckDuckGo, and most well-behaved bots respect it. AI crawlers read robots.txt too: OpenAI's GPTBot and Anthropic's ClaudeBot each follow the block addressed to their own User-agent. Scam crawlers and scrapers usually ignore it, because it isn't a ban, it's a request.

Two things robots.txt does NOT do:

robots.txt is not security. Anyone can open site.com/robots.txt and read the list of what you've blocked. If you write Disallow: /secret-admin-panel/ there — you've just announced the address of your secret admin panel to the entire internet. Hiding private paths with robots.txt is the same as hanging a "safe is here" sign on a house with no lock.

robots.txt is not noindex. Disallow forbids visiting a page. But if there are links to that page from other sites, Google can still show it in search results — with an empty snippet and the note "No information is available for this page." If you want to remove a page from the index permanently — you need noindex, not robots.txt.


Basic syntax

The file consists of blocks. Each block starts with User-agent, followed by directives.

User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml

User-agent: * — rules for all bots at once. The asterisk means "any crawler."

User-agent: Googlebot — rules only for Google. If you want different rules for different bots, you create separate blocks. Googlebot reads its own block and ignores the one with the asterisk. If there's no block for a specific bot — it reads the general one with the asterisk.
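
The group selection described above can be sketched with Python's standard urllib.robotparser. This is an illustration only: the /private/ and /drafts/ paths and the Bingbot name are assumptions made up for the demo, not rules from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one general group, one group just for Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may crawl `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot"):
        for path in ("/private/report", "/drafts/post"):
            print(agent, path, allowed(agent, path))
```

Googlebot follows only its own group, so /private/ stays open to it. If Googlebot should stay out of /private/ as well, the rule has to be repeated in the Googlebot block.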

Disallow: /admin/ forbids every path that starts with /admin/. That covers /admin/, /admin/users, and /admin/settings/permissions. The trailing slash matters, because a rule is a plain prefix match: Disallow: /admin without the slash also closes /admin-panel-secret, /administrator, and anything else that merely begins with those six characters. Not always what you wanted.
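
For rules without wildcards, the matching comes down to a prefix comparison on the URL path. A minimal sketch, with a made-up list of paths, shows what each spelling of the rule catches:

```python
def blocked(rule_path: str, url_path: str) -> bool:
    """Wildcard-free Disallow rules are plain prefix matches on the URL path."""
    return url_path.startswith(rule_path)


# Hypothetical paths, chosen to show the difference the slash makes.
PATHS = ["/admin", "/admin/", "/admin/users", "/admin-panel-secret"]

if __name__ == "__main__":
    for rule in ("/admin/", "/admin"):
        hits = [p for p in PATHS if blocked(rule, p)]
        print(f"Disallow: {rule:8} blocks {hits}")
```

With the slash, the bare /admin URL is not covered; without it, every path that begins with /admin is.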

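Allow: /admin/public/ permits a path. It only makes sense as an exception carved out of a broader Disallow. Under RFC 9309 the order of the lines doesn't decide the winner: the longest matching rule does, so Allow: /admin/public/ beats Disallow: /admin/ for anything under /admin/public/. Without a Disallow, an Allow directive is meaningless, since everything is allowed by default anyway.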
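
When an Allow and a Disallow both match, RFC 9309 resolves the conflict by specificity: the longest matching path wins, and on a tie Allow wins. Here is a minimal sketch of that rule for wildcard-free paths. The function name and the rule tuples are my own illustration, not any library's API.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("disallow", "/admin/")


def is_allowed(rules: List[Rule], url_path: str) -> bool:
    """Longest matching path wins; on a tie Allow wins; no match means allowed."""
    best_len = -1
    best_allow = True
    for directive, path in rules:
        # An empty path matches nothing; otherwise compare as a prefix.
        if not path or not url_path.startswith(path):
            continue
        allow = directive.lower() == "allow"
        if len(path) > best_len or (len(path) == best_len and allow):
            best_len, best_allow = len(path), allow
    return best_allow


RULES = [("disallow", "/admin/"), ("allow", "/admin/public/")]

if __name__ == "__main__":
    for path in ("/admin/users", "/admin/public/help", "/blog/"):
        print(path, is_allowed(RULES, path))
```

Not every parser follows this logic. Python's own urllib.robotparser, for example, applies rules in file order, so it can disagree with Google on a file like this one. Test against the crawler you care about.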

Sitemap: https://example.com/sitemap.xml — the path to the sitemap. You can list several lines if you have multiple sitemaps. Specify the full URL, not a relative one.

Wildcards: * matches any sequence of characters, $ marks the end of the URL.

Disallow: /*.pdf$
Disallow: /*?session=

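The first line blocks every URL that ends in .pdf. The second blocks URLs whose query string starts with session=. If the parameter can appear later in the query string (?lang=en&session=...), you need a second rule: Disallow: /*&session=.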
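
One way to reason about wildcard rules is to translate them into regular expressions. The sketch below is a simplified model that assumes * means "any characters" and a trailing $ means "end of URL". It is not Google's actual matcher, and the test URLs are invented.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a regex anchored at the start of the path."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces, then let each '*' match any run of characters.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        pattern += "$"
    return re.compile(pattern)


def matches(rule: str, url_path: str) -> bool:
    """True if the rule applies to the given path (query string included)."""
    return rule_to_regex(rule).match(url_path) is not None


if __name__ == "__main__":
    for rule, path in [
        ("/*.pdf$", "/docs/report.pdf"),
        ("/*.pdf$", "/docs/report.pdf?download=1"),
        ("/*?session=", "/page?session=abc"),
        ("/*?session=", "/page?lang=en&session=abc"),
    ]:
        print(rule, path, matches(rule, path))
```

Two things tend to surprise people here. The $ anchor means a PDF URL with a query string is no longer matched, and ?session= only matches when session is the first parameter.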

URL paths are case-sensitive. /Admin/ and /admin/ are two different paths. If your site has such a split, you need to close both.

Comments — via #. Use them liberally, so that six months from now you understand what you wrote here and why.


What to typically block

Not everything in sight. Only what shouldn't be cluttering the index or burning crawl budget for nothing.

/admin/, /wp-admin/, /login/, /dashboard/ — internal pages for administrators and signed-in users. Google has nothing to do there; those pages either require authentication or exist for you alone.

/cart/, /checkout/, /account/, /order/ — private e-commerce pages. Each user has their own cart, and their own personal account too. Indexing them means breeding duplicates and draining crawl budget.

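/search/ covers internal site search pages. Each search query a user types is a new URL. If you don't block them, Google will find a million such URLs through links and try to index them all. The result: thousands of thin-content pages that drag the whole site down. Match the rule to your URL shape: Disallow: /search/ covers /search/shoes but not /search?q=shoes, which needs Disallow: /search? or the broader Disallow: /search.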

/api/ — JSON endpoints, REST APIs. These aren't pages, they're data. The crawler should not see them as content.

?session=, ?utm_source=, ?ref= — URLs with tracking parameters. They create duplicates of the same page. The best fix is canonical, but if setting up canonical is tricky, you can partially cover this with robots.txt. Just be careful: if you block ?utm_*, you'll also block pages with real content reached through those URLs.

/staging/, /preview/, /test/, /dev/ — separate environments. Often these are subdomains (staging.site.com), in which case they have their own robots.txt and that's the cleaner approach. But if staging lives on the main domain in a subfolder — block it.

Old expired landing pages. You ran an ad campaign, built /black-friday-2024/, it has served its purpose — block it until you delete it.

Files with duplicate content. A PDF version of a page, an RSS feed with full text, print versions — if they duplicate the main content, keep them out of the index.
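
If you generate robots.txt from code, you can force every rule to carry its reason. The sketch below is one possible shape. The render_robots helper, the rule list, and the sitemap URL are all hypothetical, so adapt the paths to your own site.

```python
from typing import List, Tuple


def render_robots(rules: List[Tuple[str, str]], sitemap_url: str) -> str:
    """Build one `User-agent: *` group; each Disallow is preceded by its reason."""
    lines = ["User-agent: *"]
    for path, reason in rules:
        lines.append(f"# {reason}")
        lines.append(f"Disallow: {path}")
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Hypothetical rule set for a small shop.
BLOCKED = [
    ("/admin/", "back office, requires login"),
    ("/cart/", "per-user page, no search value"),
    ("/checkout/", "per-user page, no search value"),
    ("/search", "internal search results, endless thin URLs"),
    ("/api/", "JSON endpoints, not pages"),
]

if __name__ == "__main__":
    print(render_robots(BLOCKED, "https://example.com/sitemap.xml"))
```

Keeping the rules in a reviewed list also gives you a diff in version control every time someone adds a Disallow.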


What NOT to block

This is the part where people most often shoot themselves in the foot.

CSS and JavaScript. Since 2014 Google renders pages like a browser — it loads the HTML, then pulls in CSS and JS, runs the scripts, and looks at the final result. If you block /wp-content/themes/ or /static/js/ — Google sees the site without styles and without JS logic. The mobile-friendly test fails. The layout breaks. Content rendered by JS doesn't make it into the index. Never block static assets.

/sitemap.xml itself. I once saw a case where someone added Disallow: /sitemap.xml because "I don't want competitors to see my structure." The sitemap is precisely what Google reads on purpose to know what you have. By blocking the sitemap, you've amputated your own indexing.

Images, if you want them indexed in Google Images. Blocking /images/ or /uploads/ is a common mistake after migrating away from a website builder. Images get their own stream of traffic via Google Images, and blocking them kills that channel.

PDFs and files with unique content. If you have PDF documentation, reports, technical whitepapers — they can rank on their own and bring in traffic. Blocking them in robots.txt means losing visibility.

Pages with noindex in meta. This is the counterintuitive one. If you want to remove a page from the index via a noindex meta tag, Google has to visit the page, read the HTML, and see that tag. If you additionally block it in robots.txt — Googlebot won't go there, won't see the noindex, and the page may stay in the index thanks to external links. The logic is inverted: to remove a page through noindex, you need to leave it OPEN in robots.txt.

JSON-LD and structured data stored as separate files. If your schema.org markup lives in /schema/, keep that folder open.


robots.txt vs noindex vs canonical

Three different tools. People confuse them all the time, and that's where the most beautiful SEO bugs come from.

robots.txt Disallow means "don't go there." The crawler won't visit the page. That doesn't mean the page is out of the index: if external links point to it, Google knows it exists and can show it in search — with the "No information is available for this page" note, or even with a title taken from the anchor text of external links. Disallow saves crawl budget, but it does not guarantee deindexing.

<meta name="robots" content="noindex"> means "come, read, but don't index." Googlebot visits the page, sees the meta tag, and the page drops out of the index. This is the reliable way to remove a page from search results while the page itself stays live. The key condition: the page must be ACCESSIBLE to the crawler. If it's blocked in robots.txt, Google won't visit, won't see the noindex, and the page will keep showing in results via external links.

<link rel="canonical" href="..."> means "index that version instead of me." Used when you have several URLs with the same content and you want to tell Google which one to treat as the main one. Duplicates from parameters, print versions, AMP pages — all of this is solved with canonical, not robots.txt. See when you need a canonical URL.

A cheat sheet for typical tasks:

— You want to remove a page from the index permanently: open it in robots.txt + add a noindex meta tag + wait two to four weeks while Google recrawls and drops it.

— You want to save crawl budget on unnecessary pages: block them in robots.txt. They may stay in the index, but Google will stop crawling them.

— You want to consolidate duplicates: canonical. Not robots.txt.

— You want to fully remove a page: 410 Gone in the HTTP status + remove from the sitemap. robots.txt only gets in the way here.

The most common mistake: "I blocked the page through robots.txt AND added noindex for safety." That isn't "for safety" — it's a conflict. Google can't read the noindex and the page gets stuck in the index.
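
This conflict is easy to detect mechanically. The sketch below flags a page that carries a noindex meta tag and is also closed to Googlebot. It uses Python's standard library only, the function names and sample HTML are invented, and the stdlib parser handles plain prefixes only, so treat it as a rough check rather than a replica of Google's behavior.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Sets `noindex` when the page has <meta name="robots" content="...noindex...">."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        is_robots = attr.get("name", "").lower() == "robots"
        if is_robots and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


def noindex_is_unreachable(robots_txt: str, path: str, html: str) -> bool:
    """True when the page asks for noindex but Googlebot may not fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch("Googlebot", "https://example.com" + path)
    return has_noindex(html) and not crawlable
```

Run it over every URL you have marked noindex; anything it flags needs the Disallow removed before the tag can take effect.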


How to check

The simplest way — open it in a browser or with curl:

curl https://site.com/robots.txt

You see immediately what the server returns. If you get a 404 — you don't have a robots.txt, which isn't a catastrophe on its own, but it's better to create one anyway.

Google Search Console has a robots.txt report under Settings → Crawling; it replaced the old standalone robots.txt Tester. The report shows which robots.txt files Google found for the site, when each was last fetched, and any lines it could not parse. Useful when the rules are complex and use wildcards.

The URL Inspection tool in GSC shows, for a specific URL, whether it's blocked and by which rule. Perfect for diagnosing "why isn't this page indexing."

Screaming Frog SEO Spider and similar crawlers flag every URL blocked by robots.txt as they crawl the site. If you run one before a deploy, you'll catch the problem before it reaches production.

After every change to robots.txt, manually re-check the URLs of your key pages. Then request a recrawl of the file from the robots.txt report in GSC; left alone, Google rereads it on its own schedule, usually within a day.
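
The manual re-check can be backed by a small script. A sketch, assuming a hand-maintained list of key paths (the ones below are placeholders) and Python's standard urllib.robotparser, which understands plain prefixes but not * or $:

```python
from typing import List
from urllib.robotparser import RobotFileParser

# Placeholder list: pages and assets that must always stay crawlable.
KEY_PATHS = ["/", "/pricing", "/blog/", "/static/js/app.js", "/sitemap.xml"]


def blocked_key_paths(
    robots_txt: str, key_paths: List[str], agent: str = "Googlebot"
) -> List[str]:
    """Return the key paths that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path in key_paths
        if not parser.can_fetch(agent, "https://example.com" + path)
    ]


if __name__ == "__main__":
    candidate = "User-agent: *\nDisallow: /static/\n"
    print(blocked_key_paths(candidate, KEY_PATHS))
```

An empty list means the key pages are still reachable. Anything else is a reason to stop and reread the file.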


Bottom line

robots.txt — five lines of text that can cost you a year of organic traffic. It's not a place for experiments and not a place for "I'll do it the way the neighbor did."

The basic rule: only block what actually shouldn't be in the index. Not out of paranoia, not "just in case." Every directive must have a reason, and that reason must be recorded in a comment.

Before every deploy — open robots.txt and read it with your own eyes. It's five lines. It takes ten seconds. It saves you months of rebuilding rankings.
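
Reading the file yourself is the baseline; a guard in the deploy pipeline is cheap insurance on top. This is a minimal sketch of such a guard. It only looks for a bare Disallow: / in the general group, and the function name is my own.

```python
def blocks_whole_site(robots_txt: str) -> bool:
    """True if a `User-agent: *` group contains a bare `Disallow: /`."""
    agents: list = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


if __name__ == "__main__":
    staging_leftover = "User-agent: *\nDisallow: /\n"
    print(blocks_whole_site(staging_leftover))
```

Wired into a production deploy step, a True result should fail the build. The staging leftover from the opening story would have been caught right there.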

And remember: robots.txt blocks crawling, not indexing. For indexing there's noindex. For duplicates — canonical. Don't mix up the tools. See also 30+ SEO factors and hreflang: subdomain vs subdirectory — those areas also have plenty of spots where a single line can erase six months of work.

robots.txt: what to block, what to allow · hiregora.com