How do I see every URL that's disallowed in my robots.txt?
robots.txt lists patterns (e.g. Disallow: /admin/), not individual URLs. To see every URL that's disallowed you have to (1) parse the file for Disallow (and Allow) rules, then (2) turn those patterns into real URLs. You can expand patterns into example URLs, check your sitemap against the rules, or both. This guide explains how each step works, how pattern matching behaves, and how to get a single list you can act on.
What robots.txt actually gives you
A typical block looks like:
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /*?*
Sitemap: https://example.com/sitemap.xml
That tells crawlers “don’t request these patterns.” It does not list individual URLs. So “every URL disallowed” means: every URL that matches those patterns. To “see” them you either:
- Expand each pattern into one or more example URLs (so you see “every type of URL disallowed”), or
- Take a known set of URLs (e.g. from your sitemap) and test each against the rules (so you see “every sitemap URL that’s disallowed”), or
- Do both — expanded patterns plus sitemap URLs that match — for the most complete picture.
How Disallow (and Allow) matching works
Most crawlers treat Disallow as a prefix match: a URL is disallowed if its path (including the query string, in implementations that check it) starts with the Disallow value. Allow can override a broader Disallow (e.g. Disallow: / with Allow: /public/). When parsing:
- Use the User-agent: * block (or the block for the crawler you care about, e.g. Googlebot).
- Collect every Disallow and Allow in that block. Order can matter: some implementations use "first match wins," others "most specific match wins."
- For a given URL, check whether it matches any Disallow; if so, check whether any Allow overrides it. A URL that matches a Disallow and is not overridden by an Allow is disallowed.
Wildcards: the original robots.txt convention didn't define *, but Google and others support it, and RFC 9309 now standardizes * and $ (e.g. Disallow: /*?* blocks any URL with a query string). When expanding patterns, handle both literal paths and wildcards.
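The matching logic above can be sketched in a few lines of Python. This follows Google's documented rule — the longest matching rule wins, with Allow beating Disallow on a tie — and supports the * and $ wildcards. It's a minimal illustration, not a full RFC 9309 parser.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    # Translate a robots.txt path pattern into a regex:
    # '*' matches any sequence of characters, '$' anchors the end.
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.compile(regex)

def is_disallowed(path: str, disallows: list[str], allows: list[str]) -> bool:
    # Longest matching rule wins; on a tie, Allow beats Disallow
    # (Google's documented tie-break). Empty rules are ignored.
    best_len, blocked = -1, False
    rules = [(r, True) for r in disallows] + [(r, False) for r in allows]
    for rule, is_block in rules:
        if rule and rule_to_regex(rule).match(path):
            if len(rule) > best_len or (len(rule) == best_len and not is_block):
                best_len, blocked = len(rule), is_block
    return blocked

print(is_disallowed("/admin/login", ["/admin/"], []))  # True: prefix match
print(is_disallowed("/blog/post", ["/"], ["/blog/"]))  # False: the longer Allow wins
print(is_disallowed("/page?x=1", ["/*?*"], []))        # True: wildcard match
```

Implementations differ here: Python's standard-library urllib.robotparser, for instance, uses first-match-wins in file order and ignores wildcards, so the same file can produce different answers in different tools.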
Option 1: Expand patterns into example URLs
For each Disallow value, produce at least one concrete URL so you can “see” what’s blocked:
- Literal path (e.g. /admin/) → the disallowed URL is https://yoursite.com/admin/ and everything under it. List that URL and note "covers this path and below."
- Literal file (e.g. /config.php) → list https://yoursite.com/config.php.
- Wildcard (e.g. /*?*) → every URL with a query string is disallowed. You can't list them all; list a representative (e.g. https://yoursite.com/?x=1) and note "all URLs with query string."
- Multiple rules → combine them, then deduplicate so the same URL isn't listed twice (e.g. /admin/ and /admin/login/ might both match; list one and note the scope).
Result: a list of representative or example URLs (one per pattern or path), with a note that some rules cover infinitely many URLs. That answers “what kinds of URLs are disallowed.” For “every actual URL” you need a source of URLs (e.g. sitemap).
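A sketch of this expansion step, assuming a hypothetical base URL (yoursite.com) and the rules from the example file above:

```python
BASE = "https://yoursite.com"  # assumption: your site's root URL

def example_url(disallow: str) -> tuple[str, str]:
    # Turn one Disallow pattern into a representative URL plus a scope note.
    if "*" in disallow or "$" in disallow:
        # Wildcards cover infinitely many URLs; build one placeholder sample.
        sample = disallow.replace("$", "").replace("*", "x")
        return BASE + sample, "wildcard: covers all matching URLs"
    if disallow.endswith("/"):
        return BASE + disallow, "covers this path and everything below it"
    return BASE + disallow, "this path (prefix match may cover more)"

rules = ["/admin/", "/wp-admin/", "/config.php", "/*?*"]
seen = set()
for r in rules:
    url, note = example_url(r)
    if url not in seen:  # deduplicate: several rules can yield the same URL
        seen.add(url)
        print(f"{url}  ({note})")
```

The scope note matters as much as the URL itself: it records whether a listed URL stands for one page or an entire subtree.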
Option 2: Check your sitemap against the rules
If you have a sitemap, you can list every sitemap URL that’s disallowed:
- Fetch robots.txt and parse Disallow/Allow for User-agent: *.
- Fetch every sitemap (from robots.txt Sitemap: lines and common paths such as /sitemap.xml). Collect all <loc> URLs.
- For each sitemap URL, test whether it matches any Disallow (and whether any Allow overrides it). If it matches a Disallow, mark it "disallowed."
- List those. Also keep the expanded pattern URLs from Option 1 for paths that have no sitemap entry (e.g. /admin/).
Result: “Every sitemap URL that’s disallowed” plus “every disallowed pattern with at least one example URL.” That’s the closest to “every URL disallowed” without crawling the entire site.
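The sitemap check can be sketched with the Python standard library. Two caveats: urllib.robotparser does plain prefix matching (no * or $ wildcard support), so this handles literal rules only, and the robots.txt and sitemap content are inlined here where a real script would fetch both over HTTP.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# Inlined sample data; a real script would fetch these over HTTP.
ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
"""

SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/admin/panel</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
disallowed = [loc.text for loc in root.findall(".//sm:loc", ns)
              if not rp.can_fetch("*", loc.text)]
print(disallowed)  # ['https://example.com/admin/panel']
```

When fetching the sitemap for real, also handle sitemap index files (a `<sitemapindex>` that points at further sitemap files), which need one extra level of fetching before you reach the `<loc>` URLs.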
Option 3: Use a tool that does both
A tool that fetches robots.txt, parses the rules, expands them into example URLs, and checks your sitemap URLs against those rules gives you one list: “these URLs (or paths) are disallowed.”
Hidden Pages does this: enter your site, and you get a list of disallowed URLs (from expanded patterns and, where applicable, sitemap URLs that match Disallow). You see every type of blocked URL and every sitemap URL that’s blocked in one place.
Common pitfalls
- Ignoring Allow — Some sites use Disallow: / with Allow: /blog/. If you only look at Disallow, you'll think the whole site is blocked. Apply Allow rules when testing URLs.
- Wrong User-agent block — robots.txt can have multiple blocks (e.g. User-agent: * and User-agent: Googlebot). Use the block that applies to the crawler you care about (usually *).
- Assuming robots.txt is the only signal — Disallow controls crawling. Indexing can also be controlled by noindex on the page or in headers. To see "what's blocked from indexing" you need both robots.txt and per-URL noindex checks.
- Not deduplicating — Several Disallow rules can match the same URL. Deduplicate so each URL appears once in your list.
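The Allow pitfall is easy to demonstrate with the standard library. One caveat worth knowing: urllib.robotparser uses first-match-wins in file order, so the Allow line must appear before the broader Disallow for the override to take effect (Google instead uses longest-match-wins regardless of order).

```python
import urllib.robotparser

# Allow listed before Disallow so urllib.robotparser's first-match-wins
# ordering applies the override.
ROBOTS = """\
User-agent: *
Allow: /blog/
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("*", "https://example.com/blog/post"))  # True: Allow overrides
print(rp.can_fetch("*", "https://example.com/about"))      # False: Disallow: /
```

A tester that ignored the Allow line would wrongly report /blog/post — and the entire site — as blocked.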
Frequently asked questions
Can I get a list of every URL on my site that’s disallowed?
Only if you have a list of every URL (e.g. from a full crawl). robots.txt doesn’t list URLs; it lists patterns. So you can get (1) every pattern expanded to example URLs, and (2) every sitemap URL that’s disallowed. For a full-site list you’d need a full crawl and then filter by the rules.
Does Disallow block the URL from being indexed?
Disallow tells crawlers not to request the URL. Google typically won’t index URLs it doesn’t crawl, so in practice disallowed URLs are usually not indexed. But the formal “don’t index” signal is noindex on the page or in headers.
What if my sitemap lists URLs that are disallowed?
That’s a conflict: the sitemap says “consider these URLs” and robots.txt says “don’t crawl them.” Resolve it by either allowing those URLs in robots.txt (if you want them crawlable) or removing them from the sitemap.
Summary
You see “every URL disallowed” by (1) parsing robots.txt for Disallow/Allow in the right User-agent block, (2) expanding each pattern into at least one example URL, and (3) optionally checking your sitemap URLs against the rules to list every sitemap URL that’s disallowed. Handle Allow and wildcards correctly, and deduplicate. For a single combined list without scripting, use a tool that does the parsing and checking for you.