How to find pages blocked by robots.txt
If you’ve ever opened a site’s robots.txt and seen a long list of Disallow: rules, you’ve probably wondered: what URLs does this actually block? The file only shows patterns (like /admin/ or /*?sort=), not a full list of live URLs. Getting that list by hand is tedious. This guide explains why you need it, how robots.txt pattern matching works, how to turn patterns into a concrete list (and how to cross-check with your sitemap), and what to do with the results so you can fix mistakes and keep sitemap and crawler instructions in sync.
Why you need a list of blocked pages
robots.txt tells crawlers which paths they shouldn’t request. It doesn’t hide pages from people who have the URL; it only gives crawlers instructions. Common uses:
- Blocking admin, staging, or internal tools from being indexed
- Hiding duplicate or parameter-heavy URLs (e.g. ?sort=price, ?utm_)
- Keeping search engines out of thin or low-value sections
For SEO and audits, you often need to know:
- Which real URLs match those rules (not just the rules themselves)—so you can review and fix accidental blocks.
- Whether any of those URLs are also in your sitemap (contradiction: “don’t crawl this” vs “here’s a URL to consider”). Resolving that keeps your signals consistent.
- Whether something important was accidentally disallowed and never indexed. A list lets you spot and fix it.
So “find pages blocked by robots.txt” usually means: turn the Disallow patterns into a concrete list of URLs (or representative URLs) you can review, and optionally list every sitemap URL that’s blocked.
What robots.txt actually gives you
A typical block looks like:
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /*?*
Allow: /blog/
Sitemap: https://example.com/sitemap.xml
That tells you patterns, not a sitemap. To know “what’s blocked” you have to:
- Parse every Disallow (and Allow) rule in the User-agent: * (or your target bot) section. Order and specificity matter: Allow can override Disallow in many implementations.
- For each pattern, figure out which URLs on the site match. Literal paths like /admin/ match that path and everything under it. Wildcards like /*?* match any URL with a query string. You need to expand patterns into example URLs and/or test a known set of URLs (e.g. your sitemap) against the rules.
- Cross-check with your sitemap — fetch all sitemaps, extract every <loc> URL, and test each against the rules. Sitemap URLs that match a Disallow are “in sitemap but blocked.” Those are the conflicts to fix.
Doing that manually means reading the file, understanding pattern matching, and either guessing example URLs or crawling the sitemap and testing each URL against the rules. It’s slow and error-prone without a script or tool.
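If you do want to script it, the parsing step is straightforward. The sketch below is a minimal Python example, not a full RFC 9309 parser (it ignores subtleties such as rule groups that span blank lines in unusual ways); it collects the Disallow and Allow paths for one user-agent:

```python
def parse_robots(text: str, agent: str = "*"):
    """Collect Disallow/Allow paths for one User-agent group.

    Minimal sketch: real-world parsers handle more edge cases
    (merged groups, BOMs, case oddities) than this does.
    """
    disallows, allows = [], []
    in_group, seen_rule = False, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:                        # a rule line ended the previous group
                in_group, seen_rule = False, False
            if value.lower() == agent.lower():
                in_group = True
        elif field in ("disallow", "allow"):
            seen_rule = True
            if in_group and value:               # an empty Disallow blocks nothing
                (disallows if field == "disallow" else allows).append(value)
    return disallows, allows

robots = """\
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /*?*
Allow: /blog/
Sitemap: https://example.com/sitemap.xml
"""
print(parse_robots(robots))
# → (['/admin/', '/wp-admin/', '/*?*'], ['/blog/'])
```

Note that Sitemap: lines are simply skipped here; they matter for method 2 below, not for the rules themselves.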
How to find pages blocked by robots.txt (method 1: expand patterns)
For each Disallow value, produce at least one concrete URL:
- Literal path (e.g. /admin/) → https://yoursite.com/admin/ (and everything below). List that URL and note “covers this path and below.”
- Literal file (e.g. /config.php) → https://yoursite.com/config.php.
- Wildcard (e.g. /*?*) → you can’t list every URL; list a representative (e.g. https://yoursite.com/?x=1) and note “all URLs with a query string.”
Deduplicate so the same URL isn’t listed twice. Result: “Every type of URL disallowed” with at least one example per pattern. For “every sitemap URL disallowed” you need method 2.
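This expansion is mechanical enough to script. A minimal sketch in Python — the example_url helper and the yoursite.com base are hypothetical, and the placeholder substitution is just one reasonable choice:

```python
def example_url(base: str, pattern: str) -> str:
    """Produce one representative URL for a Disallow pattern.

    Hypothetical helper: literal patterns are appended as-is; each
    '*' wildcard is replaced by a placeholder character, and a
    trailing '$' end anchor is dropped.
    """
    concrete = pattern.rstrip("$").replace("*", "x")
    return base.rstrip("/") + concrete

patterns = ["/admin/", "/config.php", "/*?*", "/admin/"]   # note the duplicate
# A set deduplicates patterns that expand to the same URL.
examples = sorted({example_url("https://yoursite.com", p) for p in patterns})
for url in examples:
    print(url)
# → https://yoursite.com/admin/
#   https://yoursite.com/config.php
#   https://yoursite.com/x?x
```

The wildcard expansion (/x?x) is only a representative; annotate it as “all URLs with a query string” in your report.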
How to find pages blocked by robots.txt (method 2: check sitemap)
- Fetch robots.txt and parse Disallow/Allow for User-agent: *.
- Fetch every sitemap (from Sitemap: lines and common paths). Extract every <loc> URL.
- For each sitemap URL, test whether it matches any Disallow (and whether any Allow overrides). If it matches a Disallow, it’s “blocked.”
- List those. Also keep the expanded pattern URLs from method 1 for paths that don’t appear in the sitemap (e.g. /admin/).
Result: “Every sitemap URL that’s blocked” plus “every disallowed pattern with an example.” That’s the most complete list without a full-site crawl.
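The matching and sitemap-check steps can be sketched in Python. The matcher below follows Google-style precedence (the longest matching rule wins, and Allow wins ties); the sitemap snippet and the rule lists are hypothetical stand-ins for what you would fetch from the live site:

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def rule_matches(pattern: str, path: str) -> bool:
    """'*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + "".join(".*" if c == "*" else re.escape(c) for c in body)
    return re.match(regex + ("$" if anchored else ""), path) is not None

def is_blocked(path: str, disallows, allows) -> bool:
    """Longest matching rule wins; Allow wins ties (Google-style precedence)."""
    d = max((len(p) for p in disallows if rule_matches(p, path)), default=-1)
    a = max((len(p) for p in allows if rule_matches(p, path)), default=-1)
    return d > a

# Hypothetical sitemap; in practice, fetch every sitemap listed in robots.txt.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/admin/reports</loc></url>
  <url><loc>https://example.com/products?sort=price</loc></url>
</urlset>"""

disallows, allows = ["/admin/", "/*?*"], ["/blog/"]
urls = [loc.text.strip() for loc in ET.fromstring(sitemap_xml).iter(NS + "loc")]
for url in urls:
    p = urlparse(url)
    path = p.path + ("?" + p.query if p.query else "")   # match against path + query
    if is_blocked(path, disallows, allows):
        print("in sitemap but blocked:", url)
# → in sitemap but blocked: https://example.com/admin/reports
#   in sitemap but blocked: https://example.com/products?sort=price
```

Note the precedence effect: a URL like /blog/post?x=1 matches Disallow: /*?* (4 characters) but also the longer Allow: /blog/ (6 characters), so it counts as allowed.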
How to find pages blocked by robots.txt (the fast way)
Use a tool that reads robots.txt, resolves the rules (including Allow), expands Disallow into example URLs, and checks your sitemap URLs against those rules. You get one list: blocked patterns and blocked sitemap URLs.
Hidden Pages does exactly that:
- Enter the site URL (yours or any site you’re auditing).
- The tool fetches and parses robots.txt, then expands the Disallow rules into concrete URLs (or representative URLs for wildcard patterns).
- It pulls your sitemap (from robots.txt or common paths) and marks which sitemap URLs are disallowed—so you see “in sitemap but blocked by robots.txt” in one list.
- You get a single report: clearly blocked URLs, plus the raw robots.txt so you can double-check.
Run a scan, review the list, and fix any rules or sitemap entries that don’t match your intent.
What to do with the list
Once you have the list of pages blocked by robots.txt:
- Audit for mistakes — Should that path really be disallowed? If not, remove or narrow the rule (e.g. allow a subpath with Allow).
- Fix sitemap vs robots conflicts — If a URL is in your sitemap but disallowed, either remove it from the sitemap or allow it in robots.txt (and ensure the page is worth indexing).
- Document — Keep the list for compliance, handoffs, or future crawler audits.
- Re-run periodically — After changing robots.txt or the sitemap, run the check again to confirm nothing new is accidentally blocked.
Common pitfalls
- Ignoring Allow — Sites often use Disallow: / with Allow: /blog/. If you only look at Disallow, you’ll think the whole site is blocked. Apply Allow rules when testing URLs.
- Wrong User-agent block — robots.txt can have multiple blocks. Use the one that applies to the crawler you care about (usually User-agent: *).
- Not deduplicating — Several Disallow rules can match the same URL. List each blocked URL once.
- Assuming “blocked” means “not indexed” — Disallow controls crawling. Google typically won’t index what it doesn’t crawl, but the formal “don’t index” signal is noindex on the page. For a full picture, combine robots.txt with noindex checks if needed.
Frequently asked questions
Does robots.txt block users from visiting the page?
No. It only instructs crawlers. Anyone with the URL can visit the page unless you protect it with auth or server rules.
Can I get a list of every URL on my site that’s blocked?
You can get (1) every pattern expanded to example URLs, and (2) every sitemap URL that’s blocked. For every URL on the site you’d need a full crawl and then to filter by the rules.
What if I want to block crawling but still have the URL in the sitemap?
That’s inconsistent. Crawlers that respect robots.txt won’t request the URL, so submitting it in the sitemap doesn’t help. Prefer: either allow crawling (and use noindex if you don’t want indexing) or remove from the sitemap.
Summary
Finding every page blocked by robots.txt means: (1) parsing Disallow (and Allow) for the right User-agent block, (2) expanding each pattern into at least one example URL, and (3) optionally checking your sitemap URLs against the rules to list every sitemap URL that’s blocked. Use the list to fix mistakes, resolve sitemap conflicts, and document what’s intentionally blocked. For a single report without scripting, use a tool that does the parsing and sitemap check for you.