How to find hidden pages on any website
Hidden pages are URLs that exist on a site but are hard to find through normal browsing or search-engine crawling. They might be blocked by robots.txt, marked noindex, not linked from anywhere, or only referenced in sitemaps, JavaScript, or old archives. Finding them matters for SEO audits, security reviews, and cleaning up your site. This guide explains what counts as hidden, how to find each type, how to combine everything into one report, and what to do with the results—so you can run the same process on your own site or any site you’re auditing.
What counts as a “hidden” page?
A page is “hidden” when it’s not easily discoverable by users or crawlers. For a practical audit we care about these cases:
- Blocked by robots.txt — Disallow rules tell crawlers not to request certain paths. The URLs still work if you have the link; crawlers are instructed to skip them. You see patterns in the file, not a list of URLs—you have to expand or match patterns to get a real list.
- Noindex — The page has a noindex directive (meta tag or HTTP header), so search engines are told not to index it even if they crawl it. Noindex is per-URL; there’s no single file that lists all noindex pages.
- Unlinked (orphaned) — The URL exists and returns 200, but no other page on the site links to it. Crawlers and users won’t find it by following links. You need a full URL set (e.g. sitemap) and a “linked” set from a crawl, then subtract.
- Only in sitemap — Listed in an XML sitemap but not linked from the main site. Google may still discover it via the sitemap, but it’s “hidden” from normal navigation and gets no internal link equity.
- In scripts or comments — URLs appear only in JavaScript, HTML comments, or external archives (e.g. Wayback Machine), not as normal crawlable links. These are candidates; many will be 404 or not meant to be public, so they need verification.
For audits, you usually want a single list that combines these sources so you can decide what should stay hidden, what should be fixed, and what should be removed.
Why you’d want to find hidden pages
- SEO — Find accidental blocks or noindex on pages you want indexed; find sitemap/robots conflicts; discover thin or duplicate content that’s still reachable. Cleaning this up keeps crawl budget and indexing aligned with your goals.
- Security and compliance — Discover admin paths, old backups, or staging URLs that shouldn’t be reachable or should be locked down. A full list supports access control and documentation.
- Content cleanup — Identify orphaned or legacy pages to redirect, consolidate, or remove. Unlinked pages are often forgotten or duplicate.
- Due diligence — When auditing a site you’re acquiring or partnering with, see what’s really there beyond the public nav. One report (disallowed, noindex, unlinked, possible) gives you the picture.
How to find hidden pages (step by step)
1. Get robots.txt and list disallowed URLs
Fetch the site’s robots.txt (e.g. https://example.com/robots.txt). Parse every Disallow (and Allow, if present) for User-agent: *. Those are patterns, not a full URL list. To get a usable list you can:
- Expand patterns — Turn /admin/ into https://example.com/admin/ (and note “covers this path and below”). For wildcards like /*?*, list a representative URL (e.g. the homepage with ?x=1) and note “all URLs with query string.”
- Cross-check with the sitemap — Fetch all sitemaps, extract every <loc> URL, and test each against the Disallow/Allow rules. Sitemap URLs that match Disallow are “in sitemap but blocked.” That’s a conflict to fix.
Result: “These URLs (or paths) are blocked by robots.txt” and optionally “these sitemap URLs are blocked.”
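As a minimal illustration, here is a Python sketch of this step using only the standard library. It is deliberately simplified: real robots.txt files can have multiple User-agent lines per group, Allow precedence, and wildcard patterns, which need a proper matcher such as Python’s urllib.robotparser.

```python
def parse_disallows(robots_txt):
    """Collect Disallow patterns from the 'User-agent: *' group."""
    patterns, in_star_group = [], False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            in_star_group = (value == "*")
        elif key == "disallow" and in_star_group and value:
            patterns.append(value)
    return patterns

def is_blocked(path, patterns):
    """Naive prefix match; patterns containing '*' need a regex matcher."""
    return any(path.startswith(p) for p in patterns if "*" not in p)

robots = """User-agent: *
Disallow: /admin/
Disallow: /tmp/"""
rules = parse_disallows(robots)           # ['/admin/', '/tmp/']
print(is_blocked("/admin/users", rules))  # True
```

To cross-check the sitemap, run each sitemap URL’s path through is_blocked and flag the matches as “in sitemap but blocked.”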
2. Check which pages are noindex
Noindex is in the HTML (<meta name="robots" content="noindex">) or in HTTP headers (X-Robots-Tag: noindex). To find noindex pages you must request the URLs and inspect the response. You can’t get a full list from robots.txt—noindex is per-URL. Options:
- Sample from sitemap — Fetch sitemap URLs, request each (or a sample), check header and first few KB of HTML for noindex. Throttle (batches, delays) to avoid overloading the server.
- Sample from crawl — Crawl from the homepage, collect URLs, then check a subset for noindex. Broader coverage if you combine with sitemap.
Result: “These URLs send noindex.” Review for mistakes (wrong template) and for sitemap consistency (don’t submit noindex URLs in the sitemap).
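The check itself is cheap once you have a response. A hedged sketch, standard library only; it assumes the meta tag writes name before content (the common order) and only inspects the first few KB, as suggested above:

```python
import re

def has_noindex(headers, html):
    """Check the X-Robots-Tag header and the robots meta tag for 'noindex'."""
    xrt = headers.get("X-Robots-Tag", "") or headers.get("x-robots-tag", "")
    if "noindex" in xrt.lower():
        return True
    head = html[:16384]  # first ~16 KB is enough; the tag lives in <head>
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        head, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())
```

Call it on each (headers, body) pair from your throttled fetches; collect the URLs where it returns True.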
3. Find unlinked (orphan) pages
Orphans are pages that return 200 but have no inlinks from the same site. To find them:
- Crawl the site from the homepage and key entry points; collect every same-site URL that appears as a link target. Normalize (e.g. origin + pathname, one rule for trailing slash).
- Full URL set = sitemap URLs + optionally crawl-discovered URLs. Linked set = what you collected. Orphans = full set − linked set (minus homepage/entries you exclude).
Sites that render most links with JavaScript need a JS-aware crawler; otherwise the “linked” set is incomplete and you’ll see more false orphans.
Result: “These pages exist but aren’t linked from anywhere.” Decide: link, redirect, or remove.
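The set arithmetic is simple once URLs are normalized consistently. A sketch, assuming one normalization rule (scheme + host + path, trailing slash stripped); adapt the rule to your site:

```python
from urllib.parse import urlsplit

def normalize(url):
    """One comparison rule: scheme + host + path, no trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme}://{parts.netloc}{path}"

def find_orphans(full_urls, linked_urls, exclude=()):
    """Orphans = full set minus linked set minus explicit exclusions."""
    full = {normalize(u) for u in full_urls}
    linked = {normalize(u) for u in linked_urls}
    skip = {normalize(u) for u in exclude}
    return sorted(full - linked - skip)
```

For example, a sitemap containing /a/ and /old-page, with only /a linked and the homepage excluded, yields /old-page as the orphan.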
4. Compare sitemap to robots and links
Fetch all sitemaps (from robots.txt Sitemap directives and common paths like /sitemap.xml). List every URL. Then:
- Mark which sitemap URLs are disallowed by robots.txt (contradiction: “don’t crawl” vs “here’s a URL”).
- Mark which are not linked from the main site (sitemap-only / hidden from nav).
Result: “Sitemap URLs that are blocked” and “sitemap URLs that are unlinked.” Both are actionable.
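Extracting the URLs is mechanical; a sketch using the standard XML parser follows. The disallowed_prefixes list is a stand-in for the output of your robots.txt audit in step 1.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_locs(xml_text):
    """Extract every <loc> URL from a urlset (or sitemapindex) document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{{{SITEMAP_NS}}}loc")]

disallowed_prefixes = ["/admin/"]  # from your robots.txt audit (step 1)
sitemap_xml = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page</loc></url>
  <url><loc>https://example.com/admin/panel</loc></url>
</urlset>"""

locs = sitemap_locs(sitemap_xml)
blocked = [u for u in locs
           if any(urlsplit(u).path.startswith(p) for p in disallowed_prefixes)]
```

The same locs list, minus the “linked” set from your crawl, gives the sitemap-only URLs.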
5. Optional: common paths, Wayback, and in-page references
You can extend the list by:
- Common paths — Probe /admin/, /wp-admin/, /login/, /backup/, /staging/, etc. List which return 2xx or 4xx. These might be secret or legacy; verify manually.
- Wayback / archives — Query the Wayback Machine (or similar) for the domain. List historical URLs; cross-check with the live site if needed.
- In-page — Scan HTML and JS for URLs in comments or script variables. Many will be 404 or not public; treat as “worth checking.”
Result: “These might be hidden or legacy; verify manually.” Add to the “possible” bucket.
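A possible shape for the common-paths probe, sketched with urllib from the standard library; the path list and status buckets are illustrative, and the delay keeps the probe polite:

```python
import time
import urllib.error
import urllib.request

COMMON_PATHS = ["/admin/", "/wp-admin/", "/login/", "/backup/", "/staging/"]

def bucket(status):
    """Classify a probe result into an audit bucket."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "exists"
    if status in (401, 403):
        return "locked down"
    if status == 404:
        return "absent"
    return "check manually"

def probe(base, paths=COMMON_PATHS, delay=1.0):
    """HEAD each candidate path, throttled; record the status code."""
    results = {}
    for p in paths:
        req = urllib.request.Request(base.rstrip("/") + p, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                results[p] = resp.status
        except urllib.error.HTTPError as e:
            results[p] = e.code   # 4xx/5xx still tells you something
        except urllib.error.URLError:
            results[p] = None     # DNS or connection failure
        time.sleep(delay)         # throttle, as recommended above
    return results
```

Everything in the “exists” and “locked down” buckets goes into the “possible” list for manual review.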
Put it together in one report
Doing all of the above manually is time-consuming and error-prone. A practical approach is to use a single tool that:
- Parses robots.txt and expands Disallow into a list of blocked URLs (or representative URLs for patterns).
- Fetches sitemaps and flags sitemap URLs that are disallowed or unlinked.
- Samples sitemap (or discovered) URLs for noindex.
- Optionally reports unlinked URLs, common-path hits, and URLs from archives or in-page sources.
You then get one report: “hidden” (high confidence: disallowed or noindex) and “worth checking” (unlinked, wayback, in-JS, etc.), so you can fix mistakes and document what’s intentionally hidden.
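The merge itself is just set arithmetic over the four buckets; a minimal sketch, with the bucket names taken from this guide:

```python
def build_report(disallowed, noindex, orphans, possible):
    """Merge audit buckets: 'hidden' is high confidence, the rest needs review."""
    hidden = sorted(set(disallowed) | set(noindex))
    worth_checking = sorted((set(orphans) | set(possible)) - set(hidden))
    return {"hidden": hidden, "worth_checking": worth_checking}
```

A URL that is both disallowed and orphaned lands only in “hidden,” so each URL appears in exactly one bucket of the final report.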
Hidden Pages runs this kind of scan: enter a URL and get robots.txt with its disallowed URLs, sitemap-vs-robots conflicts, noindex and unlinked checks, plus optional common-path and archive-based URLs, all in one place.
What to do after you have the list
- Fix contradictions — If a URL is in your sitemap but disallowed or noindex, either allow and index it (if you want it in search) or remove it from the sitemap.
- Fix mistakes — If an important page is accidentally disallowed or noindex, update robots.txt or the page’s meta/headers.
- Clean up — Redirect or remove orphaned or legacy pages you don’t need; lock down or remove admin/backup paths that shouldn’t be public.
- Document — Keep the list for future audits and handoffs. Re-run after big changes.
Common mistakes
- Only checking robots.txt — You’ll miss noindex and unlinked pages. A full “hidden pages” audit covers disallow, noindex, and unlinked at minimum.
- Only checking noindex — You’ll miss disallowed and unlinked. Combine all signals.
- Not throttling — Requesting hundreds of URLs with no delay can get you blocked or overload the server. Use batches and short delays.
- Treating “possible” as “hidden” — Wayback and in-JS URLs are candidates. Verify before changing anything.
Frequently asked questions
Can I find hidden pages on a site I don’t own?
Yes. You can fetch robots.txt, sitemaps (if public), and crawl the public site. You’ll get disallowed URLs, sitemap conflicts, and (if you run a crawler) unlinked and noindex samples. You can’t change the site, but you can produce the audit report.
What’s the difference between “hidden” and “secret” pages?
In this guide they’re the same: pages that exist but are hard to discover (blocked, noindex, unlinked, or only in scripts/archives). Different articles use different wording; the method is the same.
How often should I run this?
After major launches, migrations, or CMS changes. For ongoing audits, quarterly or semi-annual is often enough.
Summary
Finding hidden pages on any website means combining: robots.txt (disallowed URLs, expanded from patterns and/or from sitemap), noindex checks (per-URL), unlinked/orphan discovery (crawl vs sitemap), and optionally common paths and archive/JS references. Do it manually with crawlers and scripts, or use a dedicated tool to get one report you can act on. Then fix conflicts, correct mistakes, and clean up or secure what shouldn’t be exposed. Re-run after big changes.