How to audit my website for secret pages

Auditing your website for “secret” pages means finding every URL that exists but is hard to discover: blocked by robots.txt, marked noindex, not linked from anywhere, only in the sitemap, or only in scripts/comments/archives. A practical audit follows a fixed set of checks and produces one list you can act on. This guide explains what to check, how to run each check, and how to merge the results so you can fix mistakes, document what’s intentional, and keep the site clean and consistent.

What “secret” means in this context

A page is “secret” if it’s intentionally or accidentally hard to find. For the audit we care about:

  • Blocked — Matches a robots.txt Disallow rule. Crawlers are told not to request it.
  • Noindex — The page sends noindex (meta or header), so it won’t be indexed even if crawled.
  • Unlinked (orphan) — No other page on the site links to it; it's reachable only by direct URL or via the sitemap.
  • Sitemap-only — In the sitemap but not linked from the main site (a subset of unlinked).
  • In scripts, comments, or archives — URL appears in JavaScript, HTML comments, or archives (e.g. Wayback Machine) but not as a normal crawlable link. These are "might be secret or legacy" candidates and usually need manual verification.

Your audit should cover all of these so you get one picture: what’s clearly hidden (disallow/noindex), what’s worth checking (unlinked, wayback, in-JS), and what to fix or document.

Audit step 1: List every URL disallowed by robots.txt

  • Fetch https://yoursite.com/robots.txt (or the correct origin).
  • Parse every Disallow (and Allow) for User-agent: *. Use the block that applies to general crawlers.
  • Expand each Disallow into at least one concrete URL (or a representative URL for wildcard rules). E.g. /admin/ → https://yoursite.com/admin/ with a note "covers this path and below."

Result: “These URLs (or paths) are blocked from crawlers.” This is your “disallowed” list. Keep it for the next step.
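A minimal sketch of this step in Python. It is simplified on purpose: it only reads the `User-agent: *` group, ignores wildcards, and doesn't handle groups declared with several consecutive `User-agent` lines, so treat it as a starting point rather than a full robots.txt parser.

```python
def parse_robots(text: str) -> dict:
    """Extract Disallow/Allow path rules from the `User-agent: *` group.

    Simplified: no wildcard (*, $) support, and a group declared with
    multiple consecutive User-agent lines is not merged.
    """
    rules = {"disallow": [], "allow": []}
    in_star_group = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_star_group = (value == "*")
        elif in_star_group and field == "disallow" and value:
            rules["disallow"].append(value)   # empty Disallow: means "allow all", so skip it
        elif in_star_group and field == "allow" and value:
            rules["allow"].append(value)
    return rules
```

Fetch the file once (e.g. with `urllib.request`), run it through `parse_robots`, and keep the returned lists as your "disallowed" input for step 2.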

Audit step 2: List sitemap URLs and flag conflicts

  • Fetch all sitemaps — From robots.txt Sitemap: lines and common paths (/sitemap.xml, etc.). If you have an index, fetch child sitemaps. Extract every <loc> URL. Normalize and deduplicate.
  • For each sitemap URL, test whether it matches any Disallow (and whether any Allow overrides). If it matches Disallow, mark it as “in sitemap but disallowed.”

Result: “These sitemap URLs are disallowed” (contradiction) and “these sitemap URLs are allowed.” The disallowed ones need a fix: either allow them in robots.txt or remove them from the sitemap.
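The two checks above can be sketched like this: one helper pulls `<loc>` URLs out of a sitemap, the other tests a URL against the Disallow/Allow lists using longest-rule-wins matching (Google's documented behavior; ties go to Allow). Wildcard rules are not handled here, and the example URLs are placeholders.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_locs(xml_text: str) -> list:
    """Extract every <loc> URL from a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]

def is_disallowed(url: str, disallow: list, allow: list) -> bool:
    """Longest matching rule wins; on a length tie, Allow wins.

    Prefix matching only -- wildcard rules (*, $) are not supported here.
    """
    path = urlparse(url).path or "/"
    best_dis = max((r for r in disallow if path.startswith(r)), key=len, default="")
    best_allow = max((r for r in allow if path.startswith(r)), key=len, default="")
    if not best_dis:
        return False
    return len(best_dis) > len(best_allow)
```

Running every sitemap URL through `is_disallowed` with the lists from step 1 gives you the "in sitemap but disallowed" conflicts directly.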

Audit step 3: Check which URLs are noindex

  • Take a URL set — Sitemap URLs (and optionally URLs from a crawl). For large sets, you can sample by section or priority.
  • For each URL, request the page and check X-Robots-Tag and <meta name="robots" content="...">. If either contains noindex, mark the URL as noindex. Throttle requests (batches, delays) to avoid overloading the server.

Result: “These URLs tell search engines not to index them.” Review for mistakes (wrong template, plugin default). If intentional, document; consider removing them from the sitemap for consistency.
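A sketch of the per-URL check, written so the network part stays separate: you pass in the response headers and HTML body (however you fetched them), and it reports whether either signal contains noindex. It checks only `<meta name="robots">`; crawler-specific tags like `<meta name="googlebot">` would need the same treatment.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Flags noindex found in any <meta name="robots" content="..."> tag."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        content = (a.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True

def is_noindex(headers: dict, html: str) -> bool:
    """True if the X-Robots-Tag header or the robots meta tag contains noindex."""
    header = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    if "noindex" in header.lower():
        return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```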

Audit step 4: Find unlinked pages

  • Crawl from the homepage and key entry points. Collect every same-site URL that appears as a link target. Normalize the same way as your full URL set.
  • Full URL set = sitemap URLs + optionally crawl-discovered URLs. Linked set = what you collected from the crawl. Unlinked = full set − linked set (minus homepage/entries you exclude).

Result: “These pages exist but aren’t linked from anywhere.” Decide for each: link (if important), redirect or remove (if obsolete), or leave as-is and document (e.g. thank-you pages).
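The set arithmetic above only works if both sets are normalized the same way, so a sketch of that is worth showing. The normalization rules here (lowercase scheme/host, drop fragments, strip trailing slashes except the root) are one reasonable choice, not a standard; pick rules that match how your site actually serves URLs.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip a trailing slash (except root)."""
    s = urlsplit(url)
    path = s.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    return urlunsplit((s.scheme.lower(), s.netloc.lower(), path, s.query, ""))

def find_orphans(full_set, linked_set, exclude=()):
    """Unlinked = full URL set minus crawl-discovered link targets minus known entry points."""
    full = {normalize(u) for u in full_set}
    linked = {normalize(u) for u in linked_set}
    excl = {normalize(u) for u in exclude}
    return sorted(full - linked - excl)
```

Pass your homepage and other entry points via `exclude` so they don't show up as false orphans.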

Audit step 5 (optional): Common paths, Wayback, in-page references

  • Common paths — Probe paths like /admin/, /wp-admin/, /login/, /backup/, /staging/. List which return anything other than 404 (e.g. 200, 301, 401, 403). These might be secret or legacy; verify manually.
  • Wayback / archives — Query the Wayback Machine (or similar) for your domain. List historical URLs that might still exist. Cross-check with your live site if needed.
  • In-page — Scan HTML and JS for URLs in comments or in script (e.g. API paths, old links). These are candidates, not proof; many will be 404 or not meant to be public.

Result: “These might be secret or legacy; verify manually.” Add to the “worth checking” bucket.
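For the in-page check, a simple regex pass is usually enough to build the candidate list. This is deliberately a heuristic: it catches absolute URLs plus quoted root-relative paths, it will miss URLs built by string concatenation, and it will produce false positives, which is fine for a "verify manually" bucket. The sample HTML and hosts are made up.

```python
import re

# Absolute http(s) URLs, or root-relative paths that appear right after a quote or "(".
URL_RE = re.compile(r'https?://[^\s"\'<>)]+|(?<=["\'(])/[A-Za-z0-9_\-./]+')

def candidate_urls(text: str, same_host: str = None) -> list:
    """Pull URL-like strings from HTML, JS, and comments as manual-review candidates."""
    found = set(URL_RE.findall(text))
    if same_host:
        # Keep only relative paths and URLs mentioning our host.
        found = {u for u in found if u.startswith("/") or same_host in u}
    return sorted(found)
```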

Put it in one report

Merge the results into two buckets:

  • Clearly hidden — Disallowed and/or noindex. Review for mistakes (wrong block, accidental noindex) and fix. Document what’s intentional.
  • Worth checking — Unlinked, sitemap-only, common-path hits, wayback, in-JS. Decide: link, redirect, remove, or document. Many will be false positives or intentional; the list ensures you don’t miss anything.

Document the final list (spreadsheet, internal doc) and re-run the audit after big changes (launches, migrations, CMS updates).

Common audit mistakes

  • Skipping one of the steps — If you only do robots.txt, you’ll miss noindex and unlinked. A full audit covers disallow, sitemap conflicts, noindex, and unlinked at minimum.
  • Not throttling — Requesting hundreds of URLs with no delay can get you blocked or overload the server. Use batches and delays.
  • Ignoring Allow — robots.txt Allow can override Disallow. Implement both when testing URLs.
  • Treating “possible” as “hidden” — Wayback and in-JS URLs are candidates. Verify before changing anything; many are 404 or not relevant.
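The throttling advice can be sketched as a small wrapper: split the URL list into batches and pause between them. The batch size and pause here are arbitrary defaults, and `fetch` is whatever request function you already use, passed in so the helper stays network-agnostic.

```python
import time

def in_batches(urls, size=10):
    """Yield the URL list in fixed-size batches."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

def fetch_throttled(urls, fetch, size=10, pause=2.0):
    """Call fetch(url) batch by batch, sleeping between batches to stay polite."""
    results = {}
    for batch in in_batches(urls, size):
        for url in batch:
            results[url] = fetch(url)
        time.sleep(pause)
    return results
```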

Frequently asked questions

How often should I run this audit?
After major launches, migrations, or CMS changes. For stable sites, quarterly or semi-annual is often enough.

What’s the difference between “secret” and “hidden”?
In this guide they’re the same: pages that exist but are hard to discover (blocked, noindex, unlinked, or only in scripts/archives). “Secret” and “hidden” are used interchangeably here; the method is the same either way.

Should I fix everything the audit finds?
No. Fix mistakes (wrong disallow, accidental noindex). For intentional cases (thank-you pages, admin), document them. For unlinked or “possible” pages, decide case by case: link, redirect, remove, or leave and document.

Run the audit with one tool

Doing each step manually is slow: fetch robots.txt, fetch sitemaps, request URLs for noindex, crawl for links, optionally probe common paths and archives. A single tool that runs all of this and produces “disallowed / noindex / unlinked / other to check” in one report saves time.

Hidden Pages runs this audit: enter your URL, and you get disallowed URLs, sitemap-vs-robots conflicts, noindex URLs, unlinked pages, and optional common-path and archive-based URLs in one scan. Use it for the initial audit and for periodic re-checks.

Summary

Auditing your website for secret pages means: (1) list disallowed URLs from robots.txt, (2) list sitemap URLs and flag which are disallowed, (3) check those and others for noindex, (4) find unlinked/orphan pages, (5) optionally add common paths and archive/JS URLs. Combine into one report with “clearly hidden” and “worth checking” buckets, fix mistakes, document the rest, and re-run after big changes. A dedicated scanner can run all steps and give you the list in one run.

Run a free audit →