How to audit my website's sitemap for hidden pages

Auditing your sitemap for “hidden” pages means finding every sitemap URL that is blocked, noindexed, or not linked from the site. The sitemap tells search engines “these URLs exist,” but if those URLs are also disallowed in robots.txt, return noindex, or have no inlinks, they’re effectively hidden or contradictory. This guide walks you through a structured audit: what to check, how to check it, and what to do with the results so your sitemap and crawling/indexing stay aligned.

What “hidden” means for sitemap URLs

A sitemap URL is “hidden” or problematic when:

  • Disallowed — It matches a Disallow rule in robots.txt. Crawlers are told not to request it, but the sitemap says “consider this URL.” That’s a conflict: you’re submitting URLs you’ve also blocked.
  • Noindex — The page returns noindex (meta or header). Search engines won’t index it even if they crawl it. Often intentional (thank-you pages, filter URLs); sometimes a mistake. Either way, listing it in the sitemap is inconsistent.
  • Unlinked — No other page on the site links to it. The only way to discover it is the sitemap (or direct URL). It’s “hidden” from normal navigation and gets no internal link equity.

The audit lists which sitemap URLs fall into each bucket so you can fix conflicts, correct mistakes, and decide which unlinked entries to link, redirect, or remove.

Step 1: Get every sitemap URL

  • Find sitemaps — Read robots.txt for every Sitemap: line. Also try common paths: /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml.
  • Fetch each sitemap — If it’s an index (references other sitemaps), fetch those too. Extract every <loc> URL.
  • Normalize — Use one canonical form (e.g. origin + pathname, no fragment, one rule for trailing slash). Deduplicate. This is your “sitemap URL set.”

Without this list, you can’t audit. Keep it as your source of truth for “what we’ve declared to search engines.”
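The collection and normalization above can be sketched with Python’s standard library. This is a minimal sketch, not a production fetcher: it parses a sitemap (or sitemap index) you’ve already downloaded as a string, and the normalization rule shown (lowercase host, strip trailing slash, drop query and fragment) is one possible convention, not the only valid one.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

def extract_locs(sitemap_xml: str) -> list[str]:
    """Extract every <loc> URL from a sitemap or sitemap index.

    Sitemap files use the sitemaps.org namespace, so tags look like
    '{http://www.sitemaps.org/schemas/sitemap/0.9}loc'; matching on the
    tag suffix handles both namespaced and plain files. In a sitemap
    index, each <loc> points at a child sitemap to fetch and parse too.
    """
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]

def normalize(url: str) -> str:
    """One canonical form: lowercase origin + path, no query/fragment,
    no trailing slash (except the bare root path)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme}://{parts.netloc.lower()}{path}"
```

Applying `normalize` to every extracted URL and putting the results in a `set` gives you the deduplicated “sitemap URL set” the rest of the audit works from.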

Step 2: Mark which sitemap URLs are disallowed

  • Parse robots.txt — For User-agent: * (or the crawler you care about), collect every Disallow and Allow. Implement matching: a URL is disallowed if it matches a Disallow and isn’t overridden by an Allow.
  • Test each sitemap URL — For each URL in your sitemap set, check whether it matches any Disallow. If yes (and not overridden by Allow), mark it as “disallowed.”

Result: “These sitemap URLs are blocked by robots.txt.” Fix by either allowing them in robots.txt (if you want them crawlable) or removing them from the sitemap. Keeping them in the sitemap while disallowing them is confusing for crawlers and for you.
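One way to avoid hand-rolling the matching logic is Python’s built-in `urllib.robotparser`. A caveat worth labeling: the stdlib parser applies rules in file order (first match wins), which differs from Google’s longest-match precedence for Allow vs. Disallow, so results can diverge on robots.txt files that rely on that precedence.

```python
from urllib.robotparser import RobotFileParser

def disallowed_urls(robots_txt: str, urls: list[str],
                    agent: str = "*") -> list[str]:
    """Return the sitemap URLs that robots.txt blocks for the given agent.

    Note: RobotFileParser evaluates rules in file order, not by
    longest-match, so place Allow exceptions before the broader
    Disallow they carve out (as in the example below).
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if not rp.can_fetch(agent, u)]
```

For example, with `Allow: /private/open` listed before `Disallow: /private/`, the audit flags `/private/a` as disallowed but leaves `/private/open` and `/public` unflagged.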

Step 3: Check which sitemap URLs are noindex

  • Request each URL (or a representative sample if the sitemap is huge) and check the response: X-Robots-Tag header and <meta name="robots" content="..."> in the HTML. If either contains noindex, mark the URL as noindex.
  • Throttle — Use small batches and short delays so you don’t overload the server.

Result: “These sitemap URLs tell search engines not to index them.” If that’s wrong (e.g. template mistake), remove noindex. If it’s right (e.g. thank-you page), consider removing them from the sitemap so you’re not submitting URLs you don’t want indexed.
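The noindex check itself is simple once you have a response in hand. The sketch below takes headers and HTML you’ve already fetched; the regex-based meta check is deliberately naive (a real crawler would use an HTML parser), but it illustrates that both the `X-Robots-Tag` header and the robots meta tag must be inspected.

```python
import re

def is_noindex(headers: dict[str, str], html: str) -> bool:
    """True if the X-Robots-Tag header or a robots meta tag says noindex."""
    # Header check: header names are case-insensitive.
    xrt = next((v for k, v in headers.items()
                if k.lower() == "x-robots-tag"), "")
    if "noindex" in xrt.lower():
        return True
    # Naive check for <meta name="robots" content="... noindex ...">;
    # assumes the name attribute appears before content.
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.I)
    return bool(meta and "noindex" in meta.group(1).lower())
```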

Step 4: Find which sitemap URLs are unlinked

  • Crawl the site — From the homepage and key entry points, collect every same-site URL that appears as a link target. Normalize the same way as the sitemap set.
  • Unlinked sitemap URLs = sitemap set − linked set. Optionally exclude the homepage and known entry points.

Result: “These sitemap URLs aren’t linked from anywhere.” Decide for each: add a link (if the page should be discoverable), redirect (if obsolete), or remove from the sitemap (if you’re cleaning up). Unlinked doesn’t mean “bad,” but it’s worth reviewing.
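Link collection during the crawl can be sketched as below. This is a simplified extractor for pages you’ve already fetched: it uses a naive `href` regex rather than an HTML parser, resolves relative links against the page URL, keeps only same-host targets, and drops fragments. The unlinked set is then just a set difference against the normalized sitemap set.

```python
import re
from urllib.parse import urljoin, urlsplit

def same_site_links(html: str, base_url: str) -> set[str]:
    """Collect absolute same-site link targets from a page's href values."""
    host = urlsplit(base_url).netloc
    links = set()
    for href in re.findall(r'href=["\']([^"\']+)["\']', html, re.I):
        absolute = urljoin(base_url, href.split("#")[0])  # drop fragment
        if urlsplit(absolute).netloc == host:
            links.add(absolute)
    return links

# Unlinked = everything declared in the sitemap minus everything
# actually reachable via internal links:
#   unlinked = sitemap_set - linked_set
```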

Step 5: Combine into one report and act

Merge the results into clear buckets:

  • In sitemap but disallowed — Fix robots.txt or sitemap so they’re consistent.
  • In sitemap but noindex — Fix if accidental; else consider dropping from sitemap.
  • In sitemap but unlinked — Link, redirect, or remove from sitemap as appropriate.

Re-run the audit when you change sitemaps, robots.txt, or site structure.
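Merging the three checks into one report is a few set operations. The function below is a sketch with hypothetical parameter names; it assumes all inputs were normalized the same way, and a single URL can legitimately land in more than one bucket.

```python
def build_report(sitemap: set[str], disallowed: set[str],
                 noindex: set[str], linked: set[str]) -> dict[str, set[str]]:
    """Bucket sitemap URLs by issue; one URL can appear in several buckets."""
    return {
        "disallowed": sitemap & disallowed,  # in sitemap but blocked
        "noindex":    sitemap & noindex,     # in sitemap but noindexed
        "unlinked":   sitemap - linked,      # in sitemap but no inlinks
    }
```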

Common mistakes when auditing the sitemap

  • Only checking one signal — If you only check disallow, you’ll miss noindex and unlinked. A full audit covers all three.
  • Skipping Allow in robots.txt — Some sites use Allow to carve out exceptions. If you only look at Disallow, you’ll misclassify URLs.
  • Not normalizing — Sitemap and crawl may list the same page under different forms (trailing slash, www). Normalize so you don’t double-count or miss matches.
  • Checking too few URLs for noindex — If the sitemap is large, at least sample across sections so you don’t miss systematic noindex (e.g. a whole section with a wrong template).
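Sampling across sections can be as simple as bucketing by the first path segment and drawing a few URLs from each bucket, so a systematically noindexed section can’t slip through entirely. This is one plausible scheme, not a standard: the “section” heuristic (first path segment) and the per-section quota are assumptions you’d tune for your site.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit

def sample_per_section(urls: list[str], per_section: int = 5,
                       seed: int = 0) -> list[str]:
    """Sample up to N URLs per top-level path section (e.g. /blog/, /shop/)."""
    sections = defaultdict(list)
    for u in urls:
        first = urlsplit(u).path.strip("/").split("/")[0]
        sections[first or "/"].append(u)  # bare root goes in its own bucket
    rng = random.Random(seed)  # fixed seed keeps re-runs comparable
    sample = []
    for bucket in sections.values():
        sample.extend(rng.sample(bucket, min(per_section, len(bucket))))
    return sample
```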

Frequently asked questions

Should I remove noindex pages from my sitemap?
Yes, for consistency. Submitting a URL in the sitemap while sending noindex is contradictory. Either you want it considered for indexing (remove noindex) or you don’t (remove from sitemap).

What if a sitemap URL is disallowed and noindex?
Fix both for clarity: either allow and remove noindex (if you want it indexed) or remove from sitemap and keep disallow/noindex (if you don’t). The goal is one clear intent per URL.

How often should I audit the sitemap?
After any change to sitemaps, robots.txt, or major site structure. For stable sites, a quarterly or semi-annual audit is often enough.

What about sitemap index files?
Fetch every child sitemap referenced in the index and include all <loc> URLs from those. The “sitemap URL set” should include every URL from every child sitemap.

Run the audit with one tool

Doing each step manually (fetching sitemaps, parsing robots.txt, requesting URLs for noindex, crawling for links) is time-consuming. A tool that does all of this and reports “sitemap URLs that are disallowed / noindex / unlinked” gives you the same audit in one run.

Hidden Pages does this: enter your site, and the scan shows which sitemap URLs are disallowed, which return noindex, and which aren’t linked from the main site—so you can audit your sitemap for hidden pages in one place.

Summary

To audit your website’s sitemap for hidden pages: (1) collect every sitemap URL from all sitemaps, (2) mark which are disallowed by robots.txt, (3) check which return noindex, (4) find which are unlinked. Combine into one report and fix conflicts (disallow vs sitemap), mistakes (wrong noindex), and unlinked entries (link, redirect, or remove). Re-run after changes. A dedicated scanner can produce this report in one scan.

Audit your sitemap →