
How to find noindex pages on your site

To find noindex pages on your site, you have to request each URL and check the response. Noindex is sent per-URL in HTML (<meta name="robots" content="noindex">) or HTTP headers (X-Robots-Tag: noindex). There is no single file or sitemap that lists them—you need to crawl or sample URLs and inspect each response. This guide walks you through a reliable method, common pitfalls, and how to act on the results so you can fix mistakes and keep intentional noindex under control.

Why finding noindex pages matters

Noindex tells search engines not to index that URL. It’s the right choice for thank-you pages, duplicate or filter views, internal tools, and staging. Problems start when:

  • A page you want indexed has noindex — Often from a template, plugin, or CMS default. Until you list noindex URLs, you won’t spot it.
  • You have no inventory — Large or inherited sites can have hundreds of noindex URLs. Audits and handoffs need a concrete list.
  • Sitemap and noindex conflict — URLs in the sitemap that also send noindex send a mixed signal; listing them helps you clean the sitemap or fix the page.

Knowing exactly which URLs return noindex lets you correct errors, document what’s intentional, and keep sitemap and indexing in sync.

How noindex is sent (and where to look)

Search engines respect noindex in two places. You must check both.

1. HTML meta tag

In the <head> of the page:

<meta name="robots" content="noindex">

Variants you’ll see:

  • content="noindex" — Don’t index this page.
  • content="noindex, nofollow" — Don’t index and don’t follow links.
  • content="noindex, follow" — Don’t index but do follow links.

The tag can also appear with content before name; parsers should handle both orders. Check the first few KB of HTML (many noindex tags are near the top). If the page is huge or slow, you can stop after finding the first <meta name="robots" or after a reasonable byte limit.
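As a sketch, both attribute orders can be matched with a tolerant, case-insensitive pattern. This is an illustration only—a production crawler should prefer a real HTML parser, since attributes allow arbitrary whitespace and quoting—and the function name `html_has_noindex` is invented for this example:

```python
import re

# Match <meta ... name="robots" ...> with attributes in any order.
# [^>]* before and after name= tolerates content= on either side.
META_ROBOTS = re.compile(
    r'<meta\s+[^>]*name=["\']robots["\'][^>]*>',
    re.IGNORECASE,
)
CONTENT_ATTR = re.compile(r'content=["\']([^"\']*)["\']', re.IGNORECASE)

def html_has_noindex(html: str) -> bool:
    """Return True if any <meta name="robots"> tag contains 'noindex'."""
    for tag in META_ROBOTS.findall(html):
        m = CONTENT_ATTR.search(tag)
        if m and "noindex" in m.group(1).lower():
            return True
    return False
```

Feeding it only the first chunk of the response (rather than the whole page) implements the byte-limit shortcut described above.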

2. HTTP header

The server can send:

X-Robots-Tag: noindex

(or noindex, nofollow, etc.). This applies to the URL you requested. You get it from the response headers, so a HEAD request is enough for this part—no need to download the full HTML if the header is present and says noindex.

Important: If the header is missing or doesn’t include noindex, you still need to fetch the body and check the meta tag. Many sites use only the meta tag.

Where to get the list of URLs to check

You need a candidate set of URLs. There’s no “list of noindex URLs” anywhere; you build the list of “URLs that might be noindex” and then test each one.

Use your sitemap (best starting point)

Your sitemap is the set of URLs you’ve declared to search engines. It’s the right place to start:

  1. Find sitemap URLs from robots.txt (Sitemap: lines) or common paths like /sitemap.xml, /sitemap_index.xml.
  2. Fetch each sitemap. If it’s an index, fetch the child sitemaps.
  3. Extract every <loc> URL. Normalize (e.g. origin + pathname, no fragment) and deduplicate.

Every URL in that set is worth checking for noindex. If a sitemap URL returns noindex, you have a conflict to fix (either remove noindex or remove it from the sitemap).
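Steps 2 and 3 above can be sketched with the standard library. The names `extract_locs` and `normalize` are invented for this example; note that sitemap indexes use the same `<loc>` element as regular sitemaps, so one extraction function covers both:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Sitemaps (and sitemap indexes) use this XML namespace for <loc>.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_locs(sitemap_xml: str) -> list[str]:
    """Extract every <loc> URL from a sitemap or sitemap index."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter(f"{NS}loc") if el.text]

def normalize(url: str) -> str:
    """Reduce a URL to origin + path: drop query and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path or '/'}"
```

Running every extracted URL through `normalize` and collecting the results in a set gives you the deduplicated candidate list.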

Add URLs from a crawl (for broader coverage)

Crawling from the homepage and key entry points gives you URLs that might not be in the sitemap (e.g. old or forgotten pages). Combine sitemap + crawl so you don’t miss noindex on unlinked pages. Sitemap URLs that aren’t linked are especially good candidates—they’re often noindex or orphaned.

Step-by-step: request each URL and detect noindex

1. Prefer HEAD first for X-Robots-Tag

For each URL, send a HEAD request and read the X-Robots-Tag header. If it’s present and contains noindex, mark the URL as noindex and you’re done for that URL. No need to fetch the body.

2. If no header or no noindex, fetch the page and check meta

If HEAD didn’t return a noindex header (or the server doesn’t support HEAD), send a GET and scan the start of the response. Look for:

  • <meta name="robots" content="...">
  • <meta content="..." name="robots">

If the content value includes the substring noindex, the page is noindex. Match case-insensitively: NOINDEX and noindex are equivalent.

3. Throttle and handle errors

  • Rate limiting — Send requests in small batches (e.g. 5 at a time) with a short delay between batches so you don’t overload the server or get blocked.
  • Timeouts — Set a timeout per request (e.g. 10 seconds). If a URL times out or returns 5xx, you can retry once or mark it as “unknown” and revisit later.
  • Redirects — Follow redirects and check the final URL’s response; that’s what search engines use for noindex.
  • 4xx — If the URL returns 404 or 410, it’s not a “noindex page” in the indexing sense; you can skip or list it separately for cleanup.
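The batching and error handling above can be sketched like this. `scan_in_batches` is an invented name; plug in any per-URL check callable, and note that failures are downgraded to "unknown" rather than retried, as a simplification:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def _safe(check, url):
    """Run the check, recording any failure as 'unknown' to revisit later."""
    try:
        return check(url)
    except Exception:
        return "unknown"

def scan_in_batches(urls, check, batch_size=5, delay=1.0):
    """Run check(url) over URLs in small concurrent batches, pausing
    between batches so the target server isn't overloaded."""
    results = {}
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for i in range(0, len(urls), batch_size):
            batch = urls[i : i + batch_size]
            for url, outcome in zip(batch, pool.map(_safe, [check] * len(batch), batch)):
                results[url] = outcome
            if i + batch_size < len(urls):
                time.sleep(delay)  # be polite between batches
    return results
```

The per-request timeout from the previous step lives inside the check callable itself, so a slow URL stalls only its own slot in the batch.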

Common mistakes when finding noindex pages

  • Only checking the meta tag — You’ll miss pages that use only X-Robots-Tag. Always check headers first, then HTML.
  • Only checking robots.txt — robots.txt controls crawling (Disallow), not indexing. Noindex is per-URL and lives on the page or in the response headers.
  • Checking too few URLs — If you only spot-check, you’ll miss many noindex pages. Use the full sitemap (and optionally crawl) as your candidate set.
  • Ignoring redirects — The noindex signal is on the final URL. If /old redirects to /new, check /new’s response.
  • Not normalizing URLs — Use a consistent scheme (e.g. one canonical form per URL) so you don’t check the same page twice under different URLs.

What to do after you have the list

  • Fix mistakes — If a page should be indexed, remove noindex (meta and/or header) and ensure it’s not blocked in robots.txt if you want it crawled.
  • Leave intentional noindex — Thank-you pages, filter views, and internal tools can stay noindex; document them so the next person knows.
  • Clean up — If a noindex page is obsolete, redirect it or remove it and drop it from the sitemap.
  • Re-run — After big launches or CMS changes, run the check again to catch new noindex.

Frequently asked questions

Does noindex stop crawlers from requesting the page?
No. Noindex only says “don’t index this URL.” Crawlers can still request it (unless robots.txt Disallow applies). To reduce crawling, remove internal links to the URL and drop it from the sitemap in addition to the noindex.

Should I remove noindex pages from my sitemap?
It’s cleaner to remove them. Submitting a URL in the sitemap while sending noindex is contradictory. Either you want it considered for indexing (then don’t use noindex) or you don’t (then don’t put it in the sitemap).

Can noindex be applied to a whole section via robots.txt?
No. robots.txt has Disallow (crawl) and Allow, but no “noindex” directive. Noindex is always per-URL (meta or header).

What if the same page has both X-Robots-Tag and meta noindex?
Both are valid. If either says noindex, search engines treat the page as noindex. Checking both ensures you don’t miss one.

Use a tool when you need the list at scale

Doing this for dozens or hundreds of URLs is tedious: scripting fetches, parsing sitemaps, throttling, and handling both header and meta. A tool that fetches your sitemap, samples or crawls URLs, and checks each response for noindex gives you a ready-made list.

Hidden Pages does this: enter your site URL, and the scan reports which sitemap (and discovered) URLs return noindex, alongside disallowed and unlinked URLs, so you get one audit without writing code.

Summary

Finding noindex pages on your site requires requesting each URL and inspecting the response: first the X-Robots-Tag header (HEAD is enough), then the <meta name="robots" content="..."> tag in the HTML if needed. Get your candidate URLs from the sitemap (and optionally a crawl), throttle requests, and avoid the common mistakes (only checking meta, only checking robots.txt, or checking too few URLs). Use the list to fix mistakes, document intentional noindex, and keep the sitemap consistent. For a ready-made list at scale, use a dedicated scanner.

Run a free scan →