How to find orphaned pages on your website

Orphaned pages are URLs that return a normal page (e.g. 200) but are not linked from any other page on your site. Crawlers and users won’t find them by following links. To find them you compare two sets: “every URL we know about” (e.g. from a sitemap or full crawl) and “every URL that is linked from the site.” URLs in the first set but not in the second are orphan candidates. This guide explains how to build those sets, avoid common pitfalls, and what to do with the list so you can clean up or properly link important content.

Why orphaned pages matter

  • SEO — Google can still index orphans if it discovers them (e.g. via sitemap or old external links), but they get no internal link equity. They’re easy to miss in audits and can waste crawl budget or create duplicate/thin content issues.
  • Maintenance — Orphans are often legacy, duplicate, or test pages. Finding them lets you redirect, consolidate, or remove.
  • Handoffs and audits — A list of “pages we have but don’t link to” makes site reviews and due diligence clearer.

What “orphan” means (and what it doesn’t)

An orphan here means: the URL exists (returns 2xx), but no other page on the same site links to it. So:

  • In scope: Pages in your sitemap or discovered by crawl that never appear as the target of an <a href="..."> (or equivalent) on any other page on your site.
  • Out of scope: Pages that don’t exist (404), or pages that are linked from somewhere on the site (those aren’t orphans). Also, “orphan” in some tools means “no external links”; here we mean no internal links.

You need two sets: a full URL set (“all known URLs”) and a linked URL set (“every URL that appears as a link target”). Orphans = full set − linked set.
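The whole technique is a set difference. A minimal sketch with hypothetical URLs:

```python
# Orphan candidates = everything we know about, minus everything that is linked.
known = {
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/old-promo",  # in the sitemap, but linked from nowhere
}
linked = {
    "https://example.com/",
    "https://example.com/about",
}

orphans = known - linked
print(orphans)  # {'https://example.com/old-promo'}
```

Everything that follows is about building those two sets accurately so the subtraction is meaningful.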

Where to get the full URL set

You need as complete a list as possible of “URLs that exist” on your site.

Sitemap (best base)

  • Fetch robots.txt and read every Sitemap: line. Also try /sitemap.xml, /sitemap_index.xml, etc.
  • Fetch each sitemap. If it’s an index, fetch child sitemaps. Extract every <loc> URL.
  • Normalize (e.g. origin + pathname, no fragment, one rule for trailing slash). Deduplicate. This is your “all known URLs” set.

Sitemap URLs are the ones you’ve declared to search engines, so they’re the most important to check for orphan status.
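Extracting the `<loc>` URLs is straightforward with the standard library. This sketch parses a sitemap document from a string (the fetching step is assumed to happen elsewhere); note that sitemap XML lives in its own namespace, which the parser must account for:

```python
import xml.etree.ElementTree as ET

# Sitemap files declare this namespace; ElementTree needs it spelled out.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> URL from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{NS}loc") if loc.text]

# Hypothetical sitemap for illustration.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_locs(sitemap))  # ['https://example.com/', 'https://example.com/about']
```

If the root element is `<sitemapindex>` rather than `<urlset>`, each `<loc>` is a child sitemap: fetch it and run the same extraction on the result.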

Optional: add from a crawl

If you run a crawler that discovers URLs by following links, you can add those to the full set. That helps catch pages that exist but aren’t in the sitemap (e.g. old URLs, forgotten sections). Without a crawl, your full set is only the sitemap, so you’ll only find “sitemap URLs that aren’t linked”; that’s still very useful.

How to build the linked URL set

You need every same-site URL that appears as a link target on pages you’ve crawled.

Crawl from entry points

  • Start at the homepage and other important entry URLs (e.g. main section indexes, key landing pages).
  • For each HTML page, parse all links: <a href="..."> and any other attributes you use (e.g. data-href). Resolve relative URLs to absolute using the page’s URL. Keep only same-site URLs (same scheme and host; some tools also treat www vs non-www as same site).
  • Add each linked URL to a set. Normalize the same way as the full set (same trailing-slash rule, no fragment).
  • Crawl in waves (e.g. breadth-first): take all links from the first batch of pages, add new URLs to the queue, fetch those pages, repeat. Stop when you’re not discovering new same-site URLs or when you hit a limit.
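The crawl loop above can be sketched with the standard library. To keep the logic testable, `fetch` here is a stand-in that reads pages from a dict; in a real crawler it would perform an HTTP GET (and respect robots.txt, rate limits, etc.):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkParser(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def crawl(start, fetch, same_host):
    """Breadth-first crawl from `start`; return every same-site URL seen as a link target."""
    linked, seen, queue = set(), {start}, deque([start])
    while queue:
        page = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(page))
        for href in parser.hrefs:
            url = urljoin(page, href).split("#")[0]  # resolve relative URL, drop fragment
            if urlsplit(url).hostname != same_host:
                continue  # only same-site links count
            linked.add(url)
            if url not in seen:  # crawl each discovered page once
                seen.add(url)
                queue.append(url)
    return linked

# Hypothetical three-page site: /hidden exists but is never linked.
PAGES = {
    "https://example.com/": '<a href="/about">About</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/hidden": "",
}
fetch = lambda url: PAGES.get(url, "")
print(crawl("https://example.com/", fetch, "example.com"))
```

Subtracting this result from the full set would correctly flag `/hidden` as an orphan candidate.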

Normalization is critical

If your full set uses https://example.com/page and your crawl records https://example.com/page/, they’re the same page but different strings. Pick one canonical form (e.g. strip trailing slash, or always add it) and apply it to both sets so the subtraction is correct. Also strip fragments (#section), and decide whether query strings matter — whether ?x=1 and ?x=2 count as the same URL — then apply that rule consistently everywhere.
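One possible normalization, assuming you treat trailing-slash variants as the same page and ignore fragments and query strings (adjust the rules to match your site, then apply them to both sets):

```python
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """Canonicalize a URL: scheme + host + path, no trailing slash, no query/fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"  # /page/ -> /page, but keep the root as /
    return f"{parts.scheme}://{parts.netloc}{path}"  # query and fragment are dropped

print(normalize("https://example.com/page/"))         # https://example.com/page
print(normalize("https://example.com/page#section"))  # https://example.com/page
print(normalize("https://example.com/"))              # https://example.com/
```

If query strings distinguish real pages on your site (e.g. ?id=42), keep them in the canonical form instead of dropping them; the important thing is that both sets use the same rule.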

Edge cases and common mistakes

JavaScript-rendered links — If most of your links are injected by JavaScript, a simple HTML-only crawler won’t see them. Your “linked” set will be too small and you’ll get many false orphans. Use a crawler that executes JavaScript, or accept that your list is “URLs not linked in raw HTML” (which can still be useful for static or mostly-static sites).

Same page under multiple URLs — Redirects, ?utm_ parameters, or trailing-slash variants can make the same content appear under several URLs. Normalize so you don’t count the same page twice. If /old redirects to /new, only /new is in the crawl; that’s correct—you care about “what’s linked,” and the final URL is what’s linked.

Homepage and entry points — The homepage often isn’t “linked” in the same way (e.g. it’s the start URL). Exclude it (and any other entry points you don’t expect to be linked) from the orphan list so you don’t flag them by mistake.

External links — Only same-site links count. If another site links to your page, the page can still be an internal orphan (no link from your own site). That’s what we’re measuring.

Step-by-step summary

  1. Full URL set — Sitemap(s) + optional crawl; normalize and deduplicate.
  2. Linked URL set — Crawl from homepage and key entries; extract every same-site link target; normalize the same way.
  3. Orphans — Full set − linked set. Optionally remove homepage and known entry URLs.
  4. Act — Redirect or remove obsolete orphans; add links to important ones; document intentional orphans (e.g. thank-you pages).
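Steps 1–3 above can be tied together in a few lines. Here `sitemap_urls` and `linked_urls` stand in for the outputs of the sitemap parser and the crawler, and `normalize` is whatever canonical form you chose:

```python
def find_orphans(sitemap_urls, linked_urls, entry_points, normalize):
    """Full set minus linked set, with entry points excluded."""
    full = {normalize(u) for u in sitemap_urls}
    linked = {normalize(u) for u in linked_urls}
    entries = {normalize(u) for u in entry_points}
    return full - linked - entries

# Hypothetical inputs; a simple trailing-slash rule serves as the normalizer.
orphans = find_orphans(
    sitemap_urls=["https://example.com/", "https://example.com/a",
                  "https://example.com/old/"],
    linked_urls=["https://example.com/a"],
    entry_points=["https://example.com/"],
    normalize=lambda u: u.rstrip("/"),
)
print(orphans)  # {'https://example.com/old'}
```

The output is the list to act on in step 4.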

Re-run after major content or structure changes.

Frequently asked questions

Are orphan pages bad for SEO?
They’re not inherently “bad,” but they get no internal link equity and are discoverable only via sitemap or external links. If they’re important, link to them. If they’re duplicate or thin, redirect or remove them.

Will Google index orphan pages?
Yes, if it discovers them (e.g. via sitemap or old backlinks). Orphan doesn’t mean “not indexed”; it means “not linked from your site.”

How often should I check for orphans?
After big launches, CMS changes, or migrations. For large sites, a quarterly or semi-annual run is often enough.

What’s the difference between “orphan” and “unlinked”?
In this guide they’re the same: a page that exists but isn’t linked from anywhere on your site. Some tools use “unlinked from nav” or “not linked from crawl” to mean the same thing.

Use a tool to get the list without building a crawler

Building a crawler, normalizing URLs, and subtracting sets is doable but time-consuming. A tool that crawls your site, merges in sitemap URLs, and reports “unlinked from nav” (or “orphan”) URLs gives you the list without scripting.

Hidden Pages does this: enter your site, run a scan, and get a report that includes URLs that appear in the sitemap or crawl but aren’t linked from the main site—so you can fix or document orphans in one place.

Summary

Orphaned pages are URLs that exist but aren’t linked from anywhere on your site. Find them by building a full URL set (sitemap + optional crawl) and a linked URL set (from a crawl of your site), then taking the difference. Normalize URLs consistently, handle JS-rendered links if needed, and exclude entry points from the orphan list. Use the list to redirect, link, or document. For a ready-made list, use a scanner that reports unlinked pages.

Run a free scan →