Course → Module 3: Why Most Websites Are Structurally Invisible
Session 3 of 7

Before Google can evaluate your content, rank your pages, or recognize your entity, it must be able to reach your website. Crawlability is the absolute foundation. If Googlebot cannot access your pages, nothing else matters. Your structured data, your content, and your internal links are all invisible if the crawler is blocked at the door.

How Googlebot Crawls

Googlebot follows a specific sequence when visiting your site. Understanding this sequence reveals where crawlability failures typically occur.

```mermaid
graph TD
    A["Googlebot discovers URL<br/>(sitemap, link, or direct)"] --> B["Check robots.txt<br/>Is crawling allowed?"]
    B -->|Blocked| X["Page never crawled"]
    B -->|Allowed| C["Fetch HTML"]
    C --> D["Parse HTML<br/>Extract links, text"]
    D --> E["Enter render queue<br/>(for JavaScript)"]
    E --> F["Render with headless Chrome"]
    F --> G["Extract additional content<br/>from rendered page"]
    G --> H["Decide: index or discard?"]
    style X fill:#2a2a28,stroke:#c47a5a,color:#ede9e3
    style A fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style B fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style H fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
```

Common Crawlability Failures

| Failure | What Happens | How to Diagnose |
| --- | --- | --- |
| robots.txt blocks important pages | Googlebot obeys the block and never crawls the page | Check robots.txt at yourdomain.com/robots.txt |
| JavaScript-rendered content | HTML is empty; content loads via JS after page load | View page source (not Inspect Element) and check for content |
| Broken or missing sitemap | Googlebot has no roadmap to your pages | Check yourdomain.com/sitemap.xml |
| Noindex tags on important pages | Page is crawled but explicitly excluded from the index | Check source for meta robots noindex |
| Server errors (5xx responses) | Googlebot gets an error instead of content | Check GSC Coverage report for server errors |
| Slow server response | Googlebot times out or reduces crawl rate | Check server response time in GSC or PageSpeed Insights |
| Redirect chains (3+ hops) | Googlebot may abandon the crawl mid-chain | Use Screaming Frog or similar crawler to detect chains |
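The redirect-chain failure in particular is easy to reason about with a little code. The sketch below is a minimal illustration, not a real crawler: it assumes you have already collected redirect data (for example, with a crawler like Screaming Frog) into a mapping from each URL to its redirect target, and the 3-hop threshold mirrors the rule of thumb in the table above.

```python
# Sketch: flag redirect chains of 3+ hops from already-collected crawl data.
# `redirects` maps a URL to the URL it 301/302-redirects to; URLs absent
# from the map are final destinations. (Illustrative helper, not Googlebot's
# actual abandonment logic.)

def redirect_chain(start_url: str, redirects: dict[str, str]) -> list[str]:
    """Follow redirects from start_url and return the full chain of URLs."""
    chain = [start_url]
    seen = {start_url}
    while chain[-1] in redirects:
        nxt = redirects[chain[-1]]
        if nxt in seen:  # redirect loop: stop rather than spin forever
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain

def has_long_chain(start_url: str, redirects: dict[str, str],
                   max_hops: int = 2) -> bool:
    """True if the chain exceeds max_hops (default flags 3+ hops)."""
    hops = len(redirect_chain(start_url, redirects)) - 1
    return hops > max_hops
```

For example, with `{"/old": "/older", "/older": "/oldest", "/oldest": "/final"}`, the chain from `/old` has three hops and gets flagged, while `/oldest` redirects just once and passes.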

The robots.txt Problem

The robots.txt file is a plain text file at the root of your domain that tells crawlers what they can and cannot access. It is powerful and often misused: a single misplaced rule can block your entire site from being crawled.

Common robots.txt mistakes include blocking CSS and JavaScript files (which prevents Google from rendering your pages), blocking entire directories that contain important content, and using overly broad wildcard rules that catch pages you intended to be crawlable.

One critical distinction: robots.txt blocks crawling, not indexing. If Google finds a link to a blocked page from another source, it may still index the URL without crawling the content. The result is a thin, contentless index entry. This is worse than not being indexed at all.

robots.txt controls access, not visibility. Blocking a page from crawling does not remove it from Google. It removes Google's ability to see what is on the page.
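You can test how a given set of robots.txt rules behaves before deploying it, using Python's standard-library parser. The sketch below is a minimal illustration; the rules and URLs are hypothetical, and the `/assets/` rule demonstrates the CSS/JavaScript mistake described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt containing a common mistake: a broad Disallow
# that also blocks the JS/CSS Google needs to render pages.
rules = """\
User-agent: *
Disallow: /assets/
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse() accepts the file's lines directly

# can_fetch() answers: may this user agent crawl this URL?
print(parser.can_fetch("*", "https://example.com/blog/post"))      # allowed
print(parser.can_fetch("*", "https://example.com/assets/app.js"))  # blocked: breaks rendering
```

Remember that this only tells you whether crawling is permitted; as noted above, a blocked URL can still end up in the index as a thin, contentless entry.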

The JavaScript Rendering Gap

Modern websites built with frameworks like React, Angular, or Vue.js often render content entirely in the browser using JavaScript. When Googlebot first fetches the page, it receives an empty HTML shell. The actual content is only visible after JavaScript execution.

Google does render JavaScript, but it happens in a second pass. The initial HTML fetch extracts links and basic content. JavaScript rendering happens later, sometimes days later, in a separate queue. This delay means JavaScript-dependent content is discovered and indexed more slowly. For content that changes frequently, this delay can mean Google's version of your page is always outdated.

AI crawlers from platforms like Perplexity, OpenAI, and Anthropic generally do not execute JavaScript at all. Content that depends on client-side rendering is completely invisible to these crawlers.
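A quick way to approximate the "view source" check is to strip tags from the raw, unrendered HTML and measure how much visible text remains. The sketch below uses only the standard library; the 500-character threshold is an arbitrary assumption for illustration, not a Google rule.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text outside <script>/<style>: roughly what a non-rendering crawler sees."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    p = VisibleTextExtractor()
    p.feed(html)
    return " ".join(c for c in p.chunks if c)

def looks_like_empty_shell(html: str, min_chars: int = 500) -> bool:
    # Assumption: under ~500 chars of visible text suggests client-side rendering.
    return len(visible_text(html)) < min_chars

# A typical React-style shell: an empty mount point plus a script tag.
shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_like_empty_shell(shell))  # True: no server-rendered content
```

Run this against the raw HTML your server returns (curl, or "view page source"), not against the DOM in your browser's inspector, which has already executed JavaScript.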

Sitemaps as Crawl Insurance

An XML sitemap is a list of every URL on your site that you want Google to crawl. It does not guarantee crawling or indexing, but it ensures Google knows about every page. For sites with complex architecture, JavaScript-rendered content, or frequent updates, a sitemap is essential.

The sitemap should be referenced in your robots.txt file, submitted through Google Search Console, and kept up to date automatically. A stale sitemap that lists deleted pages or omits new ones is worse than no sitemap, because it wastes crawl budget on dead URLs.
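Staleness can be checked programmatically. The sketch below is a minimal illustration assuming the standard sitemaps.org XML format; the sample sitemap and the 30-day threshold are hypothetical choices, not a Google requirement.

```python
from datetime import date
from xml.etree import ElementTree

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def stale_urls(sitemap_xml: str, today: date, max_age_days: int = 30) -> list[str]:
    """Return <loc> values whose <lastmod> is more than max_age_days old."""
    root = ElementTree.fromstring(sitemap_xml)
    stale = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Only the date portion of lastmod is compared; entries without
        # a lastmod are skipped rather than flagged.
        if lastmod and (today - date.fromisoformat(lastmod[:10])).days > max_age_days:
            stale.append(loc)
    return stale

# Hypothetical two-entry sitemap:
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-06-01</lastmod></url>
  <url><loc>https://example.com/blog/</loc><lastmod>2024-01-01</lastmod></url>
</urlset>"""

print(stale_urls(sample, today=date(2024, 6, 10)))  # only the January entry is stale
```

A stale lastmod is not proof of a problem (some pages legitimately never change), but it is a cheap first filter before checking those URLs individually.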

Diagnosing Your Crawlability

Google Search Console provides direct diagnostics. The URL Inspection tool lets you test specific pages: is the URL crawlable? Is it indexed? When was it last crawled? What did Google see when it crawled?

For a site-wide view, the Pages report shows how many URLs are indexed, how many are excluded, and the specific reason for each exclusion. Common exclusion reasons like "Crawled, currently not indexed" and "Blocked by robots.txt" point directly to crawlability problems.

Assignment

Run your website through Google Search Console's URL Inspection tool for five key pages: your homepage, About page, Contact page, one service page, and one blog post. For each page, document:

  1. Is it indexed?
  2. Is it crawlable (no robots.txt blocks)?
  3. When was it last crawled?
  4. Does the rendered HTML contain all your visible content, or is content missing?

Any "not indexed" results are invisible pages. Any pages last crawled more than 30 days ago have a freshness problem.