XML Sitemaps and Robots.txt
Session 7.6 · ~5 min read
An XML sitemap is a file that lists the URLs on your site that you want search engines to discover and index. A robots.txt file tells search engines which URLs they may or may not crawl. Together, these two files form the communication layer between your website and search engine crawlers.
For entity authority, the strategic question is not just "does my sitemap exist?" It is "does my sitemap prioritize my entity pages?" Most sitemaps are auto-generated by CMS tools and treat all pages equally. A well-configured sitemap explicitly signals to Google which pages carry your most important entity information.
XML Sitemap Fundamentals
An XML sitemap is an XML file, typically located at https://yourdomain.com/sitemap.xml, that lists URLs along with optional metadata like last modification date, change frequency, and priority.
Google uses sitemaps as a discovery mechanism. If a page is in your sitemap, Google knows it exists and will add it to the crawl queue. Sitemaps do not guarantee indexation (that depends on page quality), but they do ensure discovery.
| Sitemap Element | What It Does | Entity Authority Strategy |
|---|---|---|
| `<loc>` | The URL of the page | Include all entity-critical pages. Exclude admin, duplicate, and thin pages. |
| `<lastmod>` | Date the page was last modified | Keep accurate. Google uses this to decide crawl priority. |
| `<changefreq>` | How often the page changes (daily, weekly, monthly) | Google largely ignores this. Set it honestly but do not rely on it. |
| `<priority>` | Relative importance (0.0 to 1.0) | Google largely ignores this too. Set entity pages to 0.8-1.0 and other pages lower. |
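Putting the four elements together, a sitemap with a single entry looks like this (the domain and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/about/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
  </url>
</urlset>
```

Only `<loc>` is required; the other three elements are optional metadata.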
Sitemap Structure for Entity Pages
Rather than a flat list of every URL on your site, organize your sitemap to highlight your entity-critical pages. If your site is large, use a sitemap index file that references multiple sub-sitemaps.
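A sitemap index is itself a small XML file that points at the sub-sitemaps. A sketch, using the sub-sitemap filenames from the structure described below (the filenames and domain are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-entity.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-content.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-media.xml</loc>
  </sitemap>
</sitemapindex>
```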
```mermaid
flowchart TD
    A["sitemap.xml<br>(Sitemap Index)"] --> B["sitemap-entity.xml<br>Entity-Critical Pages"]
    A --> C["sitemap-content.xml<br>Blog Posts, Articles"]
    A --> D["sitemap-media.xml<br>Images, Videos"]
    B --> B1["Homepage<br>priority: 1.0"]
    B --> B2["About Page<br>priority: 0.9"]
    B --> B3["Contact Page<br>priority: 0.8"]
    B --> B4["Service Pages<br>priority: 0.8"]
    B --> B5["Team / Founder Page<br>priority: 0.8"]
    C --> C1["Recent articles<br>priority: 0.6"]
    C --> C2["Older articles<br>priority: 0.4"]
    D --> D1["Image sitemap entries"]
    D --> D2["Video sitemap entries"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#6b8f71,color:#ede9e3
    style B1 fill:#222221,stroke:#6b8f71,color:#ede9e3
    style B2 fill:#222221,stroke:#6b8f71,color:#ede9e3
```
By separating entity-critical pages into their own sub-sitemap, you make it easy to see at a glance whether your most important pages are included and up to date. It also helps you audit entity page changes over time.
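If you generate sitemaps from your own tooling rather than a CMS plugin, the entity sub-sitemap can be built with a few lines of standard-library Python. A minimal sketch (the `build_sitemap` helper, page list, and domain are all illustrative, not a real API):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a <urlset> sitemap string from (url, lastmod, priority) tuples."""
    # Register the sitemap namespace as the default so elements
    # serialize without a prefix (<loc>, not <ns0:loc>).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, lastmod, priority in entries:
        node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(node, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
        ET.SubElement(node, f"{{{SITEMAP_NS}}}priority").text = str(priority)
    return ET.tostring(urlset, encoding="unicode")


# Entity-critical pages go in their own sub-sitemap with high priorities.
entity_pages = [
    ("https://yourdomain.com/", "2024-01-15", 1.0),
    ("https://yourdomain.com/about/", "2024-01-10", 0.9),
]
xml_out = build_sitemap(entity_pages)
```

Regenerating `sitemap-entity.xml` this way keeps `lastmod` values under your control instead of your CMS's.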
robots.txt Configuration
Your robots.txt file should work in tandem with your sitemap. It tells search engines where to find your sitemap and which parts of your site to avoid crawling.
A well-configured robots.txt for entity authority:
```
# robots.txt for entity-focused site
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /search/
Disallow: /tag/
Disallow: /?s=

# Reference the sitemap
Sitemap: https://yourdomain.com/sitemap.xml
```
Notice what is allowed and what is blocked. All entity-critical pages (homepage, about, contact, services) are in the Allow scope. Admin pages, internal search results, and tag pages (which often create duplicate content) are blocked.
| robots.txt Rule | Purpose | Entity Impact |
|---|---|---|
| `Allow: /` | Allows crawling of everything by default | Ensures entity pages are accessible |
| `Disallow: /admin/` | Blocks the admin area from crawling | None. Admin pages are not entity content. |
| `Disallow: /search/` | Blocks internal search results pages | Positive. Prevents Google from crawling thin, duplicate search result pages. |
| `Disallow: /tag/` | Blocks tag archive pages | Positive. Tag pages are often thin and duplicate. |
| `Sitemap: [URL]` | Points Google to your sitemap | High. Ensures Google discovers your sitemap. |
Key concept: Your robots.txt and sitemap work as a pair. The sitemap says "here are the pages I want you to index." The robots.txt says "here are the areas you should avoid." Together, they guide Google's crawl budget toward your entity-critical pages.
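One way to sanity-check that your Disallow rules do not accidentally block entity-critical pages is Python's standard-library `urllib.robotparser`. A sketch (the page list and domain are placeholders; the rules mirror the robots.txt above):

```python
from urllib.robotparser import RobotFileParser

# Disallow rules are listed first because Python's parser applies the
# first matching rule; Googlebot itself uses the most specific match,
# so rule order does not matter in your real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /search/
Disallow: /tag/
Allow: /
"""

ENTITY_PAGES = ["/", "/about/", "/contact/", "/services/"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ENTITY_PAGES:
    ok = parser.can_fetch("Googlebot", f"https://yourdomain.com{path}")
    print(f"{path}: {'crawlable' if ok else 'BLOCKED'}")
```

Run this after every robots.txt change; a single mistyped Disallow prefix can silently block an entire entity section.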
Submitting Your Sitemap to Google
There are three ways to tell Google about your sitemap:
- robots.txt reference: Add a `Sitemap:` line to your robots.txt (shown above). Google checks robots.txt regularly and will discover the sitemap.
- Google Search Console submission: In GSC, go to Sitemaps, enter your sitemap URL, and click Submit. This is the most direct method.
- Ping: Send a GET request to `https://www.google.com/ping?sitemap=https://yourdomain.com/sitemap.xml`. Note that Google deprecated this ping endpoint in 2023, so prefer the first two methods for new workflows.
After submission, Google Search Console will show you the sitemap's processing status, including how many URLs were submitted, how many were indexed, and any errors.
About 68% of websites have an XML sitemap, but only 41% have submitted it to Google Search Console. If you submit your sitemap and verify it in GSC, you are already ahead of the majority.
Common Sitemap Mistakes
- Including noindex pages: If a page has a noindex tag, do not include it in your sitemap. Conflicting signals confuse Google.
- Stale lastmod dates: If you set lastmod to today's date every time the sitemap regenerates (without actual page changes), Google learns to ignore your lastmod data.
- Too many URLs: A sitemap can contain up to 50,000 URLs, but including thousands of low-quality pages dilutes the signal. Be selective.
- Non-canonical URLs: Only include the canonical version of each URL in your sitemap. Do not include URLs that redirect or have canonical tags pointing elsewhere.
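Several of these mistakes can be caught automatically before submission. A small audit sketch using standard-library XML parsing (the `audit_sitemap` helper and sample URLs are illustrative):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000  # protocol limit per sitemap file


def audit_sitemap(xml_text):
    """Return (url_count, list of <loc> values missing a <lastmod>)."""
    root = ET.fromstring(xml_text)
    urls = root.findall("sm:url", NS)
    missing_lastmod = [
        u.findtext("sm:loc", default="?", namespaces=NS)
        for u in urls
        if u.find("sm:lastmod", NS) is None
    ]
    if len(urls) > MAX_URLS:
        print(f"Over the {MAX_URLS}-URL limit; split into a sitemap index.")
    return len(urls), missing_lastmod


sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/about/</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://yourdomain.com/contact/</loc></url>
</urlset>"""

count, missing = audit_sitemap(sample)
```

Checking for noindex tags and non-canonical URLs requires fetching each page, but URL counts and missing `lastmod` values are cheap to verify on every regeneration.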
Further Reading
- Google. "Build and Submit a Sitemap." Google Search Central. developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap
- Google. "Robots.txt Specifications." Google Search Central. developers.google.com/search/docs/crawling-indexing/robots/robots_txt
- Sitemaps.org. "XML Sitemap Protocol." sitemaps.org/protocol.html
- Mueller, John. "Sitemaps and Indexing." Google Search Central YouTube. youtube.com/googlewebmasters
Assignment
- Check if your site has an XML sitemap at yourdomain.com/sitemap.xml. If not, create one (most CMS tools have sitemap plugins, or you can generate one manually).
- Review the URLs in your sitemap. Verify that all entity-critical pages (homepage, about, contact, services) are included. Remove any noindex pages or redirecting URLs.
- Check your robots.txt file. Verify it references your sitemap with a `Sitemap:` directive and that no entity-critical pages are blocked by Disallow rules.
- Submit your sitemap to Google Search Console (Sitemaps section). Record the number of submitted URLs and the number of indexed URLs.
- If there is a gap between submitted and indexed URLs, use the URL Inspection Tool to investigate why specific pages are not indexed. Apply the fixes from Session 7.2.