Google's Entity Understanding Pipeline
Session 2.1 · ~5 min read
Google does not recognize entities in a single step. It processes entity information through a pipeline with distinct stages. At each stage, data can pass through or get dropped. Understanding this pipeline reveals exactly where most businesses fail and what to fix.
The Five-Stage Pipeline
```mermaid
flowchart LR
    A["1. Crawl
Discover pages"] --> B["2. Parse
Read HTML + schema"]
    B --> C["3. Extract
Identify entity mentions"]
    C --> D["4. Reconcile
Match to known entities"]
    D --> E["5. Store
Add to Knowledge Graph"]
    style A fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style B fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style C fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
    style D fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
    style E fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
```
Each stage has specific requirements. Failure at any stage means the downstream stages never receive your data.
| Stage | What Happens | Common Failure Point | How to Fix |
|---|---|---|---|
| 1. Crawl | Googlebot discovers and fetches your pages | Blocked by robots.txt, no internal links to key pages, JavaScript-only content | Ensure crawlability, fix robots.txt, add internal links |
| 2. Parse | Google reads HTML, renders JS, reads structured data | No JSON-LD schema, broken HTML, critical content behind JS | Add schema.org markup, validate HTML |
| 3. Extract | Google identifies entity mentions and attributes | Entity info buried in paragraphs, no structured declaration | Use explicit schema properties, clear About page |
| 4. Reconcile | Google matches mentions to known entities | Inconsistent NAP, no sameAs links, ambiguous name | Standardize NAP, build sameAs chain |
| 5. Store | Confirmed facts enter the Knowledge Graph | Insufficient confidence, unverifiable claims | Build corroboration: citations, GBP, Wikidata |
Stage 1: Crawl
Googlebot must find your pages before anything else can happen. It discovers URLs through links: internal links on your site, links from other sites, your XML sitemap, and previously known URLs.
Crawl failures are foundational. If Googlebot cannot reach your About page, it cannot read your Organization schema. If it cannot reach your author bio pages, it cannot build Person entities. Common crawl blockers include: overly restrictive robots.txt, orphan pages with no internal links, and single-page JavaScript applications that load content dynamically.
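A first-pass crawlability check can be scripted with Python's standard-library robot parser. The rules and paths below are hypothetical; in practice, point the parser at your site's real robots.txt URL:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; in practice use
# rp.set_url("https://yoursite.com/robots.txt"); rp.read()
robots_txt = """\
User-agent: *
Disallow: /cart/
Disallow: /about
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Key entity pages to verify (illustrative paths)
for path in ["/about", "/blog/entity-seo", "/cart/checkout"]:
    allowed = rp.can_fetch("Googlebot", f"https://example.com{path}")
    print(f"{path}: {'crawlable' if allowed else 'BLOCKED'}")
```

Here the About page, the page carrying Organization schema, is blocked, so everything downstream of Stage 1 silently fails for it.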
Stage 2: Parse
After fetching a page, Google parses the HTML. It reads the text content, follows internal and external links, and, critically, processes any structured data in JSON-LD format.
This is where structured data enters the pipeline. A JSON-LD block in your page's head section is parsed directly into structured entity data. Without it, Google must infer entity information from unstructured text, a process that is slower, less reliable, and less complete.
Structured data is not a nice-to-have. It is how entity information enters the pipeline cleanly. Without it, you rely on Google's inference, which often produces nothing.
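You can verify that your JSON-LD actually parses the way this stage requires with a few lines of standard-library Python. The page source below is a made-up example; feed in your own fetched HTML:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects and parses <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # A malformed block raises here -- exactly the failure Google hits
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf = []
            self._in_jsonld = False

# Hypothetical page source; in practice, feed in your page's real HTML
html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization",
 "name": "PT Arsindo Perkasa", "url": "https://example.com"}
</script>
</head><body>About us...</body></html>"""

parser = JSONLDExtractor()
parser.feed(html)
print(parser.blocks)  # one parsed Organization object
```

If this extractor finds zero blocks on your key pages, Stage 2 has nothing structured to pass along.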
Stage 3: Extract
Google's Natural Language Processing systems identify entity mentions in your content. Named Entity Recognition (NER) detects company names, person names, locations, and products in your text. But NER is imperfect, especially for lesser-known entities.
Structured data bypasses NER limitations. When you declare {"@type": "Organization", "name": "PT Arsindo Perkasa"} in JSON-LD, Google does not need to figure out that "PT Arsindo Perkasa" is a company name. You have told it directly.
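A slightly fuller declaration might look like the sketch below. Every value is a placeholder; substitute your real legal name, canonical URL, and profile links:

```python
import json

# Illustrative Organization declaration -- all values are placeholders
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "PT Arsindo Perkasa",
    "url": "https://example.com",
    "sameAs": [
        "https://www.linkedin.com/company/example",
        "https://g.page/example",
    ],
}

# Serialize for embedding inside <script type="application/ld+json">
print(json.dumps(org, indent=2))
```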
Stage 4: Reconcile
This is where most entity infrastructure efforts succeed or fail. Google has extracted an entity mention from your site. Now it must determine: is this entity already in the Knowledge Graph? If so, it updates the existing entry. If not, it must decide whether to create a new entry or discard the data.
Reconciliation relies on matching signals: Does the name match a known entity? Does the address match? Does the URL match? Do the sameAs links connect to known profiles? The more matching signals, the higher the reconciliation confidence.
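Google's actual reconciliation scoring is not public. As a mental model only, it can be sketched as a counter over matching signals; the field names, weights, and data below are invented for illustration:

```python
# Toy reconciliation scorer: counts matching signals between an extracted
# entity mention and a candidate Knowledge Graph entry. Weights are invented.
def reconciliation_score(mention: dict, candidate: dict) -> int:
    score = 0
    if mention.get("name", "").lower() == candidate.get("name", "").lower():
        score += 2  # name match is the strongest single signal
    if mention.get("url") and mention.get("url") == candidate.get("url"):
        score += 2
    if mention.get("address") and mention.get("address") == candidate.get("address"):
        score += 1
    # each shared sameAs profile link independently corroborates the match
    shared = set(mention.get("sameAs", [])) & set(candidate.get("sameAs", []))
    score += len(shared)
    return score

mention = {"name": "PT Arsindo Perkasa", "url": "https://example.com",
           "sameAs": ["https://www.linkedin.com/company/example"]}
candidate = {"name": "pt arsindo perkasa", "url": "https://example.com",
             "sameAs": ["https://www.linkedin.com/company/example"]}
print(reconciliation_score(mention, candidate))  # 5: name + url + one sameAs
```

The practical takeaway matches the table above: every consistent signal you publish adds to the match, and every inconsistency removes one.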
Stage 5: Store
When confidence is high enough, Google stores the entity data in the Knowledge Graph. This triggers visible effects: Knowledge Panels may appear, AI Overviews may reference your entity, and your content may receive entity-authority ranking benefits.
Storage is not permanent: stored facts must be continually corroborated. If your GBP is suspended, your citations become inconsistent, or your schema is removed, the stored entity can weaken or drop out of the graph.
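One corroboration check you can automate is NAP consistency across your citations. The sketch below uses deliberately crude whitespace-and-case normalization and made-up listing data; real matching needs more care:

```python
# Toy NAP (Name, Address, Phone) consistency check across citation sources.
def normalize(nap: dict) -> tuple:
    """Lowercase and collapse whitespace so trivial differences don't flag."""
    return tuple(" ".join(nap[k].lower().split())
                 for k in ("name", "address", "phone"))

# Illustrative listings -- replace with values scraped from your real citations
citations = {
    "website":   {"name": "PT Arsindo Perkasa", "address": "Jl. Example No. 1",
                  "phone": "+62 21 555 0100"},
    "gbp":       {"name": "PT Arsindo Perkasa", "address": "Jl. Example No. 1",
                  "phone": "+62 21 555 0100"},
    "directory": {"name": "Arsindo Perkasa",    "address": "Jl. Example No. 1",
                  "phone": "+62 21 555 0100"},
}

baseline = normalize(citations["website"])
for source, nap in citations.items():
    status = "consistent" if normalize(nap) == baseline else "MISMATCH"
    print(f"{source}: {status}")
```

In this made-up example the directory drops the legal prefix "PT", exactly the kind of name inconsistency that erodes reconciliation confidence over time.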
The chart illustrates the compounding loss. A typical SMB loses most of its entity data at the Parse stage (no structured data) and Reconcile stage (no consistent external signals). An entity-optimized site retains most data through all stages.
Further Reading
- In-Depth Guide to How Google Search Works - Google's official documentation on the crawl-index-serve pipeline
- How Google Identifies Entities from Unstructured Content - Technical analysis of Google's entity extraction systems
- Google Knowledge Graph Reconciliation - How Google matches entity mentions to Knowledge Graph entries
Assignment
Trace the pipeline for your company website. (1) Is your site crawlable? Check robots.txt. (2) Does your HTML have structured data? View source and search for "application/ld+json." (3) Are entity mentions consistent? Check your company name on different pages. (4) Do external sources corroborate your entity? Check GBP and three directories. Write down where the pipeline breaks for you.