Google's Entity Understanding Pipeline
Session 2.1 · ~5 min read
Google does not recognize entities in a single step. It processes entity information through a pipeline with distinct stages. At each stage, data can pass through or get dropped. Understanding this pipeline reveals exactly where most businesses fail and what to fix.
The Five-Stage Pipeline
```mermaid
flowchart LR
    A["1. Crawl
Discover pages"] --> B["2. Parse
Read HTML + schema"]
    B --> C["3. Extract
Identify entity mentions"]
    C --> D["4. Reconcile
Match to known entities"]
    D --> E["5. Store
Add to Knowledge Graph"]
    style A fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style B fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style C fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
    style D fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
    style E fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
```
Each stage has specific requirements. Failure at any stage means the downstream stages never receive your data.
| Stage | What Happens | Common Failure Point | How to Fix |
|---|---|---|---|
| 1. Crawl | Googlebot discovers and fetches your pages | Blocked by robots.txt, no internal links to key pages, JavaScript-only content | Ensure crawlability, fix robots.txt, add internal links |
| 2. Parse | Google reads HTML, renders JS, reads structured data | No JSON-LD schema, broken HTML, critical content behind JS | Add schema.org markup, validate HTML |
| 3. Extract | Google identifies entity mentions and attributes | Entity info buried in paragraphs, no structured declaration | Use explicit schema properties, clear About page |
| 4. Reconcile | Google matches mentions to known entities | Inconsistent NAP, no sameAs links, ambiguous name | Standardize NAP, build sameAs chain |
| 5. Store | Confirmed facts enter the Knowledge Graph | Insufficient confidence, unverifiable claims | Build corroboration: citations, GBP, Wikidata |
Stage 1: Crawl
Googlebot must find your pages before anything else can happen. It discovers URLs through links: internal links on your site, links from other sites, your XML sitemap, and previously known URLs.
Crawl failures are foundational. If Googlebot cannot reach your About page, it cannot read your Organization schema. If it cannot reach your author bio pages, it cannot build Person entities. Common crawl blockers include: overly restrictive robots.txt, orphan pages with no internal links, and single-page JavaScript applications that load content dynamically.
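A first-pass crawlability check can be scripted with Python's standard-library robot parser. The rules and paths below are hypothetical; in practice, point the parser at your site's real robots.txt URL:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; in practice use
# rp.set_url("https://yoursite.com/robots.txt"); rp.read()
robots_txt = """\
User-agent: *
Disallow: /cart/
Disallow: /about
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Key entity pages to verify (illustrative paths)
for path in ["/about", "/blog/entity-seo", "/cart/checkout"]:
    allowed = rp.can_fetch("Googlebot", f"https://example.com{path}")
    print(f"{path}: {'crawlable' if allowed else 'BLOCKED'}")
```

Here the About page, the page carrying Organization schema, is blocked, so everything downstream of Stage 1 silently fails for it.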
Stage 2: Parse
After fetching a page, Google parses the HTML. It reads the text content, follows internal and external links, and, critically, processes any structured data in JSON-LD format.
This is where structured data enters the pipeline. A JSON-LD block in your page's head section is parsed directly into structured entity data. Without it, Google must infer entity information from unstructured text, a process that is slower, less reliable, and less complete.
Structured data is not a nice-to-have. It is how entity information enters the pipeline cleanly. Without it, you rely on Google's inference, which often produces nothing.
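You can verify that your JSON-LD actually parses the way this stage requires with a few lines of standard-library Python. The page source below is a made-up example; feed in your own fetched HTML:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects and parses <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # A malformed block raises here -- exactly the failure Google hits
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf = []
            self._in_jsonld = False

# Hypothetical page source; in practice, feed in your page's real HTML
html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization",
 "name": "PT Arsindo Perkasa", "url": "https://example.com"}
</script>
</head><body>About us...</body></html>"""

parser = JSONLDExtractor()
parser.feed(html)
print(parser.blocks)  # one parsed Organization object
```

If this extractor finds zero blocks on your key pages, Stage 2 has nothing structured to pass along.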
Stage 3: Extract
Google's Natural Language Processing systems identify entity mentions in your content. Named Entity Recognition (NER) detects company names, person names, locations, and products in your text. But NER is imperfect, especially for lesser-known entities.
Structured data bypasses NER limitations. When you declare {"@type": "Organization", "name": "PT Arsindo Perkasa"} in JSON-LD, Google does not need to figure out that "PT Arsindo Perkasa" is a company name. You have told it directly.
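A slightly fuller declaration might look like the sketch below. Every value is a placeholder; substitute your real legal name, canonical URL, and profile links:

```python
import json

# Illustrative Organization declaration -- all values are placeholders
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "PT Arsindo Perkasa",
    "url": "https://example.com",
    "sameAs": [
        "https://www.linkedin.com/company/example",
        "https://g.page/example",
    ],
}

# Serialize for embedding inside <script type="application/ld+json">
print(json.dumps(org, indent=2))
```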
Stage 4: Reconcile
This is where most entity infrastructure efforts succeed or fail. Google has extracted an entity mention from your site. Now it must determine: is this entity already in the Knowledge Graph? If so, it updates the existing entry. If not, it must decide whether to create a new entry or discard the data.
Reconciliation relies on matching signals: Does the name match a known entity? Does the address match? Does the URL match? Do the sameAs links connect to known profiles? The more matching signals, the higher the reconciliation confidence.
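Google's actual reconciliation scoring is not public. As a mental model only, it can be sketched as a counter over matching signals; the field names, weights, and data below are invented for illustration:

```python
# Toy reconciliation scorer: counts matching signals between an extracted
# entity mention and a candidate Knowledge Graph entry. Weights are invented.
def reconciliation_score(mention: dict, candidate: dict) -> int:
    score = 0
    if mention.get("name", "").lower() == candidate.get("name", "").lower():
        score += 2  # name match is the strongest single signal
    if mention.get("url") and mention.get("url") == candidate.get("url"):
        score += 2
    if mention.get("address") and mention.get("address") == candidate.get("address"):
        score += 1
    # each shared sameAs profile link independently corroborates the match
    shared = set(mention.get("sameAs", [])) & set(candidate.get("sameAs", []))
    score += len(shared)
    return score

mention = {"name": "PT Arsindo Perkasa", "url": "https://example.com",
           "sameAs": ["https://www.linkedin.com/company/example"]}
candidate = {"name": "pt arsindo perkasa", "url": "https://example.com",
             "sameAs": ["https://www.linkedin.com/company/example"]}
print(reconciliation_score(mention, candidate))  # 5: name + url + one sameAs
```

The practical takeaway matches the table above: every consistent signal you publish adds to the match, and every inconsistency removes one.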
Stage 5: Store
When confidence is high enough, Google stores the entity data in the Knowledge Graph. This triggers visible effects: Knowledge Panels may appear, AI Overviews may reference your entity, and your content may receive entity-authority ranking benefits.
Storage is not permanent: stored facts must be continually corroborated. If your GBP is suspended, your citations become inconsistent, or your schema is removed, the stored entity can weaken or drop out of the graph.
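One corroboration check you can automate is NAP consistency across your citations. The sketch below uses deliberately crude whitespace-and-case normalization and made-up listing data; real matching needs more care:

```python
# Toy NAP (Name, Address, Phone) consistency check across citation sources.
def normalize(nap: dict) -> tuple:
    """Lowercase and collapse whitespace so trivial differences don't flag."""
    return tuple(" ".join(nap[k].lower().split())
                 for k in ("name", "address", "phone"))

# Illustrative listings -- replace with values scraped from your real citations
citations = {
    "website":   {"name": "PT Arsindo Perkasa", "address": "Jl. Example No. 1",
                  "phone": "+62 21 555 0100"},
    "gbp":       {"name": "PT Arsindo Perkasa", "address": "Jl. Example No. 1",
                  "phone": "+62 21 555 0100"},
    "directory": {"name": "Arsindo Perkasa",    "address": "Jl. Example No. 1",
                  "phone": "+62 21 555 0100"},
}

baseline = normalize(citations["website"])
for source, nap in citations.items():
    status = "consistent" if normalize(nap) == baseline else "MISMATCH"
    print(f"{source}: {status}")
```

In this made-up example the directory drops the legal prefix "PT", exactly the kind of name inconsistency that erodes reconciliation confidence over time.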
The chart illustrates the compounding loss. A typical SMB loses most of its entity data at the Parse stage (no structured data) and Reconcile stage (no consistent external signals). An entity-optimized site retains most data through all stages.
Further Reading
- In-Depth Guide to How Google Search Works - Google's official documentation on the crawl-index-serve pipeline
- How Google Identifies Entities from Unstructured Content - Technical analysis of Google's entity extraction systems
- Google Knowledge Graph Reconciliation - How Google matches entity mentions to Knowledge Graph entries
Assignment
Trace the pipeline for your company website. (1) Is your site crawlable? Check robots.txt. (2) Does your HTML have structured data? View source and search for "application/ld+json." (3) Are entity mentions consistent? Check your company name on different pages. (4) Do external sources corroborate your entity? Check GBP and three directories. Write down where the pipeline breaks for you.