Course → Module 2: How Google Recognizes Companies
Session 1 of 7

Google does not recognize entities in a single step. It processes entity information through a pipeline with distinct stages. At each stage, data can pass through or get dropped. Understanding this pipeline reveals exactly where most businesses fail and what to fix.

The Five-Stage Pipeline

1. Crawl (discover pages) → 2. Parse (read HTML + schema) → 3. Extract (identify entity mentions) → 4. Reconcile (match to known entities) → 5. Store (add to Knowledge Graph)

Each stage has specific requirements. Failure at any stage means the downstream stages never receive your data.

Stage | What Happens | Common Failure Point | How to Fix
--- | --- | --- | ---
1. Crawl | Googlebot discovers and fetches your pages | Blocked by robots.txt; no internal links to key pages; JavaScript-only content | Ensure crawlability, fix robots.txt, add internal links
2. Parse | Google reads HTML, renders JS, reads structured data | No JSON-LD schema; broken HTML; critical content behind JS | Add schema.org markup, validate HTML
3. Extract | Google identifies entity mentions and attributes | Entity info buried in paragraphs; no structured declaration | Use explicit schema properties, clear About page
4. Reconcile | Google matches mentions to known entities | Inconsistent NAP; no sameAs links; ambiguous name | Standardize NAP, build sameAs chain
5. Store | Confirmed facts enter the Knowledge Graph | Insufficient confidence; unverifiable claims | Build corroboration: citations, GBP, Wikidata

Stage 1: Crawl

Googlebot must find your pages before anything else can happen. It discovers URLs through links: internal links on your site, links from other sites, your XML sitemap, and previously known URLs.

Crawl failures are foundational. If Googlebot cannot reach your About page, it cannot read your Organization schema. If it cannot reach your author bio pages, it cannot build Person entities. Common crawl blockers include: overly restrictive robots.txt, orphan pages with no internal links, and single-page JavaScript applications that load content dynamically.

Stage 2: Parse

After fetching a page, Google parses the HTML. It reads the text content, follows internal and external links, and, critically, processes any structured data in JSON-LD format.

This is where structured data enters the pipeline. A JSON-LD block in your page's head section is parsed directly into structured entity data. Without it, Google must infer entity information from unstructured text, a process that is slower, less reliable, and less complete.
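A minimal sketch of such a block, placed inside a script tag with type "application/ld+json". The company name and URLs here are placeholders, not a prescribed template:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Widgets Ltd",
  "url": "https://www.example.com",
  "logo": "https://www.example.com/logo.png"
}
```

Even this small declaration gives the parser unambiguous entity data: a type, a canonical name, and a canonical URL, with no inference required.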

Structured data is not a nice-to-have. It is how entity information enters the pipeline cleanly. Without it, you rely on Google's inference, which often produces nothing.

Stage 3: Extract

Google's Natural Language Processing systems identify entity mentions in your content. Named Entity Recognition (NER) detects company names, person names, locations, and products in your text. But NER is imperfect, especially for lesser-known entities.

Structured data bypasses NER limitations. When you declare {"@type": "Organization", "name": "PT Arsindo Perkasa"} in JSON-LD, Google does not need to figure out that "PT Arsindo Perkasa" is a company name. You have told it directly.
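The same principle applies to people. A sketch of an explicit Person declaration for an author bio page, reusing the course's example company (the person's name, title, and URL are illustrative placeholders):

```json
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Smith",
  "jobTitle": "Founder",
  "worksFor": {
    "@type": "Organization",
    "name": "PT Arsindo Perkasa"
  },
  "url": "https://www.example.com/about/jane-smith"
}
```

The nested worksFor property declares the person-to-company relationship directly, so Google does not need to infer it from prose.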

Stage 4: Reconcile

This is where most entity infrastructure efforts succeed or fail. Google has extracted an entity mention from your site. Now it must determine: is this entity already in the Knowledge Graph? If so, it updates the existing entry. If not, it must decide whether to create a new entry or discard the data.

Reconciliation relies on matching signals: Does the name match a known entity? Does the address match? Does the URL match? Do the sameAs links connect to known profiles? The more matching signals, the higher the reconciliation confidence.
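A sketch of markup that stacks these matching signals in one place: consistent NAP fields plus a sameAs chain pointing to external profiles. All names, addresses, and profile URLs below are placeholders; your sameAs targets should be the profiles that actually exist for your business:

```json
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "Example Widgets Ltd",
  "url": "https://www.example.com",
  "telephone": "+1-555-0123",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "123 Example St",
    "addressLocality": "Springfield",
    "addressRegion": "IL",
    "postalCode": "62701",
    "addressCountry": "US"
  },
  "sameAs": [
    "https://www.linkedin.com/company/example-widgets",
    "https://www.facebook.com/examplewidgets"
  ]
}
```

The name, telephone, and address values here should match your GBP listing and directory citations character for character; each mismatch lowers reconciliation confidence.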

Stage 5: Store

When confidence is high enough, Google stores the entity data in the Knowledge Graph. This triggers visible effects: Knowledge Panels may appear, AI Overviews may reference your entity, and your content may receive entity-authority ranking benefits.

Storage is not permanent: stored entity data must be continually corroborated. If your GBP is suspended, your citations drift into inconsistency, or your schema is removed, the stored entity can weaken or drop out of the graph.

The loss compounds from stage to stage. A typical SMB sheds most of its entity data at the Parse stage (no structured data) and the Reconcile stage (no consistent external signals). An entity-optimized site retains most data through all five stages.

Assignment

Trace the pipeline for your company website.

1. Is your site crawlable? Check robots.txt.
2. Does your HTML contain structured data? View source and search for "application/ld+json".
3. Are entity mentions consistent? Check your company name on different pages.
4. Do external sources corroborate your entity? Check GBP and three directories.

Write down where the pipeline breaks for you.