Course → Module 9: Multi-Agent Workflows
Session 6 of 7

Agents Fail. Plan for It.

Agents hallucinate. They ignore instructions. They produce output in the wrong format. They exceed token limits. They return empty responses. They invent facts that sound plausible. These are not edge cases. They are normal operating conditions for any AI system.

Your pipeline needs error handling: automated checks between each agent that catch failures before they propagate downstream. A bad research brief that reaches the Writing Agent produces a bad draft. A bad draft that reaches the Editing Agent produces meaningless review scores. Errors compound. Catch them early.

Three Layers of Checks

```mermaid
flowchart TD
    A["Agent Output"] --> B{"Layer 1: Format Check"}
    B -- Pass --> C{"Layer 2: Content Check"}
    B -- Fail --> D["Retry with format error message"]
    C -- Pass --> E{"Layer 3: Quality Check"}
    C -- Fail --> F["Retry with content error message"]
    E -- Pass --> G["Forward to Next Agent"]
    E -- Fail --> H["Halt for Human Review"]
    D -- "Max retries" --> H
    F -- "Max retries" --> H
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#6b8f71,color:#ede9e3
    style C fill:#222221,stroke:#8a8478,color:#ede9e3
    style D fill:#222221,stroke:#c47a5a,color:#ede9e3
    style E fill:#222221,stroke:#c8a882,color:#ede9e3
    style F fill:#222221,stroke:#c47a5a,color:#ede9e3
    style G fill:#222221,stroke:#6b8f71,color:#ede9e3
    style H fill:#222221,stroke:#c47a5a,color:#ede9e3
```

Layer 1: Format Checks

Does the output match the expected format? For JSON output: is it parseable? Does it contain the required keys? Are the data types correct? For Markdown output: does it contain the expected headings? Is it within the word count range?

Format checks are fully automatable. A Python script with json.loads() validates JSON in milliseconds. A regex counts headings. A word count function checks length. No AI needed.

| Agent | Format Check | Implementation |
| --- | --- | --- |
| Research Agent | Valid JSON, all required keys present, sources array non-empty | Python: `json.loads()` + key existence check |
| Writing Agent | Markdown with expected H2 headings, word count 800-1200 | Python: regex heading count + word count |
| Editing Agent | All 5 dimensions scored (0-10), verdict present (PASS/REWORK/FAIL) | Python: parse scores + check verdict field |
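As a concrete illustration, here is a minimal sketch of the Research Agent's format check. The key names (`topic`, `key_findings`, `sources`) are assumptions standing in for whatever schema your own research brief uses:

```python
import json

# Assumed schema for the research brief -- substitute your own required keys.
REQUIRED_KEYS = {"topic", "key_findings", "sources"}

def check_research_format(raw: str) -> list[str]:
    """Return a list of format errors; an empty list means the check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    errors = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if not data.get("sources"):
        errors.append("sources array is empty")
    return errors
```

Returning a list of error strings (rather than raising) makes it easy to feed every failure back into the retry prompt at once.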

Layer 2: Content Checks

Does the output contain the right content? For the Research Agent: do the sources actually exist? (Check URLs with a HEAD request.) Are the "key findings" actually about the requested topic? For the Writing Agent: does it reference the research brief's data points? Does it follow the outline's section order?

Content checks require more logic but are still automatable. URL validation is a simple HTTP request. Checking whether specific data points from the research brief appear in the draft is a string search.
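Both checks described above fit in a few lines. `url_exists` and `data_points_present` are hypothetical helper names; a production pipeline would also want rate limiting and a User-Agent header on the HEAD requests:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

def url_exists(url: str, timeout: float = 5.0) -> bool:
    """Send a HEAD request; treat any non-error response as 'exists'."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, ValueError):
        return False

def data_points_present(draft: str, data_points: list[str]) -> list[str]:
    """Return the brief's data points that do not appear in the draft."""
    lowered = draft.lower()
    return [p for p in data_points if p.lower() not in lowered]
```

A non-empty return from `data_points_present` becomes the "content error message" appended to the retry prompt.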

Layer 3: Quality Checks

Does the output meet quality thresholds? For the Writing Agent: how many AI artifact markers are present? (Scan for the 15 markers from Module 1.) What percentage of sentences start with the same word? (A sign of parallel structure overuse.) Is the reading level appropriate for the target audience?

Quality checks can be partially automated with heuristics. Artifact detection uses keyword and pattern matching. Reading level uses established formulas (Flesch-Kincaid). Some quality checks require a separate AI call, which adds cost but catches issues that pattern matching misses.
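The sentence-opener heuristic is simple enough to sketch. The naive sentence splitter below is an assumption: it will miscount around abbreviations and decimals, which is acceptable for a coarse threshold check:

```python
import re
from collections import Counter

def repeated_opener_pct(text: str) -> float:
    """Percentage of sentences that begin with the most common opening word."""
    # Naive split on terminal punctuation -- good enough for a heuristic.
    sentences = [s.strip() for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    if not sentences:
        return 0.0
    openers = [s.split()[0].lower() for s in sentences]
    _, count = Counter(openers).most_common(1)[0]
    return 100 * count / len(sentences)
```

You would then compare the result against a threshold (say, flag anything above 40%) as one input to the Layer 3 verdict.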

Retry Strategy

When a check fails, the default response is to retry the agent with a modified prompt. The modification should include the specific error found.

| Failure Type | Retry Approach | Max Retries |
| --- | --- | --- |
| Invalid JSON | Append: "Your previous output was not valid JSON. Output ONLY valid JSON." | 2 |
| Missing required field | Append: "Your output was missing the [field] key. Include all required fields." | 2 |
| Word count out of range | Append: "Output was [N] words. Target is 900-1100. Adjust length." | 1 |
| High artifact count | Append: "Found [N] AI artifacts. Remove hedging phrases and tricolons." | 1 |
| Outline not followed | Regenerate from scratch with emphasis on outline compliance | 1 |

Set a maximum retry count per agent (typically 2). If the agent fails after maximum retries, the chain halts for human intervention. Do not retry indefinitely. An agent that fails 3 times on the same input has a prompt problem, not a luck problem.
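The retry loop itself is framework-agnostic. In this sketch, `agent` and `checks` are hypothetical callables standing in for your actual agent invocation and whichever validation layers apply:

```python
def run_with_retries(agent, checks, prompt: str, max_retries: int = 2) -> str:
    """Run an agent, validate its output, and retry with the errors appended.

    `agent` is any callable prompt -> output; `checks` is any callable
    output -> list of error strings (empty list = pass). Raises RuntimeError
    to halt the chain for human review once retries are exhausted.
    """
    current_prompt = prompt
    for _ in range(max_retries + 1):
        output = agent(current_prompt)
        errors = checks(output)
        if not errors:
            return output
        # Retry with the specific errors appended, per the table above.
        current_prompt = prompt + "\n\nYour previous output failed: " + "; ".join(errors)
    raise RuntimeError(f"agent failed after {max_retries} retries: {errors}")
```

Note that each retry modifies the original prompt rather than stacking error messages on top of previous error messages, which keeps the prompt from growing unboundedly.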

Error Logging

Every failure and retry should be logged: the agent name, the input (or a hash of it), the error type, the retry attempt, and whether the retry succeeded. Over time, this log reveals patterns. If the Research Agent fails on schema compliance 30% of the time, your schema instruction needs reinforcement. If the Writing Agent fails on word count with long outlines, your content spec needs adjustment for outline length.

Error handling is not a safety net. It is a feedback mechanism. Errors tell you where your agent instructions are weak. Fix the instructions, and the error rate drops. A pipeline with zero errors is not a healthy pipeline. It is a pipeline that is not logging properly.

Assignment

Add one automated check between each agent in your chain:

  1. Between Research Agent and Writing Agent: define the check, what constitutes pass/fail, and the retry strategy.
  2. Between Writing Agent and Editing Agent: define the check, pass/fail criteria, and retry strategy.
  3. Implement at least one check as code (or have your AI coding assistant write it).

Run your chain 5 times. How many times did a check catch an error? How many retries were needed? Did any chain halt for human intervention? Document the results and update your agent system prompts based on the error patterns you observed.