Parallel Processing: Multiple Streams Simultaneously
Session 9.4 · ~5 min read
From Sequential to Parallel
Your three-agent chain processes one piece of content at a time. Agent 1 finishes, Agent 2 starts, Agent 2 finishes, Agent 3 starts. For a single piece, this is fine. For ten pieces, it means waiting for all three agents to complete on piece 1 before starting piece 2. That is sequential processing, and it does not scale.
Parallel processing runs multiple chains simultaneously. While the Research Agent works on topic 5, the Writing Agent works on topic 3, and the Editing Agent works on topic 1. All ten pieces move through the pipeline concurrently, bounded only by API rate limits and your budget.
The Throughput Difference
Assume each agent call takes 10 seconds, so a three-agent chain takes 30 seconds per piece. Sequentially, 10 pieces take 300 seconds (5 minutes). With full parallelism, all 10 pieces run their Research Agents simultaneously, then all 10 Writing Agents, then all 10 Editing Agents: three 10-second stages, for 30 seconds total.
In practice, you cannot run unlimited parallel calls. APIs have rate limits. Your budget has limits. The solution is controlled concurrency: run N chains simultaneously, where N is capped by a semaphore.
[Diagram: topics queue into a semaphore (limit: 5 concurrent); five chains (Research → Writing → Editing) run in parallel; each completed chain releases its slot, and the next topic enters.]
| Processing Mode | 10 Pieces (30s/chain) | 50 Pieces | 100 Pieces |
|---|---|---|---|
| Sequential | 300s (5 min) | 1,500s (25 min) | 3,000s (50 min) |
| Parallel (5 concurrent) | 60s (1 min) | 300s (5 min) | 600s (10 min) |
| Parallel (10 concurrent) | 30s | 150s (2.5 min) | 300s (5 min) |
| Parallel (25 concurrent) | 30s | 60s (1 min) | 120s (2 min) |
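In the ideal case, wall-clock time for a batch is just the number of waves (pieces divided by the concurrency cap, rounded up) times the 30-second chain time. A minimal sketch of that arithmetic, with no overhead or rate-limit delays modeled (the function name is mine):

```python
import math

def ideal_batch_time(pieces, concurrency, chain_seconds=30):
    """Idealized wall clock: chains run in waves of size `concurrency`,
    and every chain takes exactly `chain_seconds`."""
    return math.ceil(pieces / concurrency) * chain_seconds

print(ideal_batch_time(100, 1))   # sequential: 3000 s
print(ideal_batch_time(100, 5))   # 600 s
print(ideal_batch_time(100, 25))  # 120 s
```

Real runs will be slower: calls vary in latency, and rate limits add waiting. Treat these as lower bounds.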
Implementing Concurrency
Python's asyncio library handles concurrent API calls. The pattern is straightforward: define an async function for your agent chain, use asyncio.Semaphore to cap concurrency, and run all chains with asyncio.gather.
The conceptual structure looks like this:
```python
import asyncio

semaphore = asyncio.Semaphore(5)  # max 5 concurrent chains

async def run_chain(topic):
    async with semaphore:
        research = await call_research_agent(topic)
        draft = await call_writing_agent(research)
        review = await call_editing_agent(draft)
        return review

async def main():
    topics = ["topic1", "topic2", ..., "topic10"]
    results = await asyncio.gather(*[run_chain(t) for t in topics])

asyncio.run(main())
```
Your AI coding assistant can fill in the actual API calls, error handling, and file saving. The structural pattern above is what you need to understand. Everything else is implementation detail.
Rate Limits and Throttling
Every API provider imposes rate limits: maximum requests per minute, maximum tokens per minute, or both. Exceeding these limits returns 429 (Too Many Requests) errors. Your script needs to handle these gracefully.
Two strategies:
- Proactive throttling: Set your semaphore count low enough that you never hit the rate limit. If the API allows 60 requests per minute and each chain makes 3 requests, at most 20 chains can complete per minute; a semaphore of 15-18 leaves a safety margin.
- Reactive retry: When a 429 error occurs, wait for the duration specified in the response headers (usually Retry-After), then retry. Combine this with exponential backoff: wait 1 second, then 2, then 4, up to a maximum.
The proactive approach is smoother. The reactive approach handles edge cases. Use both.
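The reactive side can be sketched as a retry wrapper. This is an illustrative pattern, not a specific provider's SDK: RateLimitError is a stand-in for whatever 429 exception your client raises, with the Retry-After value already parsed.

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for a provider's 429 error; real SDKs raise their own type."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after

async def call_with_retry(call, *args, max_retries=5):
    """Retry on 429: honor Retry-After when the server sends it,
    otherwise back off exponentially (1 s, 2 s, 4 s, ...) up to a cap."""
    for attempt in range(max_retries):
        try:
            return await call(*args)
        except RateLimitError as err:
            wait = err.retry_after if err.retry_after is not None else min(2 ** attempt, 30)
            await asyncio.sleep(wait + random.random())  # jitter spreads out retries
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

Wrap each agent call (not the whole chain) so a retry repeats only the call that was throttled.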
Failure Isolation
When running 25 chains in parallel, some will fail. An API timeout on chain 7 should not crash chains 1 through 6 and 8 through 25. Each chain runs independently. Failures are logged, and the failed chain is retried after all successful chains complete.
This is called failure isolation, and it is the difference between a script that produces 24 out of 25 pieces and a script that produces 0 because one failure brought down the entire batch.
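With asyncio, failure isolation falls out of asyncio.gather's return_exceptions=True flag, which hands exceptions back as values instead of cancelling the batch. A sketch of the log-then-retry pattern described above (run_all and its one-retry policy are illustrative):

```python
import asyncio

async def run_all(topics, run_chain):
    """Run every chain; a failure in one chain is captured as a value,
    and failed topics are retried once after the batch completes."""
    results = await asyncio.gather(
        *[run_chain(t) for t in topics], return_exceptions=True
    )
    failed = [t for t, r in zip(topics, results) if isinstance(r, Exception)]
    if failed:
        retried = await asyncio.gather(
            *[run_chain(t) for t in failed], return_exceptions=True
        )
        fixes = dict(zip(failed, retried))  # merge retries back by topic
        results = [fixes.get(t, r) for t, r in zip(topics, results)]
    return results
```

Without return_exceptions=True, the first exception propagates out of gather and every other chain's result is lost, which is exactly the 0-out-of-25 failure mode.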
Parallel processing does not change what your pipeline does. It changes how many pieces move through the pipeline simultaneously. The quality gates, the handoff contracts, and the human review all remain identical. Scale is a throughput change, not a quality change.
Further Reading
- Asynchronous LLM API Calls in Python: A Comprehensive Guide, Unite.AI
- Mastering asyncio.gather for LLM Processing, Instructor
- Python Asyncio for LLM Concurrency: Best Practices, Newline
Assignment
Take your working three-agent chain and run it on 3 different topics simultaneously:
- If you are comfortable with code: write (or have your AI assistant write) an async script with a semaphore set to 2. Run all 3 chains concurrently.
- If not: open 3 terminal windows and run each chain manually in parallel.
- Time the entire operation. Compare to running all 3 sequentially.
Document: total time (parallel vs. sequential), any failures encountered, and the quality of each output. Did parallel execution affect output quality? (It should not.) Calculate the time savings as a percentage.
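For the timing and the savings calculation, time.perf_counter is enough. A small sketch (the function names are mine):

```python
import asyncio
import time

async def time_run(coro):
    """Wall-clock an awaitable, e.g. the parallel batch or a sequential loop."""
    start = time.perf_counter()
    await coro
    return time.perf_counter() - start

def savings_pct(sequential_s, parallel_s):
    """Time saved by parallel execution, as a percentage of the sequential run."""
    return (sequential_s - parallel_s) / sequential_s * 100

# Example: 300 s sequential vs. 90 s parallel is a 70% saving.
print(savings_pct(300, 90))
```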