Concurrency: Running 25 API Calls Simultaneously
Session 10.3 · ~5 min read
The Speed Problem
A single API call takes 3 to 15 seconds depending on the model, the prompt length, and the output length. A three-agent chain takes 9 to 45 seconds. Run 100 chains sequentially, and you are waiting 15 to 75 minutes. That is not a production bottleneck for content quality, but it is a bottleneck for your time. You are sitting idle, watching a progress bar, when you could be reviewing outputs.
Concurrency solves this. Instead of running one chain at a time, you run multiple chains simultaneously. The total processing time drops from `N * time_per_chain` to roughly `(N / concurrency_limit) * time_per_chain`.
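As a quick sanity check on that formula, here is a minimal sketch (the function name and numbers are illustrative; real batches run somewhat slower because of per-call overhead):

```python
import math

def batch_time(n_chains: int, time_per_chain: float, concurrency: int) -> float:
    """Rough wall-clock estimate: chains run in waves of size `concurrency`.

    Ignores per-call overhead, so real batches come in somewhat slower.
    """
    waves = math.ceil(n_chains / concurrency)
    return waves * time_per_chain

# 100 chains at 25 s each: sequential vs. 10 concurrent
print(batch_time(100, 25, 1))   # 2500 s, about 41 min 40 s
print(batch_time(100, 25, 10))  # 250 s, about 4 min 10 s
```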
How asyncio Works
Python's asyncio library lets you run multiple tasks concurrently without threads or processes. The core concept: when one task is waiting for an API response (which takes seconds), another task can start its API call. No task blocks the others.
Three pieces make this work:
| Component | What It Does | Analogy |
|---|---|---|
| `async def` | Declares a function that can pause and resume | A kitchen station that can pause mid-recipe while waiting for the oven |
| `await` | Pauses the function until a result arrives | "Wait for the oven" without blocking other stations |
| `asyncio.gather` | Runs multiple async functions concurrently | Running 5 kitchen stations at once, each cooking a different dish |
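The three pieces fit together like this. A minimal sketch, using `asyncio.sleep` as a stand-in for the network wait of a real API call (`call_model` is a hypothetical name, not a real client function):

```python
import asyncio

async def call_model(prompt: str) -> str:
    # Stand-in for a real API call: sleeping yields control,
    # just as awaiting an HTTP response would.
    await asyncio.sleep(0.1)
    return f"response to {prompt!r}"

async def main() -> list[str]:
    prompts = [f"prompt {i}" for i in range(5)]
    # gather schedules all five coroutines at once;
    # total time is roughly one call's latency, not five.
    return await asyncio.gather(*(call_model(p) for p in prompts))

results = asyncio.run(main())
print(results)
```

`gather` returns results in the same order as the coroutines you passed in, regardless of which finished first.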
The Semaphore Pattern
Running 100 API calls simultaneously will trigger rate limits and may overwhelm your system. A semaphore caps the number of concurrent operations.
The flow: tasks queue at the semaphore (max 10 concurrent). Ten tasks run at once; when a task completes, its result is saved and the semaphore releases its slot, letting the next queued task enter.
The semaphore value depends on your API rate limits. If each chain makes 3 sequential requests, a semaphore of 10 means at most 10 requests are in flight at any moment; size the limit so that concurrency × requests per chain ÷ chain duration stays under your provider's requests-per-minute cap.
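The pattern itself is a few lines. A sketch, again with `asyncio.sleep` standing in for the chain's real API calls (`run_chain` here is a placeholder, not your actual pipeline function):

```python
import asyncio

SEMAPHORE_LIMIT = 10  # tune to your provider's rate limits

async def run_chain(topic: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for the chain's API calls
    return f"article about {topic}"

async def bounded_chain(topic: str, semaphore: asyncio.Semaphore) -> str:
    # At most SEMAPHORE_LIMIT chains hold the semaphore at once;
    # the rest wait on this line until a slot frees up.
    async with semaphore:
        return await run_chain(topic)

async def run_batch(topics: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(SEMAPHORE_LIMIT)
    return await asyncio.gather(*(bounded_chain(t, semaphore) for t in topics))

articles = asyncio.run(run_batch([f"topic-{i}" for i in range(25)]))
```

Note that the semaphore is created once per batch and shared by every chain; creating a fresh semaphore inside each task would defeat the limit.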
Real Performance Numbers
Tested with a three-agent chain (research, write, edit) generating 1,000-word articles. Average chain time: 25 seconds.
| Batch Size | Sequential | 5 Concurrent | 10 Concurrent | 25 Concurrent |
|---|---|---|---|---|
| 10 articles | 4 min 10s | 1 min 15s | 40s | 30s |
| 25 articles | 10 min 25s | 2 min 45s | 1 min 30s | 40s |
| 50 articles | 20 min 50s | 5 min 25s | 2 min 55s | 1 min 15s |
| 100 articles | 41 min 40s | 10 min 50s | 5 min 45s | 2 min 30s |
At 25 concurrent chains, 100 articles generate in under 3 minutes. The bottleneck shifts entirely from generation to human review, which is where it should be.
Error Handling in Concurrent Batches
When one chain fails in a concurrent batch, it must not bring down the other chains. Wrap each chain in a try/except block. On failure, log the error and the failing row's ID. After the batch completes, retry only the failed rows.
```python
async def safe_chain(topic, semaphore):
    async with semaphore:
        try:
            return await run_chain(topic)
        except Exception as e:
            log_error(topic, e)
            return {"id": topic.id, "status": "failed", "error": str(e)}
```
After batch completion, filter results for status "failed" and re-run only those rows. This gives you a complete batch with minimal wasted computation.
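One way to sketch that retry loop, assuming results shaped like the `safe_chain` dict above (the `flaky_chain` function simulates transient failures so the retry path is visible; all names are illustrative):

```python
import asyncio

_attempts: dict[str, int] = {}

async def flaky_chain(topic: str) -> dict:
    # Simulated chain: even-numbered topics fail on their first attempt only.
    _attempts[topic] = _attempts.get(topic, 0) + 1
    await asyncio.sleep(0)
    if _attempts[topic] == 1 and int(topic[1:]) % 2 == 0:
        return {"id": topic, "status": "failed", "error": "simulated transient error"}
    return {"id": topic, "status": "ok"}

async def run_with_retries(topics: list[str], max_retries: int = 2) -> list[dict]:
    results: dict[str, dict] = {}
    pending = list(topics)
    for _ in range(max_retries + 1):
        if not pending:
            break
        batch = await asyncio.gather(*(flaky_chain(t) for t in pending))
        for r in batch:
            results[r["id"]] = r
        # Re-run only the rows that failed this pass.
        pending = [r["id"] for r in batch if r["status"] == "failed"]
    return [results[t] for t in topics]

rows = asyncio.run(run_with_retries([f"t{i}" for i in range(6)]))
```

Capping `max_retries` matters: a row that fails deterministically (a malformed prompt, say) should end up logged as failed, not retried forever.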
Connection Pooling
Each API call needs an HTTP connection. Opening a fresh connection for every call and tearing it down afterward is wasteful. Connection pooling reuses connections across calls, reducing overhead. Python's aiohttp library handles this automatically with a session object.
Create one session at the start of the batch. Pass it to all chains. Close it when the batch completes. This single change can reduce per-call latency by 100 to 200 milliseconds, which compounds across hundreds of calls.
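The lifecycle looks like this. A sketch using a fake session class so it runs without a network; a real `aiohttp.ClientSession` follows the same create-once, share, close-once shape, and the URL and function names here are hypothetical:

```python
import asyncio

class FakeSession:
    """Stand-in for aiohttp.ClientSession, just to show the lifecycle;
    a real session pools and reuses TCP connections across calls."""
    def __init__(self):
        self.calls = 0
        self.closed = False

    async def post(self, url: str, json: dict) -> dict:
        self.calls += 1
        await asyncio.sleep(0)  # real code awaits the HTTP response here
        return {"ok": True, "topic": json["topic"]}

    async def close(self):
        self.closed = True

async def run_pooled_chain(session: FakeSession, topic: str) -> dict:
    # Every chain shares the one session instead of opening its own connections.
    return await session.post("https://api.example.com/v1/chat", json={"topic": topic})

async def run_pooled_batch(topics: list[str]) -> list[dict]:
    session = FakeSession()  # one session per batch...
    try:
        return await asyncio.gather(*(run_pooled_chain(session, t) for t in topics))
    finally:
        await session.close()  # ...closed once, after every chain finishes

pooled = asyncio.run(run_pooled_batch(["ai-safety", "compilers", "espresso"]))
```

The `try`/`finally` ensures the session closes even if a chain raises, which keeps connections from leaking across batches.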
Concurrency is a throughput multiplier, not a quality multiplier. Your pipeline produces the same output whether you run one chain or 25. Concurrency just means you wait 2 minutes instead of 40. Use the saved time for the stage that actually matters: human review.
Further Reading
- Asynchronous LLM API Calls in Python, Unite.AI
- Mastering asyncio.gather for LLM Processing, Instructor
- LLM Parallel Processing in Practice, Dev.to
- Python Asyncio for LLM Concurrency, Newline
Assignment
Write (or have your AI coding assistant write) a script that demonstrates concurrency:
- Make 10 API calls sequentially. Record total time.
- Make 10 API calls concurrently with a semaphore limiting to 5 simultaneous calls. Record total time.
- Calculate the speedup factor.
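A sketch of that timing harness, with `asyncio.sleep` standing in for the real API call (swap in your actual client before measuring your pipeline):

```python
import asyncio
import time

async def fake_api_call(i: int) -> int:
    await asyncio.sleep(0.1)  # stand-in for a real API call
    return i

async def sequential(n: int) -> float:
    start = time.perf_counter()
    for i in range(n):
        await fake_api_call(i)
    return time.perf_counter() - start

async def concurrent(n: int, limit: int) -> float:
    sem = asyncio.Semaphore(limit)
    async def bounded(i: int) -> int:
        async with sem:
            return await fake_api_call(i)
    start = time.perf_counter()
    await asyncio.gather(*(bounded(i) for i in range(n)))
    return time.perf_counter() - start

seq = asyncio.run(sequential(10))
conc = asyncio.run(concurrent(10, 5))
print(f"sequential: {seq:.2f}s  concurrent: {conc:.2f}s  speedup: {seq / conc:.1f}x")
```

With a 0.1 s simulated call, the sequential run takes about 1 s and the concurrent run about 0.2 s (two waves of five), so expect a speedup near 5x.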
Then apply this to your real pipeline: run your batch manifest from Session 10.2 with concurrent processing. Compare the total batch time to sequential processing. Document any issues: rate limit errors, connection timeouts, or quality differences between sequential and concurrent outputs.