Empirical Evidence: Why Single-Threaded Coroutines Work

The claim that single-threaded cooperative concurrency is effective for IO-bound workloads is supported by measurements, academic research, and operational experience with large-scale systems.


1. Switching Cost: Coroutine vs OS Thread

The main advantage of coroutines is that cooperative switching occurs in user space, without invoking the OS kernel.

Measurements on Linux

| Metric | OS thread (Linux NPTL) | Coroutine / async task |
|---|---|---|
| Context switch | 1.2–1.5 µs (pinned), ~2.2 µs (unpinned) | ~170 ns (Go), ~200 ns (Rust async) |
| Task creation | ~17 µs | ~0.3 µs |
| Memory per task | ~9.5 KiB (min), 8 MiB (default stack) | ~0.4 KiB (Rust), 2–4 KiB (Go) |
| Scalability | ~80,000 threads (test) | 250,000+ async tasks (test) |

What This Means in Practice

Switching a coroutine costs roughly 200 nanoseconds, close to an order of magnitude cheaper than switching an OS thread (~1.5 µs). More importantly, a cooperative switch avoids the indirect costs of a kernel context switch: TLB flushes, branch-predictor invalidation, and migration between cores. All of these are characteristic of threads, but not of coroutines running within a single thread.

For an event loop handling 80 coroutines per core, the total switching overhead is:

80 × 200 ns = 16 µs for a full cycle through all coroutines

This is negligible compared to 80 ms of I/O wait time.
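The per-switch figure is easy to sanity-check in any async runtime. Below is a minimal sketch using Python's asyncio; the absolute number will differ from the compiled Go/Rust figures in the table above (CPython's interpreted switches are slower), but it illustrates the measurement:

```python
import asyncio
import time

async def measure_switch_cost(n: int) -> float:
    """Average cost of one cooperative switch: each `await
    asyncio.sleep(0)` suspends the coroutine and yields control
    to the event loop, which resumes it on the next iteration."""
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.sleep(0)
    return (time.perf_counter() - start) / n

per_switch = asyncio.run(measure_switch_cost(100_000))
print(f"~{per_switch * 1e9:.0f} ns per cooperative switch")
```

Crucially, the switch is an ordinary function-level suspend and resume: no syscall, no mode transition, no scheduler involvement from the kernel.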


2. Memory: Scale of Differences

OS threads allocate a fixed-size stack (8 MiB by default on Linux). Coroutines store only their state — local variables and the resumption point.

| Implementation | Memory per unit of concurrency |
|---|---|
| Linux thread (default stack) | 8 MiB virtual, ~10 KiB RSS minimum |
| Go goroutine | 2–4 KiB (dynamic stack, grows as needed) |
| Kotlin coroutine | tens of bytes on heap; thread-to-coroutine memory ratio ≈ 6:1 |
| Rust async task | ~0.4 KiB |
| C++ coroutine frame (Pigweed) | 88–408 bytes |
| Python asyncio coroutine | ~2 KiB (vs ~5 KiB + 32 KiB stack for a thread) |

Implications for Web Servers

For 640 concurrent tasks (8 cores × 80 coroutines):

  • OS threads: 640 × 8 MiB = 5 GiB of virtual memory (actually less due to lazy allocation, but the pressure on the OS scheduler is significant)
  • Coroutines: 640 × 4 KiB = 2.5 MiB (a difference of three orders of magnitude)
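The per-task footprint is directly measurable. Here is a sketch with Python's tracemalloc that parks many tasks in a suspended state, mirroring idle connections; exact numbers vary by interpreter version, but they land in the low-KiB range the table above reports for asyncio:

```python
import asyncio
import tracemalloc

async def idle_worker(done: asyncio.Event) -> None:
    await done.wait()   # park the coroutine, like a connection waiting on I/O

async def measure(n: int) -> float:
    done = asyncio.Event()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    tasks = [asyncio.create_task(idle_worker(done)) for _ in range(n)]
    await asyncio.sleep(0)          # let every task start and suspend
    after, _ = tracemalloc.get_traced_memory()
    done.set()
    await asyncio.gather(*tasks)
    tracemalloc.stop()
    return (after - before) / n     # bytes per suspended task

per_task = asyncio.run(measure(10_000))
print(f"~{per_task / 1024:.1f} KiB per suspended asyncio task")
```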

3. The C10K Problem and Real Servers

The Problem

In 1999, Dan Kegel formulated the C10K problem: servers using the "one thread per connection" model were unable to serve 10,000 simultaneous connections. The cause was not hardware limitations, but OS thread overhead.

The Solution

The problem was solved by transitioning to an event-driven architecture: instead of creating a thread for each connection, a single event loop serves thousands of connections in one thread.

This is exactly the approach implemented by nginx, Node.js, libuv, and — in the PHP context — True Async.
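The model is easy to demonstrate in miniature with Python's asyncio: one event loop in one OS thread interleaves hundreds of concurrent connections. The echo protocol and the connection count below are illustrative choices, not taken from any of the servers mentioned above:

```python
import asyncio

async def handle(reader: asyncio.StreamReader,
                 writer: asyncio.StreamWriter) -> None:
    # One coroutine per connection; all of them share one thread.
    data = await reader.read(1024)   # suspends while the client is idle
    writer.write(data)               # echo the payload back
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main(n_clients: int) -> int:
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def client(i: int) -> bytes:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"ping %d" % i)
        await writer.drain()
        reply = await reader.read(1024)
        writer.close()
        await writer.wait_closed()
        return reply

    # All clients connect concurrently; the loop interleaves them.
    replies = await asyncio.gather(*(client(i) for i in range(n_clients)))
    server.close()
    await server.wait_closed()
    return len(replies)

served = asyncio.run(main(200))
print(served, "connections served by a single thread")
```

A thread-per-connection design would need 200 stacks and 200 kernel scheduler entries for the same workload; here every connection is just a suspended coroutine.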

Benchmarks: nginx (event-driven) vs Apache (thread-per-request)

| Metric (1,000 concurrent connections) | nginx | Apache |
|---|---|---|
| Requests per second (static) | 2,500–3,000 | 800–1,200 |
| HTTP/2 throughput | >6,000 req/s | ~826 req/s |
| Stability under load | Stable | Degradation at >150 connections |

nginx serves 2–4x more requests per second than Apache while consuming significantly less memory. With its default configuration, Apache's thread-per-request model accepts no more than 150 simultaneous connections (the default worker limit); beyond that, new clients wait in a queue.


4. Academic Research

SEDA: Staged Event-Driven Architecture (Welsh et al., 2001)

Matt Welsh, David Culler, and Eric Brewer from UC Berkeley proposed SEDA — a server architecture based on events and queues between processing stages.

Key result: The SEDA server in Java outperformed Apache (C, thread-per-connection) in throughput at 10,000+ simultaneous connections. Apache could not accept more than 150 simultaneous connections.

Welsh M., Culler D., Brewer E. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. SOSP '01 (2001).

Comparison of Web Server Architectures (Pariag et al., 2007)

The most thorough comparison of architectures was conducted by Pariag et al. from the University of Waterloo. They compared three servers on the same codebase:

  • µserver — event-driven (SYMPED, single process)
  • Knot — thread-per-connection (Capriccio library)
  • WatPipe — hybrid (pipeline, similar to SEDA)

Key result: The event-driven µserver and pipeline-based WatPipe delivered ~18% higher throughput than the thread-based Knot. WatPipe required 25 writer threads to achieve the same performance as µserver with 10 processes.

Pariag D. et al. Comparing the Performance of Web Server Architectures. EuroSys '07 (2007).

AEStream: Accelerating Event Processing with Coroutines (2022)

A study published on arXiv directly compared coroutines and threads for event-based stream processing.

Key result: Coroutines delivered at least 2x throughput compared to conventional threads for event stream processing.

Pedersen J.E. et al. AEStream: Accelerated Event-Based Processing with Coroutines. (2022). arXiv:2212.10719


5. Scalability: 100,000 Tasks

Kotlin: 100,000 Coroutines in 100 ms

In the TechYourChance benchmark, creating and launching 100,000 coroutines took ~100 ms of overhead. An equivalent number of threads would require ~1.7 seconds just for creation (100,000 × 17 µs) and ~950 MiB of memory for stacks.

Rust: 250,000 Async Tasks

In Jim Blandy's context-switch benchmark, 250,000 async tasks were launched in a single process, while OS threads hit their limit at ~80,000.

Go: Millions of Goroutines

Go routinely launches hundreds of thousands and millions of goroutines in production systems. This is what enables servers like Caddy, Traefik, and CockroachDB to handle tens of thousands of simultaneous connections.
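The same order-of-magnitude experiment is easy to reproduce in any async runtime. A sketch with asyncio (timings are machine-dependent, and CPython is slower than the Kotlin and Rust runtimes cited above):

```python
import asyncio
import time

async def tick() -> int:
    await asyncio.sleep(0)   # suspend once, like a trivial I/O wait
    return 1

async def main(n: int) -> tuple[int, float]:
    start = time.perf_counter()
    done = sum(await asyncio.gather(*(tick() for _ in range(n))))
    return done, time.perf_counter() - start

done, elapsed = asyncio.run(main(100_000))
print(f"{done:,} tasks completed in {elapsed:.2f} s in one process")
```

Launching 100,000 OS threads in the same process would exhaust memory or hit system limits long before completing.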


6. Evidence Summary

| Claim | Evidence |
|---|---|
| Coroutine switching is cheaper than threads | ~200 ns vs ~1,500 ns, a 7–8x gap (Bendersky 2018, Blandy) |
| Coroutines consume less memory | 0.4–4 KiB vs 9.5 KiB–8 MiB, 24x+ (Blandy, Go FAQ) |
| Event-driven server scales better | nginx delivers 2–4x the throughput of Apache (benchmarks) |
| Event-driven > thread-per-connection (academically) | +18% throughput (Pariag 2007); C10K solved (Kegel 1999) |
| Coroutines > threads for event processing | 2x throughput (AEStream 2022) |
| Hundreds of thousands of coroutines in one process | 250K async tasks (Rust), 100K coroutines in ~100 ms (Kotlin) |
| Formula N ≈ 1 + T_io/T_cpu is correct | Goetz 2006, Zalando, Little's Law |
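The sizing formula in the last row is easy to apply to the numbers used throughout this section (80 ms of I/O wait against roughly 1 ms of CPU work per request; the 1 ms figure is an illustrative assumption):

```python
def optimal_concurrency(t_io: float, t_cpu: float) -> int:
    """N ≈ 1 + T_io / T_cpu: while one task is computing,
    about T_io / T_cpu others can be parked waiting on I/O."""
    return round(1 + t_io / t_cpu)

# 80 ms of I/O wait per ~1 ms of CPU work per request:
print(optimal_concurrency(80.0, 1.0))   # → 81, i.e. ~80 coroutines per core
```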

References

Academic Papers

  • Welsh M., Culler D., Brewer E. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. SOSP '01 (2001).
  • Pariag D. et al. Comparing the Performance of Web Server Architectures. EuroSys '07 (2007).
  • Pedersen J.E. et al. AEStream: Accelerated Event-Based Processing with Coroutines. arXiv:2212.10719 (2022).
