Empirical Evidence: Why Single-Threaded Coroutines Work

The claim that single-threaded cooperative concurrency is effective for IO-bound workloads is supported by measurements, academic research, and operational experience with large-scale systems.


1. Switching Cost: Coroutine vs OS Thread

The main advantage of coroutines is that cooperative switching occurs in user space, without invoking the OS kernel.

Measurements on Linux

| Metric | OS thread (Linux NPTL) | Coroutine / async task |
|---|---|---|
| Context switch | 1.2–1.5 µs (pinned), ~2.2 µs (unpinned) | ~170 ns (Go), ~200 ns (Rust async) |
| Task creation | ~17 µs | ~0.3 µs |
| Memory per task | ~9.5 KiB (min), 8 MiB (default stack) | ~0.4 KiB (Rust), 2–4 KiB (Go) |
| Scalability | ~80,000 threads (test) | 250,000+ async tasks (test) |

What This Means in Practice

Switching a coroutine costs roughly 200 nanoseconds, close to an order of magnitude cheaper than switching an OS thread (~1.5 µs). More importantly, a cooperative switch avoids the indirect costs of a kernel context switch: TLB flushes, branch-predictor invalidation, and migration between cores. All of these are characteristic of threads, but not of coroutines running within a single thread.

For an event loop handling 80 coroutines per core, the total switching overhead is:

80 × 200 ns = 16 µs for a full cycle through all coroutines

This is negligible compared to 80 ms of I/O wait time.
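The per-switch figure is easy to sanity-check in any async runtime. Below is a minimal sketch using Python's asyncio; the absolute number will differ from the compiled Go/Rust figures in the table above (CPython's interpreted switches are slower), but it illustrates the measurement:

```python
import asyncio
import time

async def measure_switch_cost(n: int) -> float:
    """Average cost of one cooperative switch: each `await
    asyncio.sleep(0)` suspends the coroutine and yields control
    to the event loop, which resumes it on the next iteration."""
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.sleep(0)
    return (time.perf_counter() - start) / n

per_switch = asyncio.run(measure_switch_cost(100_000))
print(f"~{per_switch * 1e9:.0f} ns per cooperative switch")
```

Crucially, the switch is an ordinary function-level suspend and resume: no syscall, no mode transition, no scheduler involvement from the kernel.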


2. Memory: Scale of Differences

OS threads allocate a fixed-size stack (8 MiB by default on Linux). Coroutines store only their state — local variables and the resumption point.

| Implementation | Memory per unit of concurrency |
|---|---|
| Linux thread (default stack) | 8 MiB virtual, ~10 KiB RSS minimum |
| Go goroutine | 2–4 KiB (dynamic stack, grows as needed) |
| Kotlin coroutine | tens of bytes on heap; thread-to-coroutine memory ratio ≈ 6:1 |
| Rust async task | ~0.4 KiB |
| C++ coroutine frame (Pigweed) | 88–408 bytes |
| Python asyncio coroutine | ~2 KiB (vs ~5 KiB + 32 KiB stack for a thread) |

Implications for Web Servers

For 640 concurrent tasks (8 cores × 80 coroutines):

  • OS threads: 640 × 8 MiB = 5 GiB of virtual memory (actually less due to lazy allocation, but the pressure on the OS scheduler is significant)
  • Coroutines: 640 × 4 KiB = 2.5 MiB (a difference of three orders of magnitude)
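The per-task footprint is directly measurable. Here is a sketch with Python's tracemalloc that parks many tasks in a suspended state, mirroring idle connections; exact numbers vary by interpreter version, but they land in the low-KiB range the table above reports for asyncio:

```python
import asyncio
import tracemalloc

async def idle_worker(done: asyncio.Event) -> None:
    await done.wait()   # park the coroutine, like a connection waiting on I/O

async def measure(n: int) -> float:
    done = asyncio.Event()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    tasks = [asyncio.create_task(idle_worker(done)) for _ in range(n)]
    await asyncio.sleep(0)          # let every task start and suspend
    after, _ = tracemalloc.get_traced_memory()
    done.set()
    await asyncio.gather(*tasks)
    tracemalloc.stop()
    return (after - before) / n     # bytes per suspended task

per_task = asyncio.run(measure(10_000))
print(f"~{per_task / 1024:.1f} KiB per suspended asyncio task")
```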

3. The C10K Problem and Real Servers

The Problem

In 1999, Dan Kegel formulated the C10K problem: servers using the "one thread per connection" model were unable to serve 10,000 simultaneous connections. The cause was not hardware limitations, but OS thread overhead.

The Solution

The problem was solved by transitioning to an event-driven architecture: instead of creating a thread for each connection, a single event loop serves thousands of connections in one thread.

This is exactly the approach implemented by nginx, Node.js, libuv, and — in the PHP context — True Async.
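The model is easy to demonstrate in miniature with Python's asyncio: one event loop in one OS thread interleaves hundreds of concurrent connections. The echo protocol and the connection count below are illustrative choices, not taken from any of the servers mentioned above:

```python
import asyncio

async def handle(reader: asyncio.StreamReader,
                 writer: asyncio.StreamWriter) -> None:
    # One coroutine per connection; all of them share one thread.
    data = await reader.read(1024)   # suspends while the client is idle
    writer.write(data)               # echo the payload back
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main(n_clients: int) -> int:
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def client(i: int) -> bytes:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"ping %d" % i)
        await writer.drain()
        reply = await reader.read(1024)
        writer.close()
        await writer.wait_closed()
        return reply

    # All clients connect concurrently; the loop interleaves them.
    replies = await asyncio.gather(*(client(i) for i in range(n_clients)))
    server.close()
    await server.wait_closed()
    return len(replies)

served = asyncio.run(main(200))
print(served, "connections served by a single thread")
```

A thread-per-connection design would need 200 stacks and 200 kernel scheduler entries for the same workload; here every connection is just a suspended coroutine.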

Benchmarks: nginx (event-driven) vs Apache (thread-per-request)

| Metric (1,000 concurrent connections) | nginx | Apache |
|---|---|---|
| Requests per second (static) | 2,500–3,000 | 800–1,200 |
| HTTP/2 throughput | >6,000 req/s | ~826 req/s |
| Stability under load | Stable | Degradation at >150 connections |

nginx serves 2–4x more requests per second than Apache while consuming significantly less memory. With its default configuration, Apache's thread-per-request model accepts no more than 150 simultaneous connections (the default worker limit); beyond that, new clients wait in a queue.


4. Academic Research

SEDA: Staged Event-Driven Architecture (Welsh et al., 2001)

Matt Welsh, David Culler, and Eric Brewer from UC Berkeley proposed SEDA — a server architecture based on events and queues between processing stages.

Key result: The SEDA server in Java outperformed Apache (C, thread-per-connection) in throughput at 10,000+ simultaneous connections. Apache could not accept more than 150 simultaneous connections.

Welsh M., Culler D., Brewer E. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. SOSP '01 (2001).

Comparison of Web Server Architectures (Pariag et al., 2007)

The most thorough comparison of architectures was conducted by Pariag et al. from the University of Waterloo. They compared three servers on the same codebase:

  • µserver — event-driven (SYMPED, single process)
  • Knot — thread-per-connection (Capriccio library)
  • WatPipe — hybrid (pipeline, similar to SEDA)

Key result: The event-driven µserver and pipeline-based WatPipe delivered ~18% higher throughput than the thread-based Knot. WatPipe required 25 writer threads to achieve the same performance as µserver with 10 processes.

Pariag D. et al. Comparing the Performance of Web Server Architectures. EuroSys '07 (2007).

AEStream: Accelerating Event Processing with Coroutines (2022)

A study published on arXiv directly compared coroutines and threads for event-based stream processing.

Key result: Coroutines delivered at least 2x throughput compared to conventional threads for event stream processing.

Pedersen J.E. et al. AEStream: Accelerated Event-Based Processing with Coroutines. (2022). arXiv:2212.10719


5. Scalability: 100,000 Tasks

Kotlin: 100,000 Coroutines in 100 ms

In the TechYourChance benchmark, creating and launching 100,000 coroutines took ~100 ms of overhead. An equivalent number of threads would require ~1.7 seconds just for creation (100,000 × 17 µs) and ~950 MiB of memory for stacks.

Rust: 250,000 Async Tasks

In Jim Blandy's context-switch benchmark, 250,000 async tasks were launched in a single process, while OS threads hit their limit at ~80,000.

Go: Millions of Goroutines

Go routinely launches hundreds of thousands and millions of goroutines in production systems. This is what enables servers like Caddy, Traefik, and CockroachDB to handle tens of thousands of simultaneous connections.
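The same order-of-magnitude experiment is easy to reproduce in any async runtime. A sketch with asyncio (timings are machine-dependent, and CPython is slower than the Kotlin and Rust runtimes cited above):

```python
import asyncio
import time

async def tick() -> int:
    await asyncio.sleep(0)   # suspend once, like a trivial I/O wait
    return 1

async def main(n: int) -> tuple[int, float]:
    start = time.perf_counter()
    done = sum(await asyncio.gather(*(tick() for _ in range(n))))
    return done, time.perf_counter() - start

done, elapsed = asyncio.run(main(100_000))
print(f"{done:,} tasks completed in {elapsed:.2f} s in one process")
```

Launching 100,000 OS threads in the same process would exhaust memory or hit system limits long before completing.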


6. Evidence Summary

| Claim | Evidence |
|---|---|
| Coroutine switching is cheaper than threads | ~200 ns vs ~1,500 ns, a 7–8x gap (Bendersky 2018, Blandy) |
| Coroutines consume less memory | 0.4–4 KiB vs 9.5 KiB–8 MiB, 24x+ (Blandy, Go FAQ) |
| Event-driven server scales better | nginx delivers 2–4x the throughput of Apache (benchmarks) |
| Event-driven > thread-per-connection (academically) | +18% throughput (Pariag 2007); C10K solved (Kegel 1999) |
| Coroutines > threads for event processing | 2x throughput (AEStream 2022) |
| Hundreds of thousands of coroutines in one process | 250K async tasks (Rust), 100K coroutines in ~100 ms (Kotlin) |
| Formula N ≈ 1 + T_io/T_cpu is correct | Goetz 2006, Zalando, Little's Law |
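The sizing formula in the last row is easy to apply to the numbers used throughout this section (80 ms of I/O wait against roughly 1 ms of CPU work per request; the 1 ms figure is an illustrative assumption):

```python
def optimal_concurrency(t_io: float, t_cpu: float) -> int:
    """N ≈ 1 + T_io / T_cpu: while one task is computing,
    about T_io / T_cpu others can be parked waiting on I/O."""
    return round(1 + t_io / t_cpu)

# 80 ms of I/O wait per ~1 ms of CPU work per request:
print(optimal_concurrency(80.0, 1.0))   # → 81, i.e. ~80 coroutines per core
```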

References

Academic Papers

  • Welsh M., Culler D., Brewer E. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. SOSP '01 (2001).
  • Pariag D. et al. Comparing the Performance of Web Server Architectures. EuroSys '07 (2007).
  • Pedersen J.E. et al. AEStream: Accelerated Event-Based Processing with Coroutines. arXiv:2212.10719 (2022).
