I was curious this afternoon about how fast (or slow) coroutines can be. So I wrote a small microbenchmark in C, using libcoro’s handwritten-assembly implementation of stack switching for symmetric coroutines.
The benchmark builds a complete binary tree of configurable size, then visits it twice. First, a conventional recursively-written function takes the tree root and a function pointer, and calls the function pointer to process each value stored in the tree.
The benchmark next re-visits the tree with the iteration encapsulated by a Lua-style asymmetric coroutine class, which
yield()s tree node values back to the main routine.
One interesting operational consequence of doing the recursive visiting inside the coroutine is that the main program stack remains at constant depth for every processed node. This constant-stack-space property is also an effect of CPS transformations – not surprising, since both continuations and coroutines are general control abstractions.
Anyways, I found that when compiled with –O2 (gcc 4.4.3) and run on a Core 2, the overhead of using
yield() compared to
return was about 110 cycles per
yield(). That cost is on par with loading a single word from RAM. Stack switching is an impressively efficient way of implementing a feature that lends significant flexibility and power to a language!