Coroutine Performance

I was curious this afternoon about how fast (or slow) coroutines can be. So I wrote a small microbenchmark in C, using libcoro’s  handwritten-assembly implementation of stack switching for symmetric coroutines.

The benchmark builds a complete binary tree of configurable size, then visits it twice. First, a conventional recursively-written function takes the tree root and a function pointer, and calls the function pointer to process each value stored in the tree.

The benchmark next re-visits the tree with the iteration encapsulated by a Lua-style asymmetric coroutine class, which yield()s tree node values back to the main routine.

One interesting operational consequence of doing the recursive visiting inside the coroutine is that the main program stack remains at constant depth for every processed node. This constant-stack-space property is also an effect of CPS transformations – not surprising, since both continuations and coroutines are general control abstractions.

Anyways, I found that when compiled with –O2 (gcc 4.4.3) and run on a Core 2, the overhead of using resume()/yield() compared to call/return was about 110 cycles per yield(). That cost is on par with loading a single word from RAM. Stack switching is an impressively efficient way of implementing a feature that lends significant flexibility and power to a language!


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s