I maintain a set of benchmarks to compare the performance of executors: https://github.com/tzcnt/runtime-benchmarks. Currently it mostly tracks fork-join performance. Libcoro was added recently and has been underperforming.
I want to give you the opportunity to present libcoro in the best light. Perhaps you can take a look at my benchmark implementation and let me know if there is a way to optimize? The libcoro implementation is here: https://github.com/tzcnt/runtime-benchmarks/tree/main/cpp/libcoro
I'm looking to add some I/O benchmarks soon to compare libraries like boost::cobalt and my own tmc-asio, and I'm excited to benchmark against libcoro, as it's one of the only libraries to provide its own networking implementation.
Some relevant notes on the implementation / docs of libcoro:
thread_pool uses a mutex to wrap a std::deque that is shared among all threads. This appears to be the major performance killer: 96% of profiled time is spent in the kernel's native_queued_spin_lock_slowpath. This could be progressively optimized like so:
- Use a deque + mutex per thread. Have threads push to and consume from their own local deque first, before checking each of the other threads to steal work.
- Have threads push and pop (LIFO) from their own local deque, but pull (FIFO) from other threads' deques. This maximizes the efficiency of work stealing, and it also bounds the maximum number of active tasks by effectively changing the traversal of the task graph from a BFS to a DFS (see the sketch after this list).
- Replace the mutex locking with a lock-free MPMC queue.
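A minimal sketch of the first two steps, assuming plain mutex-guarded deques of coroutine handles; the class and member names here (work_stealing_pool, schedule_local, etc.) are hypothetical, not libcoro's API. Each thread pushes and pops LIFO from its own deque and steals FIFO from the others:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <coroutine>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

class work_stealing_pool {
public:
    explicit work_stealing_pool(std::size_t n) : m_workers(n) {
        for (std::size_t i = 0; i < n; ++i)
            m_threads.emplace_back([this, i] { run(i); });
    }

    ~work_stealing_pool() {
        m_stop = true;
        for (auto& w : m_workers) w.cv.notify_all();
        for (auto& t : m_threads) t.join();
    }

    // Called from worker thread `tid`: push onto that thread's own deque.
    void schedule_local(std::size_t tid, std::coroutine_handle<> h) {
        {
            std::lock_guard lk(m_workers[tid].mtx);
            m_workers[tid].queue.push_back(h);
        }
        m_workers[tid].cv.notify_one();
    }

private:
    struct worker {
        std::mutex mtx;
        std::condition_variable cv;
        std::deque<std::coroutine_handle<>> queue;
    };

    void run(std::size_t tid) {
        while (!m_stop) {
            if (auto h = pop_local(tid)) { h.resume(); continue; }
            if (auto h = steal(tid))     { h.resume(); continue; }
            std::unique_lock lk(m_workers[tid].mtx);
            m_workers[tid].cv.wait_for(lk, std::chrono::milliseconds(1));
        }
    }

    // LIFO pop from the local deque: keeps traversal of the task graph
    // depth-first, which bounds the number of simultaneously active tasks.
    std::coroutine_handle<> pop_local(std::size_t tid) {
        std::lock_guard lk(m_workers[tid].mtx);
        auto& q = m_workers[tid].queue;
        if (q.empty()) return {};
        auto h = q.back();
        q.pop_back();
        return h;
    }

    // FIFO pull from another thread's deque: steals the oldest (typically
    // largest) piece of work, which minimizes how often stealing is needed.
    std::coroutine_handle<> steal(std::size_t tid) {
        for (std::size_t i = 0; i < m_workers.size(); ++i) {
            if (i == tid) continue;
            std::lock_guard lk(m_workers[i].mtx);
            auto& q = m_workers[i].queue;
            if (!q.empty()) {
                auto h = q.front();
                q.pop_front();
                return h;
            }
        }
        return {};
    }

    std::vector<worker> m_workers;          // sized once; workers never move
    std::vector<std::thread> m_threads;
    std::atomic<bool> m_stop{false};
};
```

The per-thread mutexes rarely contend (only when another thread steals), so this already removes most of the lock contention; the lock-free MPMC queue in the third step is a further refinement.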
The requirement to call tp.schedule() to fork introduces additional overhead, because each task must be resumed at least once only to be immediately suspended and then posted to the executor. It would be much more efficient to introduce a fork() function that directly enqueues the task to the executor without commencing execution. For when_all(), all but one of the tasks should be forked, and the last task should be resumed by symmetric transfer.
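A rough sketch of what such a fork() could look like, assuming a lazily-started task type and an executor with a hypothetical post(coroutine_handle<>) method; none of these names are libcoro API:

```cpp
#include <coroutine>
#include <exception>

// A lazily-started, fire-and-forget task: the body does not run until the
// handle is resumed, and the frame destroys itself when it finishes.
struct fork_task {
    struct promise_type {
        fork_task get_return_object() {
            return {std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }  // lazy start
        std::suspend_never final_suspend() noexcept { return {}; }     // self-destroying
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };
    std::coroutine_handle<promise_type> handle;
};

// fork(): hand the still-suspended handle straight to the executor, so the
// child's first resume happens on a pool thread. `ex.post(handle)` is an
// assumed executor interface that enqueues a coroutine_handle to run later.
template <typename Executor>
void fork(Executor& ex, fork_task t) {
    ex.post(t.handle);
}
```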
I noticed that when_all_task_promise::final_suspend's completion_notifier::await_suspend returns void and synchronously calls coroutine.promise().m_latch->notify_awaitable_completed(), which in turn synchronously calls m_awaiting_coroutine.resume(). This could lead to a stack overflow under deeper recursion; using symmetric transfer here instead avoids it. I saw this pattern in a few other places as well... it really misses out on one of the benefits of C++20 coroutines.
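For illustration, a hedged sketch of the symmetric-transfer shape, not libcoro's real code; the latch here is a simplified stand-in whose notify method returns whether this was the last outstanding child:

```cpp
#include <atomic>
#include <coroutine>
#include <cstddef>

// Minimal stand-in for the when_all latch; only what the sketch needs.
struct when_all_latch {
    std::atomic<std::size_t> m_count;
    std::coroutine_handle<> m_awaiting_coroutine;

    // True when this completion was the last outstanding child task.
    bool notify_awaitable_completed() {
        return m_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};

struct completion_notifier {
    bool await_ready() const noexcept { return false; }

    // Symmetric transfer: instead of returning void and calling
    // m_awaiting_coroutine.resume() (which nests one stack frame per level of
    // recursion), return the handle to resume next. The compiler transfers
    // control tail-call style, with no stack growth.
    template <typename Promise>
    std::coroutine_handle<>
    await_suspend(std::coroutine_handle<Promise> coroutine) noexcept {
        auto& latch = *coroutine.promise().m_latch;
        if (latch.notify_awaitable_completed()) {
            return latch.m_awaiting_coroutine;   // last child: continue the parent
        }
        return std::noop_coroutine();            // not last: nothing to run next
    }

    void await_resume() const noexcept {}
};
```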
What is the coro::task_container type? It's mentioned in the docs but I don't see an actual implementation (just a template parameter that references a task_container_type).