I maintain a set of benchmarks to compare the performance of executors: https://github.com/tzcnt/runtime-benchmarks. Currently it mostly tracks fork-join performance. Libcoro was added recently and has been underperforming.
I want to give you the opportunity to present libcoro in the best light. Perhaps you can take a look at my benchmark implementation and let me know if there is a way to optimize? The libcoro implementation is here: https://github.com/tzcnt/runtime-benchmarks/tree/main/cpp/libcoro
I'm looking to add some I/O benchmarks soon to compare libraries like boost::cobalt and my own tmc-asio, and I'm excited to benchmark against libcoro, as it's one of the only libraries to provide its own networking implementation.
Some relevant notes on the implementation / docs of libcoro:
thread_pool uses a mutex to wrap a std::deque that is shared among all threads. This appears to be the major performance killer: 96% of profiled time is spent in the kernel's native_queued_spin_lock_slowpath. This could be progressively optimized like so:
- Use a deque + mutex per thread. Have threads push to and consume from their own local deque first, before checking each of the other threads to steal work.
- Have threads push and pop (LIFO) from their own local deque, but pull (FIFO) from other threads' deques. This maximizes the efficiency of work stealing, and it also bounds the maximum number of active tasks by effectively changing the traversal of the task graph from a BFS to a DFS (see the sketch after this list).
- Replace the mutex locking with a lock-free MPMC queue.
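A minimal sketch of the first two steps, assuming plain mutex-guarded deques of coroutine handles; the class and member names here (work_stealing_pool, schedule_local, etc.) are hypothetical, not libcoro's API. Each thread pushes and pops LIFO from its own deque and steals FIFO from the others:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <coroutine>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

class work_stealing_pool {
public:
    explicit work_stealing_pool(std::size_t n) : m_workers(n) {
        for (std::size_t i = 0; i < n; ++i)
            m_threads.emplace_back([this, i] { run(i); });
    }

    ~work_stealing_pool() {
        m_stop = true;
        for (auto& w : m_workers) w.cv.notify_all();
        for (auto& t : m_threads) t.join();
    }

    // Called from worker thread `tid`: push onto that thread's own deque.
    void schedule_local(std::size_t tid, std::coroutine_handle<> h) {
        {
            std::lock_guard lk(m_workers[tid].mtx);
            m_workers[tid].queue.push_back(h);
        }
        m_workers[tid].cv.notify_one();
    }

private:
    struct worker {
        std::mutex mtx;
        std::condition_variable cv;
        std::deque<std::coroutine_handle<>> queue;
    };

    void run(std::size_t tid) {
        while (!m_stop) {
            if (auto h = pop_local(tid)) { h.resume(); continue; }
            if (auto h = steal(tid))     { h.resume(); continue; }
            std::unique_lock lk(m_workers[tid].mtx);
            m_workers[tid].cv.wait_for(lk, std::chrono::milliseconds(1));
        }
    }

    // LIFO pop from the local deque: keeps traversal of the task graph
    // depth-first, which bounds the number of simultaneously active tasks.
    std::coroutine_handle<> pop_local(std::size_t tid) {
        std::lock_guard lk(m_workers[tid].mtx);
        auto& q = m_workers[tid].queue;
        if (q.empty()) return {};
        auto h = q.back();
        q.pop_back();
        return h;
    }

    // FIFO pull from another thread's deque: steals the oldest (typically
    // largest) piece of work, which minimizes how often stealing is needed.
    std::coroutine_handle<> steal(std::size_t tid) {
        for (std::size_t i = 0; i < m_workers.size(); ++i) {
            if (i == tid) continue;
            std::lock_guard lk(m_workers[i].mtx);
            auto& q = m_workers[i].queue;
            if (!q.empty()) {
                auto h = q.front();
                q.pop_front();
                return h;
            }
        }
        return {};
    }

    std::vector<worker> m_workers;          // sized once; workers never move
    std::vector<std::thread> m_threads;
    std::atomic<bool> m_stop{false};
};
```

The per-thread mutexes rarely contend (only when another thread steals), so this already removes most of the lock contention; the lock-free MPMC queue in the third step is a further refinement.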
The requirement to call tp.schedule() to fork introduces additional overhead, because each task must be resumed at least once only to be immediately suspended and then posted to the executor. It would be much more efficient to introduce a fork() function that directly enqueues the task to the executor without commencing execution. For when_all(), all but one of the tasks should be forked, and the last task should be resumed by symmetric transfer.
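A rough sketch of what such a fork() could look like, assuming a lazily-started task type and an executor with a hypothetical post(coroutine_handle<>) method; none of these names are libcoro API:

```cpp
#include <coroutine>
#include <exception>

// A lazily-started, fire-and-forget task: the body does not run until the
// handle is resumed, and the frame destroys itself when it finishes.
struct fork_task {
    struct promise_type {
        fork_task get_return_object() {
            return {std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }  // lazy start
        std::suspend_never final_suspend() noexcept { return {}; }     // self-destroying
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };
    std::coroutine_handle<promise_type> handle;
};

// fork(): hand the still-suspended handle straight to the executor, so the
// child's first resume happens on a pool thread. `ex.post(handle)` is an
// assumed executor interface that enqueues a coroutine_handle to run later.
template <typename Executor>
void fork(Executor& ex, fork_task t) {
    ex.post(t.handle);
}
```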
I noticed that when_all_task_promise::final_suspend's completion_notifier::await_suspend returns void and synchronously calls coroutine.promise().m_latch->notify_awaitable_completed(), which in turn synchronously calls m_awaiting_coroutine.resume(). This could lead to a stack overflow under deeper recursion; using symmetric transfer here instead avoids it. I saw this pattern in a few other places as well... it really misses out on one of the benefits of C++20 coroutines.
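For illustration, a hedged sketch of the symmetric-transfer shape, not libcoro's real code; the latch here is a simplified stand-in whose notify method returns whether this was the last outstanding child:

```cpp
#include <atomic>
#include <coroutine>
#include <cstddef>

// Minimal stand-in for the when_all latch; only what the sketch needs.
struct when_all_latch {
    std::atomic<std::size_t> m_count;
    std::coroutine_handle<> m_awaiting_coroutine;

    // True when this completion was the last outstanding child task.
    bool notify_awaitable_completed() {
        return m_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};

struct completion_notifier {
    bool await_ready() const noexcept { return false; }

    // Symmetric transfer: instead of returning void and calling
    // m_awaiting_coroutine.resume() (which nests one stack frame per level of
    // recursion), return the handle to resume next. The compiler transfers
    // control tail-call style, with no stack growth.
    template <typename Promise>
    std::coroutine_handle<>
    await_suspend(std::coroutine_handle<Promise> coroutine) noexcept {
        auto& latch = *coroutine.promise().m_latch;
        if (latch.notify_awaitable_completed()) {
            return latch.m_awaiting_coroutine;   // last child: continue the parent
        }
        return std::noop_coroutine();            // not last: nothing to run next
    }

    void await_resume() const noexcept {}
};
```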
What is the coro::task_container type? It's mentioned in the docs but I don't see an actual implementation (just a template parameter that references a task_container_type).