-
Notifications
You must be signed in to change notification settings - Fork 80
perf(hesai): optimize convert_returns hot path #389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
mojomex
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you for the PR!
I realize that it's still in draft phase, but I've left some comments anyway just to align with you.
I think a 6% performance improvement is sizeable and we are happy to take it!
However, there is quite a lot of boilerplate code that comes from it, so I'd be interested in finding out where the bulk of the 6% comes from. I expect that things like __builtin_expect and some of the caching can likely be excluded while keeping almost all of the perforamnce gains.
Let me know what you think!
| // ============ PERFORMANCE OPTIMIZATION: Pre-allocated buffers ============ | ||
| /// @brief Pre-allocated return unit pointers to avoid heap allocation per channel | ||
| std::vector<const typename SensorT::packet_t::body_t::block_t::unit_t *> return_units_buf_; | ||
| /// @brief Pre-computed distances for current return group | ||
| std::array<float, SensorT::packet_t::max_returns> distances_buf_; | ||
| // ========================================================================= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid adding even more member variables as the decoder is already getting quite large.
As std::array is stack-allocated, it can be moved into convert_returns with no performance overhead.
The std::vector could be replaced by a boost::static_vector and moved to convert_returns as well.
| #include <cstdint> | ||
| #include <memory> | ||
| #include <optional> | ||
| #include <span> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spans are a C++20 feature while the project is officially still C++17.
I would love to move to C++20 or 23 but we need to ensure that ROS2 buildfarm can handle that. We are planning buildfarm release soon.
|
|
||
| // Pre-fetch block pointers for this return group | ||
| const auto * const blocks = &packet_.body.blocks[start_block_id]; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_timestamp_ns, which uses timegm internally, is indeed slow, so caching that is a good idea. The rest of these should be instant anyway and are not needed imho.
|
|
||
| // ====== OPTIMIZATION: Combined range check with likely/unlikely hints ====== | ||
| if (__builtin_expect( | ||
| distance < SensorT::min_range || distance > SensorT::max_range || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does __builtin_expect measurably improve performance here? If there is not a very clear, strong performance gain, I would like to keep compiler extrinsics out of Nebula as they make the code less readable.
PR Type
Related Links
N/A
Description
Optimize Hesai decoder hot path (
convert_returns) to reduce point cloud processing latency by ~6%.Changes
Eliminate per-channel heap allocation
return_units_buf_withreserve()in constructorresize()(O(1) when capacity is sufficient)Cache packet-level values
dis_unit(distance unit multiplier)return_mode(avoid repeatedstatic_cast)dual_return_distance_threshold,min_range,max_rangeis_inside_overlapresultPre-compute distances
distances_buf_[]to avoid redundant calculations in inner loopAdd branch prediction hints
__builtin_expect(..., 0)for unlikely conditions (zero distance, out-of-range)Performance Measurement
Measured via bpftrace on
unpackfunction with PandarXT32 rosbag playback:Benchmark methodology
Terminal 1 - Launch driver:
Terminal 2 - Play rosbag (10 iterations):
Terminal 3 - bpftrace measurement:
Review Procedure
Remarks