Conversation


@k1832 k1832 commented Jan 6, 2026

PR Type

  • Improvement

Related Links

N/A

Description

Optimize Hesai decoder hot path (convert_returns) to reduce point cloud processing latency by ~6%.

Changes

  1. Eliminate per-channel heap allocation

    • Pre-allocate return_units_buf_ with reserve() in constructor
    • Reuse buffer via resize() (O(1) when capacity is sufficient)
  2. Cache packet-level values

    • dis_unit (distance unit multiplier)
    • return_mode (avoid repeated static_cast)
    • dual_return_distance_threshold, min_range, max_range
    • is_inside_overlap result
  3. Pre-compute distances

    • Store in distances_buf_[] to avoid redundant calculations in inner loop
  4. Add branch prediction hints

    • Use __builtin_expect(..., 0) for unlikely conditions (zero distance, out-of-range)

Performance Measurement

Measured via bpftrace on unpack function with PandarXT32 rosbag playback:

  • ~6% reduction in function execution time

Benchmark methodology

Terminal 1 - Launch driver:

ros2 launch nebula_ros nebula_launch.py sensor_model:=PandarXT32 launch_hw:=false

Terminal 2 - Play rosbag (10 iterations):

for i in {1..10}; do
  ros2 bag play /path/to/nebula/nebula_tests/data/hesai/xt32/1673400677802009732
  sleep 1
done

Terminal 3 - bpftrace measurement:

LIB="/path/to/install/nebula_decoders/lib/libnebula_decoders_hesai.so"
sudo bpftrace -e "
uprobe:$LIB:*unpack* {
    @start[tid] = nsecs;
}
uretprobe:$LIB:*unpack* /@start[tid]/ {
    \$ns = nsecs - @start[tid];
    @stats_ns[func] = stats(\$ns);
    @dist_ns[func] = hist(\$ns);
    delete(@start[tid]);
}
"

Review Procedure

  1. Review changes in hesai_decoder.hpp (main optimization)
  2. Review minor changes in hesai_sensor.hpp (caching in helper functions)
  3. Verify all unit tests pass

Remarks

  • This optimization only affects real hardware or rosbag replay with raw packets (/pandar_packets)
  • Does not affect planning_simulator or logging_simulator (they use pre-decoded point clouds)

Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Collaborator

@mojomex mojomex left a comment


Hi, thank you for the PR!
I realize that it's still in the draft phase, but I've left some comments anyway just to align with you.

I think a 6% performance improvement is sizeable and we are happy to take it!

However, there is quite a lot of boilerplate code that comes with it, so I'd be interested in finding out where the bulk of the 6% comes from. I expect that things like __builtin_expect and some of the caching can likely be excluded while keeping almost all of the performance gains.

Let me know what you think!

Comment on lines +93 to +98
// ============ PERFORMANCE OPTIMIZATION: Pre-allocated buffers ============
/// @brief Pre-allocated return unit pointers to avoid heap allocation per channel
std::vector<const typename SensorT::packet_t::body_t::block_t::unit_t *> return_units_buf_;
/// @brief Pre-computed distances for current return group
std::array<float, SensorT::packet_t::max_returns> distances_buf_;
// =========================================================================

I'd like to avoid adding even more member variables as the decoder is already getting quite large.

As std::array is stack-allocated, it can be moved into convert_returns with no performance overhead.

The std::vector could be replaced by a boost::static_vector and moved to convert_returns as well.

#include <cstdint>
#include <memory>
#include <optional>
#include <span>

Spans are a C++20 feature while the project is officially still C++17.
I would love to move to C++20 or 23, but we need to ensure that the ROS 2 buildfarm can handle that. We are planning a buildfarm release soon.


// Pre-fetch block pointers for this return group
const auto * const blocks = &packet_.body.blocks[start_block_id];


get_timestamp_ns, which uses timegm internally, is indeed slow, so caching that is a good idea. The rest of these should be instant anyway and are not needed imho.


// ====== OPTIMIZATION: Combined range check with likely/unlikely hints ======
if (__builtin_expect(
distance < SensorT::min_range || distance > SensorT::max_range ||

Does __builtin_expect measurably improve performance here? If there is not a very clear, strong performance gain, I would like to keep compiler extrinsics out of Nebula as they make the code less readable.

