3DEgoACT is an imitation learning policy designed to solve the "viewpoint failure" problem in robotic manipulation. By fusing sparse 3D point clouds with 2D egocentric visual cues, it stays robust to camera perturbations where standard policies such as ACT fail.
Standard imitation learning methods, such as the Action Chunking Transformer (ACT), excel at encoding human dexterity but are brittle to camera shifts.
3DEgoACT addresses this by introducing a hybrid architecture:
- Global Geometry: Uses a PointNet encoder to process a downsampled 3D point cloud, providing a viewpoint-invariant spatial prior.
- Local Precision: Retains an egocentric (Eye-in-Hand) RGB view encoded via ResNet for fine-grained manipulation.
- Efficiency: Designed for low-cost deployment, running at ~14 actions/s on an Apple M1 Pro MacBook.
The model extends the standard ACT architecture by adding a parallel 3D encoding stream.
- Inputs:
  - Egocentric View: RGB image from the wrist (EEF) camera, providing local texture and depth cues for grasping.
  - Point Cloud: A sparse RGBXYZ point cloud generated from a front-facing RGB-D camera and downsampled via Farthest Point Sampling (FPS).
  - Proprioception: 7D state vector (6 joint angles + gripper state).
- Encoders:
  - ResNet-18: Encodes the egocentric image.
  - PointNet: Embeds the point cloud into a 512-dimensional token.
- Fusion: The 3D geometric token is fused with the 2D visual features and proprioception via a Transformer encoder.
- Action Head: A standard ACT Transformer decoder predicts a chunk of future actions (see the sketch after this list).
The project is built on the LeRobot codebase and simulates a 7-DoF xArm manipulator in MuJoCo.
- Front Camera (Static): Used only to generate the point cloud.
- Wrist Camera (Dynamic): Used for live visual feedback.
- Proprioception: Joint positions and gripper status.
- 4D Control Vector: end-effector Cartesian delta + gripper command (see the sketch below).
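For concreteness, a single timestep under this setup might look like the following. This is only an illustration: the key names, image resolution, and field ordering are assumptions, not the environment's actual schema.

```python
import numpy as np

# Hypothetical per-timestep layout (names and shapes assumed for illustration).
observation = {
    "wrist_rgb":   np.zeros((240, 320, 3), dtype=np.uint8),   # egocentric EEF camera
    "point_cloud": np.zeros((512, 6), dtype=np.float32),      # FPS-downsampled RGBXYZ
    "proprio":     np.zeros(7, dtype=np.float32),             # 6 joint angles + gripper state
}

# 4D control vector: Cartesian delta of the end-effector plus a gripper command.
action = np.array([0.01, 0.0, -0.02, 1.0], dtype=np.float32)  # [dx, dy, dz, gripper]
```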
The dataset was recorded by teleoperation with an Xbox controller and is available at rishabhrj11/gym-xarm-pointcloud. Each episode stores:
- Front, left, and right allocentric camera views
- The egocentric EEF camera view
- A 512x6 RGBXYZ point cloud, FPS-downsampled from the front allocentric view (FPS is sketched below)
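Farthest Point Sampling keeps a well-spread subset of the cloud by repeatedly selecting the point farthest from everything already chosen. Below is a plain NumPy sketch, not the project's implementation; the RGBXYZ column ordering is an assumption.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, n_samples: int = 512) -> np.ndarray:
    """points: (N, 6) RGBXYZ array; returns an (n_samples, 6) well-spread subset."""
    xyz = points[:, 3:6]                      # assumes columns ordered [R, G, B, X, Y, Z]
    selected = [np.random.randint(len(points))]
    min_dist = np.full(len(points), np.inf)   # distance of each point to the selected set
    for _ in range(n_samples - 1):
        min_dist = np.minimum(min_dist, np.linalg.norm(xyz - xyz[selected[-1]], axis=1))
        selected.append(int(np.argmax(min_dist)))
    return points[selected]


# e.g. reduce a dense cloud from the front RGB-D camera to 512 points
sparse_cloud = farthest_point_sampling(np.random.rand(20000, 6), n_samples=512)
```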
We evaluated 3DEgoACT against a strong ACT baseline on a pick-and-place task. While the baseline collapsed when the camera was moved to "Left" or "Right" positions, 3DEgoACT maintained high success rates.
| Model | Camera View | Reach Success | Pick Success | Place Success | Place Success |
|---|---|---|---|---|---|
| Baseline ACT | Front (Train) | 90% | 80% | 54% | 78% |
| Baseline ACT | Left (Unseen) | 8% | 0% | 0% | 0% |
| Baseline ACT | Right (Unseen) | 6% | 0% | 0% | 0% |
| 3DEgoACT | Front (Train) | 96% | 88% | 60% | 74% |
| 3DEgoACT | Left (Unseen) | 96% | 72% | 54% | 68% |
| 3DEgoACT | Right (Unseen) | 94% | 76% | 50% | 64% |
Data sourced from Table I of the report.
In an ablation, removing the egocentric camera resulted in 0% pick success despite perfect reaching. This confirms that while the point cloud handles global navigation, the wrist camera is critical for the final grasp.
The model was trained for 100 epochs with the following configuration (a code sketch follows the list):
- Batch Size: 912
- Learning Rate: cosine annealing schedule
- Optimizer: AdamW (Weight Decay 0.01)
- Chunk Size: 50
- Hardware: Single NVIDIA L4 GPU
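A minimal PyTorch sketch of this optimizer and schedule setup; the base learning rate, model, and data loop are placeholders, since they are not specified above.

```python
import torch

model = torch.nn.Linear(7, 4)  # stand-in for the 3DEgoACT policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)  # lr is a placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)   # anneal over 100 epochs

for epoch in range(100):
    # ... iterate over the dataset in batches of 912, predicting 50-step action chunks ...
    loss = model(torch.zeros(8, 7)).pow(2).mean()  # dummy loss so the sketch runs end to end
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # one scheduler step per epoch
```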
