MonoDDLE (Monocular Dense Depth Distillation for Localization Errors) builds on MonoDLE. The paper has not been published yet.
The core challenge of 3D object detection lies in recovering lost depth information from a single RGB image. Existing mainstream methods typically rely on sparse LiDAR point cloud ground truth for supervised training, which is limited by sparsity and high data acquisition costs. This project aims to use the absolute metric depth from visual foundation models (such as Depth Anything V3) as "soft labels" or "dense supervision signals." Through knowledge distillation, we guide lightweight monocular detectors to learn more robust depth features, thereby significantly improving detection accuracy without increasing inference costs.
Some visualization results on the KITTI dataset:
| 2D Bounding Box | 3D Bounding Box |
|---|---|
| ![]() | ![]() |

| DA3 Depth Pseudo-Label | Depth Uncertainty |
|---|---|
| ![]() | ![]() ![]() |

| Object Center Heatmap | LiDAR BEV Projection |
|---|---|
| ![]() ![]() | ![]() |
Note: the visualizations above all correspond to KITTI image 001230.
We conducted extensive experiments on the KITTI dataset. The following are some core experimental results (for detailed data, please refer to summary.md in the root directory of the repository):
| Method | 3D@0.7 (Easy/Mod/Hard) | BEV@0.7 (Easy/Mod/Hard) | 3D@0.5 (Easy/Mod/Hard) | BEV@0.5 (Easy/Mod/Hard) |
|---|---|---|---|---|
| CenterNet | 0.60 / 0.66 / 0.77 | 3.46 / 3.31 / 3.21 | 20.00 / 17.50 / 15.57 | 34.36 / 27.91 / 24.65 |
| MonoGRNet | 11.90 / 7.56 / 5.76 | 19.72 / 12.81 / 10.15 | 47.59 / 32.28 / 25.50 | 48.53 / 35.94 / 28.59 |
| MonoDIS | 11.06 / 7.60 / 6.37 | 18.45 / 12.58 / 10.66 | - | - |
| M3D-RPN | 14.53 / 11.07 / 8.65 | 20.85 / 15.62 / 11.88 | 48.53 / 35.94 / 28.59 | 53.35 / 39.60 / 31.76 |
| MonoPair | 16.28 / 12.30 / 10.42 | 24.12 / 18.17 / 15.76 | 55.38 / 42.39 / 37.99 | 61.06 / 47.63 / 41.92 |
| MonoDLE (Re-impl.) | 15.17 / 12.10 / 10.82 | 21.10 / 17.20 / 15.10 | 50.70 / 38.91 / 34.82 | 56.94 / 43.74 / 38.41 |
| MonoDDLE (Ours) | 18.49 / 14.48 / 12.14 | 26.38 / 20.12 / 17.89 | 59.80 / 43.89 / 39.27 | 65.10 / 48.85 / 42.97 |
| Method | DA3 Depth | Uncertainty | 3D APR40 (E / M / H) | BEV APR40 (E / M / H) |
|---|---|---|---|---|
| Baseline | | | 15.17 / 12.10 / 10.82 | 21.10 / 17.20 / 15.10 |
| + DA3 | ✓ | | 18.27 / 14.26 / 11.96 | 25.59 / 19.65 / 16.79 |
| + Uncertainty | ✓ | ✓ | 18.49 / 14.48 / 12.14 | 26.38 / 20.12 / 17.89 |
| Model | Backbone | FLOPs (G) | Params (M) |
|---|---|---|---|
| MonoDLE | DLA-34 | 79.37 | 20.31 |
| MonoDDLE (Ours) | DLA-34 | 83.91 | 20.46 |
| MonoDDLE (Ours) | HRNet-W32 | 212.25 | 48.91 |
| MonoDDLE (Ours) | ResNet-50 | 439.70 | 91.41 |
| MonoDDLE (Ours) | ConvNeXt-Tiny | 129.83 | 38.34 |
| Backbone | 3D APR40 (E / M / H) | BEV APR40 (E / M / H) |
|---|---|---|
| DLA-34 | 17.52 / 13.59 / 12.06 | 25.46 / 19.69 / 17.01 |
| HRNet-W32 | 17.87 / 13.72 / 11.73 | 24.79 / 19.23 / 16.58 |
| ConvNeXtV2-Tiny | 17.17 / 13.25 / 11.69 | 24.97 / 19.42 / 16.74 |
| ResNet-50 | 15.45 / 12.03 / 10.11 | 22.38 / 17.81 / 15.51 |
To probe the bottlenecks of model performance and the coupling between prediction attributes, we conducted an Oracle analysis: during inference, we replace the output of specific prediction heads (depth, size, heading, etc.) with ground-truth (GT) values and observe the model's 3D AP (R40).
| Method | baseline | w/ gt proj. center | w/ gt depth | w/ gt location | w/ gt size_3d | w/ gt heading |
|---|---|---|---|---|---|---|
| MonoDLE | 12.09 | 12.10 | 51.92 | 58.20 | 11.97 | 10.60 |
| MonoDDLE | 14.26 | 10.09 | 21.91 | 21.79 | 5.24 | 4.91 |
| MonoDDLE+U | 14.51 | 14.08 | 48.22 | 51.95 | 13.50 | 11.56 |
- **Depth estimation is the core bottleneck.** When ground-truth depth is provided (`w/ gt depth`), the performance of both MonoDLE and MonoDDLE+U jumps sharply (to 51.92 and 48.22, respectively). This indicates that as long as depth estimation is accurate, the existing prediction heads already possess strong 3D box regression capability.
- **MonoDDLE suffers from severe attribute coupling.** When perfect GT size (`w/ gt size_3d`) or heading (`w/ gt heading`) is provided, MonoDDLE's performance instead plummets from 14.26 to 5.24 and 4.91. This shows that MonoDDLE learned serious "error compensation" during training: the model relies on incorrect size or heading to offset incorrect depth. Even with GT depth, its performance only improves to 21.91, far below the other models.
- **MonoDDLE+U successfully decouples the attributes.** After introducing uncertainty guidance, MonoDDLE+U resolves the coupling problem: with GT depth its performance jumps to 48.22 as expected, and with GT size or heading the precipitous drops seen in MonoDDLE disappear, showing that the pathological coupling between attributes is broken and the model has a higher performance upper bound.
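The oracle substitution described above can be sketched as follows; the `center_2d` key, the attribute names, and the nearest-center matching rule are illustrative assumptions, not the repository's exact implementation:

```python
import numpy as np

def oracle_swap(preds, gts, attr):
    """Replace one predicted attribute with its ground-truth value while
    keeping every other prediction untouched. Predictions are matched to the
    nearest GT box by 2D center distance (an assumed matching rule)."""
    out = []
    for p in preds:
        dists = [np.linalg.norm(np.asarray(p["center_2d"]) - np.asarray(g["center_2d"]))
                 for g in gts]
        g = gts[int(np.argmin(dists))]   # closest GT box
        q = dict(p)                      # shallow copy so preds stay intact
        q[attr] = g[attr]                # e.g. attr = "depth", "size_3d", "heading"
        out.append(q)
    return out
```

Running the standard KITTI evaluation on the swapped detections (one attribute at a time) yields the per-attribute oracle rows in the table above.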
| Baseline | w/ GT Depth |
|---|---|
| ![]() | ![]() |

| w/ GT Location | w/ GT Proj. Center |
|---|---|
| ![]() | ![]() |

| w/ GT Size 3D | w/ GT Heading |
|---|---|
| ![]() | ![]() |
uv manages the Python environment and dependencies (the project uses the `.venv` in the repository root by default):

```bash
cd #ROOT
uv venv .venv
source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
uv pip install -r requirements.txt
```

Please download the KITTI dataset first and organize it as follows:
```
MonoDDLE
└── data
    └── KITTI
        ├── ImageSets            # Split files provided by the repository
        ├── training
        │   ├── calib
        │   ├── image_2
        │   └── label_2
        ├── testing
        │   ├── calib
        │   └── image_2
        └── DA3_depth_results    # Required for DA3 distillation training
            ├── 000000.npz
            ├── 000000_vis.jpg
            ├── 000001.npz
            ├── 000001_vis.jpg
            └── ...
```
The script to generate the DA3 depth data is as follows (make sure `depth_anything_3` is installed):

```bash
python tools/generate_da3_depth.py --data_path data/KITTI --split training
```

This script generates two files for each image:

- `.npz`: contains three keys, `depth` (H, W), `intrinsics` (3, 3), and `extrinsics` (3, 4): the metric depth map predicted by DA3 and the camera intrinsic and extrinsic matrices, respectively.
- `_vis.jpg`: the original image concatenated vertically with the colorized depth map, for a quick quality check.

> Note: only the `depth` key in `.npz` is used during training; `intrinsics` and `extrinsics` are auxiliary information.
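A minimal sketch of reading one pseudo-label file; the helper name is illustrative, and the shape checks simply mirror the key layout described above:

```python
import numpy as np

def load_da3_label(path):
    """Load one DA3 pseudo-label .npz; only `depth` is consumed by training."""
    data = np.load(path)
    depth = data["depth"]        # (H, W) metric depth map in meters
    K = data["intrinsics"]       # (3, 3) camera intrinsic matrix (auxiliary)
    E = data["extrinsics"]       # (3, 4) camera extrinsic matrix (auxiliary)
    assert depth.ndim == 2 and K.shape == (3, 3) and E.shape == (3, 4)
    return depth, K, E
```

For example, `load_da3_label("data/KITTI/DA3_depth_results/000000.npz")` returns the depth map for the first training image.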
`tools/train_val.py` resolves `dataset.root_dir` relative to the project root, so the commands below can be run directly from the repository root.

```bash
cd #ROOT
# Run MonoDDLE (with uncertainty distillation)
python tools/train_val.py --config experiments/configs/monodle/kitti_da3_uncertainty.yaml
```

```bash
cd #ROOT
# Automatically use all visible GPUs
bash experiments/scripts/train_ddp.sh experiments/configs/monodle/kitti_da3_uncertainty.yaml
```

Run evaluation only:

```bash
python tools/train_val.py --config experiments/configs/monodle/kitti_da3_uncertainty.yaml -e
```

This project uses Depth Anything V3 as the teacher model to pre-generate dense, full-image metric depth maps as pseudo-labels. Through a distillation loss, it provides additional supervision to the detector's depth prediction head without changing the detector's backbone architecture.
Before running depth-distillation training, generate the DA3 depth pseudo-labels (see Section 2, Data Preparation):

```bash
python tools/generate_da3_depth.py --data_path data/KITTI --split training
```

Each `.npz` file contains the `depth`, `intrinsics`, and `extrinsics` keys, and a `_vis.jpg` visualization image is written alongside it. Only the `depth` key is read during training.
The total training loss is (reconstructed from the configuration options below; symbols are illustrative):

$$\mathcal{L} = \mathcal{L}_{\text{det}} + \lambda \cdot \mathcal{L}_{\text{distill}}$$

where $\mathcal{L}_{\text{det}}$ is the original detection loss, $\lambda$ is the distillation weight (`distill.lambda`), and $\mathcal{L}_{\text{distill}}$ is a pixel-wise L1 (or SiLog) loss between the predicted depth map and the DA3 pseudo-label, with foreground pixels up-weighted by `distill.foreground_weight`.
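A minimal PyTorch sketch of the L1 distillation term; the argument names and default weights mirror the config values, but this is an assumed formulation, not the repository's exact implementation:

```python
import torch

def distill_loss(pred_depth, da3_depth, fg_mask, lam=0.5, fg_weight=5.0):
    """Foreground-weighted L1 distillation against DA3 pseudo-labels.
    pred_depth, da3_depth: (B, H, W) metric depth; fg_mask: (B, H, W) bool."""
    valid = (da3_depth > 0).float()                # skip invalid pseudo-label pixels
    w = 1.0 + fg_mask.float() * (fg_weight - 1.0)  # foreground pixels get weight 5.0
    l1 = (pred_depth - da3_depth).abs()
    return lam * (w * l1 * valid).sum() / valid.sum().clamp(min=1.0)
```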
```yaml
dataset:
  use_da3_depth: True
```

Building on DA3 depth distillation, pixel-wise uncertainty prediction is further introduced to adaptively model the model's confidence in the DA3 pseudo-labels. Regions with high uncertainty (such as reflective surfaces and occlusion boundaries) automatically receive a lower distillation-loss weight, mitigating the negative impact of noisy pseudo-labels:
$$\mathcal{L}_{\text{distill}} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left( \frac{\left| \hat{d}(p) - d_{\text{DA3}}(p) \right|}{\sigma(p)} + \log \sigma(p) \right)$$

where $\hat{d}(p)$ is the predicted depth, $d_{\text{DA3}}(p)$ the pseudo-label, and $\sigma(p)$ the predicted per-pixel uncertainty over the valid pixels $\Omega$. A large $\sigma(p)$ down-weights the residual at pixel $p$, while the $\log \sigma(p)$ term keeps the uncertainty from growing without bound (a standard heteroscedastic formulation; the exact form used here may differ).
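The uncertainty-guided weighting can be sketched as a heteroscedastic L1 loss; `log_sigma` as a per-pixel head output is an assumption for illustration:

```python
import torch

def uncertainty_distill_loss(pred_depth, da3_depth, log_sigma):
    """Uncertainty-guided distillation: high-uncertainty pixels (large sigma)
    contribute less, and the log term stops sigma from growing unboundedly."""
    valid = (da3_depth > 0).float()
    resid = (pred_depth - da3_depth).abs()
    loss = resid * torch.exp(-log_sigma) + log_sigma
    return (loss * valid).sum() / valid.sum().clamp(min=1.0)
```

Predicting `log_sigma` instead of `sigma` directly keeps the division numerically stable and lets the head output any real value.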
```yaml
dataset:
  use_da3_depth: True    # Must enable DA3 depth distillation
distill:
  lambda: 0.5
  loss_type: 'l1'        # 'l1' or 'silog'
  foreground_weight: 5.0
  use_uncertainty: True  # Enable uncertainty-guided adaptive depth distillation
```

Thanks to the excellent implementations of MonoDLE and CenterNet.
Open sourced under the MIT License.