Abstract: Current LiDAR-only 3D detection methods are inherently limited by point cloud sparsity: LiDAR-scanned point clouds are far sparser at long range than at short range, which leads to an imbalance between positive and negative samples during training. To address this, we propose MCA-VoxelNet, a novel multi-modal framework based on pseudo-point-cloud fusion. It consists of two key designs. First, pseudo-point clouds generated by depth completion alleviate point cloud sparsity, while a distance-aware sampling module discards large numbers of redundant nearby voxels to improve computational efficiency. Second, a multi-stage cascaded attention detection structure aggregates target features across multiple detection stages, balances the numbers of positive and negative samples, and progressively refines the region proposals produced by the Region Proposal Network. Experiments on the KITTI autonomous driving dataset demonstrate that MCA-VoxelNet achieves an inference speed of 17.54 FPS and car detection accuracies of 94.19%, 85.93%, and 86.17% at the easy, moderate, and hard difficulty levels, respectively, outperforming the second-best method by 2.64%, 1.16%, and 1.91%.
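
To make the distance-aware sampling idea concrete, the following NumPy sketch illustrates one plausible reading of it: points (or voxel centers) beyond a range threshold are always kept, while the dense near-range region is randomly subsampled. The function name, the 20 m radius, and the 0.5 keep ratio are illustrative assumptions for this sketch, not the paper's actual module or parameters.

```python
import numpy as np

def distance_aware_sampling(points, near_radius=20.0, keep_ratio=0.5, seed=0):
    """Illustrative sketch: keep all points beyond near_radius, randomly
    retain keep_ratio of the dense near-range points. Thresholds and the
    planar-range criterion are assumptions, not the paper's exact design."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(points[:, :2], axis=1)  # planar range from the sensor
    keep = dist >= near_radius                    # far points are always kept
    near_idx = np.flatnonzero(~keep)              # candidates for subsampling
    n_keep = int(len(near_idx) * keep_ratio)
    if n_keep > 0:
        keep[rng.choice(near_idx, size=n_keep, replace=False)] = True
    return points[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    near = rng.uniform(-15.0, 15.0, (8000, 3))  # dense near-range cluster
    far = rng.uniform(-60.0, 60.0, (2000, 3))   # sparse long-range points
    cloud = np.vstack([near, far])
    sampled = distance_aware_sampling(cloud)
    print(cloud.shape, "->", sampled.shape)     # near region thinned, far kept
```

Under this reading, the design trades redundant near-range density for throughput: distant regions, where sparsity already limits detection, lose nothing, while the oversampled near field is thinned before voxelization.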