Abstract: Most 3D object detection methods for point cloud data suffer from two limitations: local receptive fields prevent them from modeling long-range feature dependencies, and window-partitioning strategies destroy the topological structure of the point cloud. To address these issues, this article proposes a 3D object detection method based on global voxel feature interaction. First, a long-range context feature extraction module based on the Hilbert space-filling curve and Mamba is designed. It serializes the voxel space with Hilbert curve ordering, which preserves spatial locality among voxels, and exploits Mamba's ability to process long sequences to capture point cloud context features with long-range dependencies, significantly enhancing the modeling of global contextual relationships. Second, an adaptive voxel diffusion module based on feature map intensity is introduced; by dynamically generating diffused voxels, it enables large-scale long-range feature interactions between voxels and strengthens the semantic representation of target center voxels. Furthermore, a spatial feature recovery operator is proposed to compensate for the information lost during serialization and aggregation, combining the local structure preservation of submanifold convolution with the global modeling capability of Mamba to jointly optimize local and global feature representations. Experiments on the KITTI dataset show that the method achieves state-of-the-art performance, with 82.36%, 61.96%, and 66.05% accuracy on the car, pedestrian, and cyclist classes at moderate difficulty, while maintaining a high inference speed of 19 frames per second (FPS), striking a superior balance between accuracy and efficiency. In addition, qualitative comparisons with other methods in real road scenes demonstrate that the proposed method has strong generalization ability and practical application potential.
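The Hilbert-curve serialization mentioned in the abstract can be illustrated with a minimal 2D sketch. The `xy2d` helper below is the standard iterative Hilbert-index computation, and the voxel coordinates are toy values for illustration; this is an assumption-laden sketch, not the paper's implementation (which operates on 3D voxel grids).

```python
def xy2d(n, x, y):
    """Map a 2D cell (x, y) on an n-by-n grid (n a power of two)
    to its distance d along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the recursion stays consistent.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Toy voxel coordinates (illustrative): serialize them by Hilbert index.
voxels = [(3, 1), (0, 0), (2, 2), (1, 3)]
serialized = sorted(voxels, key=lambda v: xy2d(4, v[0], v[1]))

# Consecutive cells along the Hilbert curve are always spatially
# adjacent (Manhattan distance 1) -- the locality property that makes
# this ordering attractive for feeding voxels to a sequence model.
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda v: xy2d(4, v[0], v[1]))
adjacent = all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
               for a, b in zip(order, order[1:]))
```

Because neighboring sequence positions correspond to neighboring voxels, a long-sequence model such as Mamba can capture long-range context without scrambling local structure, unlike a raster-scan or window-partitioned ordering.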