Abstract: In recent years, with the rapid development of new energy vehicles, 3D object detection has become increasingly important as a core foundation of autonomous driving technology. Strategies that fuse multimodal information, such as radar point clouds and images, can significantly improve the accuracy and robustness of object detection. Inspired by BEVDet, this paper proposes an improved multimodal-fusion 3D object detection method based on the BEV (Bird's Eye View) perspective. The method employs a ConvNeXt backbone combined with an FPN-DCN structure to efficiently extract image features, and uses a deformable cross-attention mechanism to deeply fuse image and point cloud data, further improving detection accuracy. Experiments on the nuScenes autonomous driving dataset demonstrate the superior performance of our model, which achieves an NDS of 64.9% on the test set, significantly outperforming most existing detection methods.
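To make the fusion step concrete, the following is a minimal NumPy sketch of the deformable cross-attention idea mentioned above: a BEV query predicts a small set of sampling offsets around a reference point on an image feature map, bilinearly samples the features there, and combines them with softmax attention weights. All function names, shapes, and parameters here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    # feat: (H, W, C); sample at fractional (x, y) with border clamping.
    H, W, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    def px(ix, iy):
        ix = min(max(ix, 0), W - 1)
        iy = min(max(iy, 0), H - 1)
        return feat[iy, ix]
    return ((1 - wx) * (1 - wy) * px(x0, y0) + wx * (1 - wy) * px(x1, y0)
            + (1 - wx) * wy * px(x0, y1) + wx * wy * px(x1, y1))

def deformable_cross_attention(query, ref_xy, feat, W_off, W_att):
    # query:  (C,) one BEV query vector (hypothetical shapes).
    # ref_xy: (x, y) reference point on the image feature map.
    # W_off:  (C, K*2) linear layer predicting K 2D sampling offsets.
    # W_att:  (C, K)   linear layer predicting K attention logits.
    K = W_att.shape[1]
    offsets = (query @ W_off).reshape(K, 2)
    logits = query @ W_att
    w = np.exp(logits - logits.max())
    w /= w.sum()                                   # softmax over K samples
    samples = np.stack([bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy)
                        for dx, dy in offsets])    # (K, C) sampled features
    return w @ samples                             # weighted sum -> (C,)
```

In a full model this would run per BEV query and per camera view, with the offset and attention weights learned jointly with the rest of the network; the sketch only shows the sampling-and-weighting core.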