Abstract: In the field of computer vision, monocular depth estimation has attracted significant attention owing to its importance in applications such as autonomous driving and scene reconstruction. However, existing self-supervised monocular depth estimation methods fail to fully exploit low-level features, which leads to poor depth estimation of object contours. To address this issue, this paper proposes a multi-scale feature fusion decoding method. The original RGB image is progressively downsampled with Gaussian filtering to obtain feature maps at multiple levels, which are then upsampled in the same Gaussian manner; a Laplacian pyramid is constructed from the feature maps of matching resolution produced during downsampling and upsampling. During decoding, the contour cues lost in downsampling are fused at each scale with the features extracted by the encoder, guiding the decoder to generate more accurate depth maps and making fuller use of the encoder's low-level features. On the KITTI dataset, compared with the baseline Monodepth2, the proposed method reduces the absolute relative error (Abs Rel) by 1.69%, the squared relative error (Sq Rel) by 6.80%, and the root mean square error (RMSE) by 1.00%, indicating improved accuracy of global depth estimation; qualitative analysis further confirms a clear improvement in depth estimation around object contours.
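
The following is a minimal sketch, not the authors' implementation, of the idea summarized above: building a Laplacian pyramid from Gaussian down/up-sampling of the input image and fusing each pyramid level (the contour cue) with a decoder feature map of the same resolution. Names such as gaussian_blur, build_laplacian_pyramid, and FusionBlock are hypothetical and introduced only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel(channels, size=5, sigma=1.0):
    # 2-D Gaussian kernel replicated per channel for a depthwise convolution.
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).unsqueeze(0)                      # (1, size)
    k2d = (g.t() @ g).unsqueeze(0).unsqueeze(0)          # (1, 1, size, size)
    return k2d.repeat(channels, 1, 1, 1)                 # (C, 1, size, size)

def gaussian_blur(x, sigma=1.0):
    c = x.shape[1]
    k = gaussian_kernel(c, sigma=sigma).to(x.device, x.dtype)
    return F.conv2d(x, k, padding=k.shape[-1] // 2, groups=c)

def build_laplacian_pyramid(img, levels=4):
    # Gaussian-downsample, Gaussian-upsample back, and keep the residuals:
    # each residual carries the contour detail lost at that level.
    pyramid, current = [], img
    for _ in range(levels):
        down = F.interpolate(gaussian_blur(current), scale_factor=0.5,
                             mode="bilinear", align_corners=False)
        up = gaussian_blur(F.interpolate(down, size=current.shape[-2:],
                                         mode="bilinear", align_corners=False))
        pyramid.append(current - up)   # contour cue at this resolution
        current = down
    return pyramid

class FusionBlock(nn.Module):
    # Fuses a decoder feature map with the Laplacian contour cue of the
    # same spatial resolution before the next decoder upsampling step.
    def __init__(self, feat_ch, lap_ch=3, out_ch=None):
        super().__init__()
        out_ch = out_ch or feat_ch
        self.conv = nn.Sequential(
            nn.Conv2d(feat_ch + lap_ch, out_ch, kernel_size=3, padding=1),
            nn.ELU(inplace=True),
        )

    def forward(self, feat, lap):
        lap = F.interpolate(lap, size=feat.shape[-2:], mode="bilinear",
                            align_corners=False)
        return self.conv(torch.cat([feat, lap], dim=1))
```

In such a setup, the pyramid level whose resolution matches a given decoder stage would be concatenated with the encoder skip features at that stage, so the decoder receives the high-frequency contour information explicitly rather than relying only on what survives the encoder.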