Abstract: Stereo matching networks based on deep learning achieve high accuracy, but their complex model structures sharply increase computing time. To balance matching speed and accuracy, this paper proposes a stereo matching network based on selective fusion of contextual information. First, the cost volume is constructed with a correlation layer, and a single encoder-decoder structure is used in the aggregation module to reduce model complexity. Second, multi-scale cost volumes are fused in the encoder to capture disparity information at different levels, and a selective context information fusion module is designed in the decoder, which uses the context features of the reference image to guide high-quality decoding of geometric information. Finally, a multi-branch aggregation pyramid pooling module is designed to strengthen the encoder-decoder module's understanding of the global context. Experimental results show a mismatch rate of 1.97% over all regions on the KITTI2015 dataset and a three-pixel error of 1.50% on the KITTI2012 dataset. Compared with other algorithms, the proposed algorithm achieves higher stereo matching accuracy while meeting real-time requirements.
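The abstract names a correlation layer for building the cost volume but gives no further detail. As a minimal sketch, assuming the common formulation in which the cost at disparity d is the channel-averaged dot product between a left feature and the right feature shifted by d pixels (not necessarily the authors' exact layer), the construction could look like the following. The function name `correlation_cost_volume` and the parameter `max_disp` are illustrative and do not come from the paper.

```python
import numpy as np

def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Build a correlation cost volume (illustrative sketch, not the paper's code).

    left_feat, right_feat: float arrays of shape (C, H, W).
    Returns an array of shape (max_disp, H, W) where entry [d, y, x] is the
    mean over channels of left_feat[:, y, x] * right_feat[:, y, x - d].
    Positions with x < d have no valid match and are left at zero.
    """
    _, h, w = left_feat.shape
    volume = np.zeros((max_disp, h, w), dtype=np.float32)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (left_feat * right_feat).mean(axis=0)
        else:
            # Left pixel x is compared with right pixel x - d.
            volume[d, :, d:] = (left_feat[:, :, d:] * right_feat[:, :, :-d]).mean(axis=0)
    return volume

# Synthetic check: the right view is the left view shifted by 2 pixels,
# so the true disparity is 2 wherever a match exists.
rng = np.random.default_rng(0)
left = rng.standard_normal((8, 4, 16)).astype(np.float32)
left /= np.linalg.norm(left, axis=0, keepdims=True)  # unit-norm feature per pixel
right = np.zeros_like(left)
right[:, :, :-2] = left[:, :, 2:]

vol = correlation_cost_volume(left, right, max_disp=6)
disparity = vol.argmax(axis=0)  # winner-takes-all estimate, shape (H, W)
```

Compared with concatenating left and right features into a 4D volume, a correlation volume collapses the channel dimension to a single similarity value per disparity, so later aggregation can use 2D rather than 3D convolutions. That trade of some feature detail for speed is consistent with the real-time goal stated in the abstract.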