Abstract: A deep feature interaction network is proposed to address two problems in existing methods: insufficient use of the complementary information between modalities, and the tendency of feature interaction to introduce noise. First, a deep feature multilayer interaction module is proposed for the encoding stage; it uses deep features as cues for feature interaction, fully exploiting the texture information of the visible-light modality and the position information of the thermal modality. Second, a texture-position feature interaction module is designed to exchange texture and position information, achieving feature complementarity between same-level layers. Then, a dilated-convolution feature fusion module is proposed for the decoding stage; its dilated convolution blocks enlarge the model's receptive field so that the model attends to multi-scale information in the network. Finally, extensive experiments on the public RGB-T datasets VT5000, VT1000, and VT821 show that the proposed network achieves mean absolute errors of 2.2%, 1.5%, and 2.5%, respectively, and performs excellently compared with state-of-the-art methods in the field.
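
The dilated-convolution fusion idea mentioned in the abstract can be illustrated with a minimal PyTorch sketch. This is not the paper's exact design: the class name `DilatedFusionBlock`, the dilation rates `(1, 2, 4)`, and the branch-then-project layout are illustrative assumptions; only the general technique of enlarging the receptive field with dilated convolutions comes from the abstract.

```python
import torch
import torch.nn as nn

class DilatedFusionBlock(nn.Module):
    """Hypothetical sketch of a dilated-convolution fusion block.

    Parallel 3x3 convolutions with increasing dilation rates enlarge the
    receptive field without extra downsampling; their outputs are
    concatenated and projected back to the input channel count, so the
    block can attend to multi-scale context in the decoder features.
    """

    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                # padding == dilation keeps the spatial size for a 3x3 kernel
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        )
        # 1x1 projection merges the multi-scale branches
        self.project = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(multi_scale)


if __name__ == "__main__":
    block = DilatedFusionBlock(channels=64)
    feats = torch.randn(1, 64, 80, 80)  # e.g. fused RGB-T decoder features
    print(block(feats).shape)           # torch.Size([1, 64, 80, 80])
```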