Abstract: Single-modal detection methods suffer from accuracy degradation and high miss rates on small targets when detecting key components in UAV aerial inspection images of transmission lines. To address these problems, a multimodal multi-scale target detection approach that integrates visible light and infrared images is proposed. First, the network constructs a parallel dual-stream feature extraction backbone that processes visible light and infrared images simultaneously, exploiting the rich color and texture detail of the visible light images together with the imaging stability and high contrast of the infrared images. Next, to facilitate cross-modal information interaction and complementarity, a Multimodal Feature Fusion Interactive Module (MFIFM) is developed. This module dynamically adjusts the fusion weights of features from the different modalities, adaptively integrating the most discriminative information and mitigating conflicts that arise from modality differences. Additionally, to enhance the perception of small target components, a Hybrid Residual Multi-Scale Transformer (HRMS Transformer) module is incorporated into the dual-stream backbone. Through a multi-head attention mechanism, hierarchical feature reorganization, and a residual-based strategy, this module strengthens the model's ability to extract global context information. Experimental results show that the model's mean Average Precision (mAP) at IoU thresholds of 0.50 and 0.50:0.95 improves by 5.35% and 4.48%, respectively, compared with existing single-modal methods. These findings confirm the effectiveness and applicability of multimodal fusion in transmission line inspection.
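
The adaptive weighting idea behind the fusion module can be illustrated with a minimal Python sketch. This is not the MFIFM itself, whose internal structure the abstract does not specify. It assumes that each channel of the visible and infrared features receives a pair of gating scores, normalized by a softmax so that the two modality weights sum to 1. The function names `softmax` and `fuse_channels` and the gating scores are hypothetical and introduced only for illustration. In a trained network the scores would come from a learned branch, whereas here they are passed in directly.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(
    visible: Sequence[float],
    infrared: Sequence[float],
    visible_scores: Sequence[float],
    infrared_scores: Sequence[float],
) -> List[float]:
    """Fuse two per-channel feature vectors with per-channel modality weights.

    The scores stand in for the output of a learned gating branch
    (an assumption, not taken from the paper). For each channel, the
    two scores are normalized so the modality weights sum to 1.
    """
    lengths = {len(visible), len(infrared), len(visible_scores), len(infrared_scores)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same number of channels")

    fused = []
    for v, r, s_vis, s_ir in zip(visible, infrared, visible_scores, infrared_scores):
        w_vis, w_ir = softmax([s_vis, s_ir])
        fused.append(w_vis * v + w_ir * r)
    return fused


if __name__ == "__main__":
    # Equal scores give equal weights; a dominant score favors one modality.
    print(fuse_channels([1.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]))
    print(fuse_channels([1.0, 0.0], [3.0, 4.0], [50.0, -50.0], [-50.0, 50.0]))
```

With equal scores the fused feature is the plain average of the two modalities. As one score dominates, the output approaches the feature of the favored modality, which is the behavior a weighting scheme needs in order to suppress an unreliable modality on a per-channel basis.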