Abstract: To address the challenges of dense small targets, multi-scale variation, and interference from complex weather backgrounds in roadside-perspective target detection, we propose MF-YOLO, a target detection algorithm based on multi-scale feature fusion and interaction. First, we design the C2f-CAST module, which interacts and transforms features from different subspaces through star operations, and introduce MLCA to capture local, global, channel, and spatial features between distant pixels. The resulting multi-scale information aggregation strengthens attention to the salient semantic information of occluded objects and suppresses background interference. Second, to address the low efficiency of context information fusion in the neck layer, we replace traditional convolution with the lightweight convolution GSConv and design a cross-level partial network module, VoV-GSCSP, to reduce model complexity and parameter count. Third, we construct a cross-level fusion module, SDFM, which self-calibrates shallow feature maps and fuses them with semantic information from deep feature maps to reduce missed detections of small targets. Finally, we build an improved loss function, WPIoU, from an adaptive penalty factor and a gradient adjustment function for anchor box quality, combined with a dynamic clustering mechanism, to improve bounding box regression and detection robustness. Experimental results show that MF-YOLO achieves mAP@0.5 of 85.1% and 92.3% on the DAIR-V2X-I and UA-DETRAC datasets, respectively, which is 4.4% and 1.8% higher than the original YOLOv8s, while reducing computational complexity by 19.8% and parameter count by 8.18%. The detection speed reaches 152 fps, meeting real-time requirements.
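For readers unfamiliar with the mAP@0.5 metric reported above: a predicted box is counted as a correct detection when its intersection over union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that matching criterion only. It uses plain IoU, not the WPIoU loss (whose formulation is not given in this abstract), and the function names `iou` and `is_true_positive` are illustrative assumptions rather than part of MF-YOLO.

```python
def iou(box_a, box_b):
    """Plain IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: the IoU >= 0.5 test behind mAP@0.5."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    # Prediction covers 80 of the 100 union pixels: IoU = 0.8, so it matches.
    print(is_true_positive((0, 0, 10, 10), (0, 0, 10, 8)))
    # Small overlap: IoU = 1/7, so it does not match at the 0.5 threshold.
    print(is_true_positive((0, 0, 2, 2), (1, 1, 3, 3)))
```

A full mAP evaluation additionally requires class agreement, ranking predictions by confidence, and matching each ground-truth box at most once; the sketch covers only the overlap test.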