Abstract: To address the challenges posed by small targets and complex backgrounds in aerial remote sensing images, a lightweight object detection algorithm named ELS-RTDETR, built on RT-DETR, is proposed. The algorithm replaces the original backbone with LOB-Vovnet, a new backbone network derived from VoVNet. Within LOB-Vovnet, a novel feature enhancement module named LRFF (Lightweight Receptive Field Focus) is designed to improve the detection accuracy of small targets. To suppress complex background interference, the SE (Squeeze-and-Excitation) attention mechanism, which is based on adaptive channel extraction, is introduced. To balance model accuracy against model size, LOB-Vovnet replaces some standard convolutions with depthwise separable convolutions, and the depth and width of the backbone network are readjusted through extensive ablation experiments. In the AIFI module, a Cascaded Group Attention (CGA) mechanism is introduced to reduce the computational redundancy of multi-head attention. For the dataset, the RSOD and NWPU VHR-10 datasets are merged, and offline data augmentation techniques such as affine transformations and camera noise are applied to the original data so that the training set more closely matches real-world conditions. Experimental results show that, compared with the original model, ELS-RTDETR improves mAP@50 by 2.7% while reducing the number of model parameters by 32.9%, and that it performs well on challenging targets. The effectiveness of the improved method is further validated on the SIMD dataset.
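The parameter saving from swapping a standard convolution for a depthwise separable one can be illustrated with a short calculation. The sketch below is not taken from the ELS-RTDETR implementation: the function names are hypothetical, the channel counts (128 in, 256 out) and the 3 x 3 kernel are assumed values chosen only for illustration, and bias terms are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dw_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a depthwise k x k convolution followed by a
    1 x 1 pointwise convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def reduction_ratio(c_in: int, c_out: int, k: int) -> float:
    """Separable / standard weight ratio; algebraically 1/c_out + 1/k**2."""
    return dw_separable_params(c_in, c_out, k) / conv_params(c_in, c_out, k)


if __name__ == "__main__":
    # Assumed layer shape, for illustration only.
    c_in, c_out, k = 128, 256, 3
    print("standard :", conv_params(c_in, c_out, k))
    print("separable:", dw_separable_params(c_in, c_out, k))
    print("ratio    :", round(reduction_ratio(c_in, c_out, k), 4))
```

For a 3 x 3 kernel the ratio approaches 1/9 as the number of output channels grows, which is why replacing only some convolutions can already shrink a backbone noticeably. The 32.9% overall reduction reported above also reflects the depth and width adjustments, so it cannot be derived from this per-layer ratio alone.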