Abstract: To address background interference in fine-grained images and the difficulty of identifying the most discriminative features in the target region, this paper proposes an improved CVT-based fine-grained image recognition algorithm. First, a target region localization module is introduced into the CVT model. This module extracts features of the target region by multilevel feature aggregation and determines the target region by thresholding. The original image is then cropped proportionally to reduce interference from background information. Second, a mechanism called MDCSAIA (Multi-Dimensional Channel Spatial-Aware Interaction) is proposed. This mechanism uses dimensional transformation to enable interaction between the spatial information of adjacent channels and the channel information of adjacent spatial positions, thereby enhancing the network's ability to perceive local details of the target region. Experimental results show that, compared to baseline algorithms, the proposed method improves recognition accuracy by 2.1%, 1.7%, and 1.5% on the CUB-200-2011, Stanford Cars, and Stanford Dogs datasets, respectively, which validates the effectiveness of the proposed approach.
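
The localization step described in the abstract (aggregate multilevel features, threshold, crop proportionally) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `locate_target_region` and `crop_to_target`, the channel-sum aggregation, the mean-based threshold, and the assumption that all feature levels share one spatial size are hypothetical choices made only for illustration.

```python
import numpy as np


def locate_target_region(feature_maps, image_hw, threshold_scale=1.0):
    """Return (top, left, bottom, right) of the estimated target region.

    feature_maps: list of arrays shaped (C, H, W); all levels are assumed
    (for this sketch) to share the same H and W.
    image_hw: (height, width) of the original image.
    """
    # Aggregate: sum each level over channels, then average across levels.
    activation = np.mean([fm.sum(axis=0) for fm in feature_maps], axis=0)

    # Threshold decision: keep positions above a multiple of the mean
    # activation (an assumed rule; the paper's threshold may differ).
    mask = activation > threshold_scale * activation.mean()

    img_h, img_w = image_hw
    if not mask.any():
        # No position passes the threshold: fall back to the full image.
        return 0, 0, img_h, img_w

    # Bounding box of the retained positions on the feature grid.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))

    # Scale the box proportionally from feature grid to image coordinates.
    h, w = activation.shape
    top = int(rows[0]) * img_h // h
    bottom = (int(rows[-1]) + 1) * img_h // h
    left = int(cols[0]) * img_w // w
    right = (int(cols[-1]) + 1) * img_w // w
    return top, left, bottom, right


def crop_to_target(image, feature_maps, threshold_scale=1.0):
    """Crop an (H, W, ...) image to the estimated target region."""
    top, left, bottom, right = locate_target_region(
        feature_maps, image.shape[:2], threshold_scale
    )
    return image[top:bottom, left:right]
```

In this sketch the cropped image would be passed through the network again, so that the second pass sees mostly the object rather than the background. A larger `threshold_scale` gives a tighter crop at the risk of cutting off part of the object.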