Abstract: In weakly supervised object localization, combining deep and shallow features by hard fusion can cause the network to focus excessively on the most discriminative regions or to mistake the background for the object. To address this issue, this paper proposes a weakly supervised object localization method based on soft fusion of deep and shallow features and on positive-negative sample contrast. First, a soft fusion strategy is proposed: a foreground generator produces foreground prediction maps from both the shallow and the deep features, and a reverse supervision operation then guides the network to gradually learn multi-level fine-grained features, so that the shallow and deep features optimize each other. Second, drawing on contrastive learning, a positive-negative sample contrastive loss function is proposed. By constructing positive and negative samples, this loss guides the network to attend to foreground regions during training while suppressing interference from background noise. Experiments on the CUB-200-2011 and ILSVRC-2012 datasets yield localization accuracies of 95.77% and 72.90%, respectively, demonstrating the effectiveness and applicability of the proposed method for weakly supervised object localization.
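To make the idea of positive-negative sample contrast concrete, the following is a minimal sketch and not the paper's actual loss, whose exact form the abstract does not state. It assumes that foreground prediction maps act as soft masks for pooling per-location features, that the foreground embeddings obtained from the deep and shallow maps form a positive pair, and that the background embedding serves as the negative. The function names (`masked_pool`, `cosine`, `contrastive_loss`), the InfoNCE-style formulation, and the temperature value are all illustrative assumptions.

```python
import math

def masked_pool(features, mask):
    """Average per-location feature vectors, weighted by a soft mask in [0, 1]."""
    total = sum(mask)
    dim = len(features[0])
    if total == 0:
        return [0.0] * dim
    return [sum(m * f[d] for f, m in zip(features, mask)) / total
            for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def contrastive_loss(anchor, positives, negatives, temperature=0.5):
    """InfoNCE-style loss (an assumed form): low when the anchor is close
    to the positives and far from the negatives."""
    pos = sum(math.exp(cosine(anchor, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy data (hypothetical): four spatial locations with 2-D features.
# The first two locations belong to the object, the last two to the background.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
deep_mask = [0.9, 0.8, 0.1, 0.0]      # foreground map from deep features
shallow_mask = [0.8, 0.9, 0.0, 0.1]   # foreground map from shallow features

fg_deep = masked_pool(features, deep_mask)
fg_shallow = masked_pool(features, shallow_mask)
bg_deep = masked_pool(features, [1.0 - m for m in deep_mask])

# Foreground vs. foreground is the positive pair; background is the negative.
loss = contrastive_loss(fg_deep, [fg_shallow], [bg_deep])
# Swapping the roles should give a larger loss on this toy data.
swapped = contrastive_loss(fg_deep, [bg_deep], [fg_shallow])
print(loss < swapped)
```

In this toy setting, minimizing such a loss would pull the two foreground embeddings together and push them away from the background embedding, which is one plausible way to realize the foreground focus and background suppression described above.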