Abstract: To address the limitations of the CAM++ model in feature extraction and recognition performance under complex acoustic conditions, this paper proposes TF-DCAM, a speaker verification model that integrates dilated convolution with time-frequency multi-scale attention mechanisms. The model enhances feature representation through dilated residual convolution and a time-frequency adaptive refocusing unit that suppresses redundant information. A time-frequency multi-scale attention module is introduced to improve sensitivity to key information via channel attention and cross-dimensional interaction. An adaptive masking temporal convolution module is further incorporated to model long-term dependencies effectively. Finally, a combination of contrastive loss functions is applied to jointly optimize the speaker embedding space. Experiments on the CN-Celeb dataset show that TF-DCAM reduces EER and minDCF by 14.98% and 10.98%, respectively, compared with the baseline. The model also demonstrates strong cross-lingual generalization on the VoxCeleb1 dataset. Results indicate that the proposed method significantly improves speaker verification performance and robustness while maintaining model efficiency.
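To make the abstract's description of channel attention combined with cross-dimensional (time and frequency) interaction concrete, the following is a minimal PyTorch sketch of such a block. It is an illustrative assumption, not the authors' TF-DCAM implementation: the class name, layer sizes, kernel widths, and reduction ratio are all hypothetical.

```python
import torch
import torch.nn as nn

class TimeFrequencyAttention(nn.Module):
    """Illustrative sketch (not the paper's code): channel attention followed by
    separate attention maps over the time and frequency axes of a spectrogram-like
    feature map. All hyperparameters here are assumptions for demonstration."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: global average pooling over (freq, time), bottleneck MLP.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Cross-dimensional interaction: 1-D attention along time (pooled over
        # frequency) and along frequency (pooled over time).
        self.time_conv = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid()
        )
        self.freq_conv = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        x = x * self.channel_mlp(x)                 # channel re-weighting
        time_attn = self.time_conv(x.mean(dim=2))   # (B, 1, T)
        freq_attn = self.freq_conv(x.mean(dim=3))   # (B, 1, F)
        x = x * time_attn.unsqueeze(2)              # broadcast over frequency
        x = x * freq_attn.unsqueeze(3)              # broadcast over time
        return x


if __name__ == "__main__":
    feats = torch.randn(4, 64, 80, 200)             # (batch, channels, mel bins, frames)
    out = TimeFrequencyAttention(64)(feats)
    print(out.shape)                                # torch.Size([4, 64, 80, 200])
```

The design choice sketched here, re-weighting channels first and then applying lightweight 1-D attention along each spatial axis, is one common way to realize cross-dimensional interaction at low cost; the actual TF-DCAM module may differ.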