Speaker verification method based on dilated convolution and multi-scale attention mechanism

DOI:
CSTR:
Author:
Affiliation:

1. School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China; 2. Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, Guilin University of Electronic Technology, Guilin 541004, China

Author biography:

Corresponding author:

CLC number: TN912.34

Fund project: Supported by the Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education (CRKL230103)

Abstract:

To address the limitations of the CAM++ model in feature extraction and recognition performance under complex acoustic conditions, this paper proposes TF-DCAM, a speaker verification model integrating dilated convolution and a time-frequency multi-scale attention mechanism. The model first enhances feature extraction through dilated residual convolution and a time-frequency refocusing mechanism, improving the suppression of redundant information. A time-frequency multi-scale attention module is then introduced to sharpen the model's perception of key information via channel attention and cross-dimensional interaction. An adaptive masking temporal convolution module further strengthens long-term dependency modeling. Finally, a contrastive loss function is applied to jointly optimize the structure of the embedding space. Experiments on the CN-Celeb dataset show that TF-DCAM reduces EER and minDCF by 14.98% and 10.98%, respectively, compared with the baseline, and the model also demonstrates strong cross-lingual generalization on VoxCeleb1. The results indicate that the proposed method significantly improves speaker verification performance and robustness while remaining lightweight.
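As background for the dilated convolution the model builds on, a minimal 1-D dilated convolution can be sketched in plain Python. This is an illustrative sketch with hypothetical kernel values, not the paper's implementation: increasing the dilation factor widens the receptive field without adding parameters, which is what lets the dilated residual blocks capture longer context.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """1-D dilated convolution (valid padding):
    output[t] = sum_j kernel[j] * x[t + j * dilation]."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of a single layer
    return [
        sum(kernel[j] * x[t + j * dilation] for j in range(k))
        for t in range(len(x) - span + 1)
    ]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# dilation=1 behaves like an ordinary convolution (receptive field 3);
# dilation=2 skips every other sample (receptive field 5)
print(dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=1))  # → [6.0, 9.0, 12.0, 15.0, 18.0]
print(dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=2))  # → [9.0, 12.0, 15.0]
```

Stacking such layers with dilations 1, 2, 4, … grows the receptive field exponentially with depth, which is the usual motivation for dilated (atrous) convolution in speech front-ends.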

Citation:

LI Jiaqi, ZHENG Zhanheng, ZENG Qingning, WANG Jian. Speaker verification method based on dilated convolution and multi-scale attention mechanism[J]. Electronic Measurement Technology, 2025, 48(22): 119-128.
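The EER metric reported in the abstract is the operating point where the false-rejection and false-acceptance rates coincide. A minimal threshold-sweep computation over toy trial scores (illustrative data, not from the paper) shows how it is obtained:

```python
def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep a decision threshold over all observed scores
    and return the error rate where FRR and FAR are closest."""
    best = None
    for thr in sorted(target_scores + nontarget_scores):
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2)
    return best[1]

# hypothetical verification scores: higher = more likely the same speaker
tgt = [0.9, 0.8, 0.7, 0.6, 0.3]   # genuine (target) trials
non = [0.5, 0.4, 0.2, 0.1, 0.05]  # impostor (non-target) trials
print(eer(tgt, non))  # → 0.2
```

minDCF is computed from the same FRR/FAR curve but weights the two error types by application-dependent costs and priors, so the two metrics can rank systems differently.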

History
  • Received:
  • Revised:
  • Accepted:
  • Published online: 2026-01-09
  • Publication date: