Decoding Strategies for Dysarthric Speech Recognition
Author: 樊紫岩, 朱耀东, 赵伟岚, 李轩逸

Affiliation:

1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. College of Mechanical Engineering, Jiaxing University, Jiaxing 314001, China; 3. College of Information Science and Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China

CLC number: TN912.34; R741

Fund project: Supported by the Key Research and Development Program of Zhejiang Province (2017C01043) and the Zhejiang Provincial Key Laboratory of Medical Electronics and Digital Health (MEDH202206).



Abstract:

Dysarthric speech arises from neurological disorders that cause motor impairments of the articulatory organs, resulting in abnormal pronunciation and prosody, which poses significant challenges to conventional automatic speech recognition (ASR) systems. To address these issues, this paper proposes an algorithm that combines a multi-level representation fusion decoding strategy with hotword boosting. Built upon a Transformer-based encoder-decoder architecture, the approach replaces conventional single-view decoding with multi-level representation fusion, realized through three distinct fusion strategies that strengthen the model's ability to comprehend complex sentences and contextual information. To further improve recognition accuracy for dysarthric speech, hotword boosting is integrated into the beam-search decoding process, assigning higher weights to key terms. Experiments were conducted on the TORGO and UASpeech datasets. The results show that the proposed method significantly reduces the word error rate (WER) compared with other baseline models: relative to the Whisper baseline, the WER drops from 38.31% to 27.18% on UASpeech and from 16.38% to 12.67% on TORGO, demonstrating the effectiveness of the method in improving dysarthric speech recognition accuracy.
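
The abstract only describes the two components at a high level, so the Python sketches below are illustrative readings of the general ideas, not the authors' implementation. The first assumes one plausible fusion strategy, a softmax-weighted sum of hidden states from several encoder layers fed to the decoder in place of a single layer's output; the paper's three fusion strategies are not specified here, and all names, shapes, and weights are invented for illustration.

```python
import numpy as np

def fuse_encoder_layers(layer_outputs, weights):
    """Weighted-sum fusion of hidden states from several encoder layers.

    layer_outputs: list of arrays, each of shape (T, d_model)
    weights: one unnormalized score per layer, softmax-normalized here
    Returns a single (T, d_model) representation for the decoder to attend to.
    """
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    return sum(wi * h for wi, h in zip(w, layer_outputs))

# Toy usage: fuse three "encoder layers" of a 5-frame utterance.
T, d_model = 5, 8
layers = [np.random.randn(T, d_model) for _ in range(3)]
fused = fuse_encoder_layers(layers, np.array([0.2, 0.5, 1.0]))
print(fused.shape)  # (5, 8) -- same shape as one layer's output
```

The second sketch shows how hotword boosting can be folded into beam-search decoding: any expansion that emits a word from the hotword list receives an additive log-domain bonus, so important key terms are less likely to be pruned. The toy decoder distribution, hotword list, and bonus value are assumptions, not values from the paper.

```python
import heapq
import math
from typing import Dict, List, Tuple

def next_token_log_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    # Stand-in for the decoder: a fixed toy distribution over a tiny vocabulary.
    vocab = {"the": 0.40, "cap": 0.25, "cup": 0.20, "is": 0.10, "<eos>": 0.05}
    return {w: math.log(p) for w, p in vocab.items()}

def beam_search_with_hotwords(beam_size: int = 3,
                              max_len: int = 4,
                              hotwords: Dict[str, float] = None):
    """Plain beam search, except expansions that emit a hotword get an
    additive log-domain bonus so key terms survive pruning."""
    hotwords = hotwords or {"cup": 1.5}           # assumed hotword and weight
    beams: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            if prefix and prefix[-1] == "<eos>":  # keep finished hypotheses as-is
                candidates.append((score, prefix))
                continue
            for token, logp in next_token_log_probs(prefix).items():
                bonus = hotwords.get(token, 0.0)  # hotword boosting happens here
                candidates.append((score + logp + bonus, prefix + (token,)))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams

if __name__ == "__main__":
    for score, hyp in beam_search_with_hotwords():
        print(f"{score:7.3f}  {' '.join(hyp)}")
```

With the assumed bonus, the boosted word "cup" outranks the acoustically more likely "cap" (log 0.20 + 1.5 > log 0.25), which is the intended effect of assigning higher weights to key terms during decoding.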

Cite this article:

樊紫岩, 朱耀东, 赵伟岚, 李轩逸. 构音障碍语音识别的解码策略研究 (Decoding strategies for dysarthric speech recognition) [J]. 电子测量技术 (Electronic Measurement Technology), 2025, 48(8): 10-17.

History
  • Online publication date: 2025-05-23