Lip-reading model based on multi-feature fusion

School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China

CLC number: TP391.4; TN911.73

Fund project: supported by the National Natural Science Foundation of China (62301220)




    Abstract:

Mainstream word-level lipreading models based on three-dimensional convolutional neural networks and residual networks struggle to capture the geometric dynamics of lip movements, and their reliance on pixel-level texture details makes them highly sensitive to noise and facial variations. To address these limitations, this paper proposes an end-to-end word-level lipreading model that integrates pixel-level texture detail features, geometry-level contour shape features, and word boundary features, achieving comprehensive multi-feature fusion across the temporal, spatial, pixel-level, and geometry-level dimensions. The proposed model incorporates the spatial and channel squeeze-and-excitation mechanism into the 3D CNN and ResNet-18 to enhance texture feature extraction, while a spatial-temporal graph convolutional network, improved with a global context network, strengthens global geometric relationships. Word boundary features further guide the model to focus on the relevant temporal frames, reducing noise sensitivity. The fused features are then processed by a back-end temporal module to complete the recognition task. Experiments show that with grayscale video as input, the model reaches 89.3% accuracy on the publicly available large-scale word-level lipreading dataset LRW, an improvement of 1.3%~3.9% over single- or partial-feature models under the same conditions and higher than most existing models, verifying the effectiveness of the proposed model. Moreover, when color video is used as input, accuracy further improves to 89.7%, confirming the contribution of color information to lipreading.
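As a rough illustration of the frame-level fusion the abstract describes, the sketch below concatenates per-frame texture and contour feature vectors and appends the word-boundary indicator as one extra channel. This is a minimal stand-in in plain Python; the frame count, feature widths, boundary interval, and the `fuse` helper are illustrative assumptions, not the paper's implementation:

```python
import random

random.seed(0)

T = 29                   # frames per LRW clip (assumed)
D_TEX, D_GEO = 512, 256  # per-frame texture / contour feature widths (assumed)

# Stand-ins for the two front-end outputs: one feature vector per frame.
texture = [[random.random() for _ in range(D_TEX)] for _ in range(T)]
contour = [[random.random() for _ in range(D_GEO)] for _ in range(T)]

# Word-boundary indicator: 1.0 on frames inside the spoken word,
# 0.0 on the surrounding context frames (interval chosen arbitrarily).
boundary = [1.0 if 5 <= t < 24 else 0.0 for t in range(T)]

def fuse(tex_t, geo_t, b_t):
    """Frame-level fusion: concatenate pixel-level and geometry-level
    features, then append the boundary flag as an extra channel."""
    return tex_t + geo_t + [b_t]

fused = [fuse(texture[t], contour[t], boundary[t]) for t in range(T)]

# Each fused frame is a (D_TEX + D_GEO + 1)-dimensional vector that a
# back-end temporal module would consume for classification.
print(len(fused), len(fused[0]))
```

In an actual model the fusion would operate on learned tensors rather than random lists, but the shape bookkeeping (per-frame concatenation plus a boundary channel, preserving the temporal axis for the back-end) is the point of the sketch.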

Cite this article:

Zhang Tianyu, Lyu Bo, Zhou Rong, Wang Lin, Pu Mengyang. Lip-reading model based on multi-feature fusion [J]. Electronic Measurement Technology (电子测量技术), 2025, 48(12): 166-175.

History
  • Online publication date: 2025-07-28
