Motion detection and recognition model based on fine-grained motion & situation fusion
Authors: 王峥, 赵新辉, 王小伟

Affiliation:

College of Physical Education, Zhengzhou University, Zhengzhou 450044, China


CLC number:

TN101

Fund projects:

Supported by the Youth Program of the National Natural Science Foundation of China (62306284) and the 2024 Henan Provincial Science and Technology Research Project (242102320282)



Abstract:

Accurately localizing and recognizing human motions in both the spatial and temporal dimensions is of significant importance for applications such as intelligent sports analysis. However, existing step-by-step human motion recognition methods are often limited by the fixed receptive field of RoI features, making it difficult to achieve effective modeling and semantic representation in complex scenes. To address this issue, this paper proposes a fine-grained motion & situation fusion (FMSF) network that organically fuses human representation features and global spatiotemporal situation features through parallel semantic modeling and motion proposal units. The semantic modeling unit employs a human localization model to generate fine-grained person candidate features from key frames and uses a 3D video backbone network to extract global spatiotemporal features; the motion proposal unit then uses a shared Transformer framework to jointly model these multi-modal features, capturing the complex interactions between people and their surroundings and yielding highly discriminative motion predictions. Furthermore, a weighted score aggregation strategy is introduced to integrate the motion classification results of multiple key frames and short video clips for long-video motion recognition. On the AVA-60 v2.2 dataset, the FMSF model achieves a frame-level mAP of 30.01%, while the long-video-strategy variant FMSF-Prolonged reaches 30.74%. On the Charades dataset, the mAP of FMSF rises to 30.68%, and that of FMSF-Prolonged rises to 31.29%.
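The two-unit design described in the abstract can be illustrated compactly. Below is a minimal PyTorch sketch, not the authors' implementation: the class name FMSFSketch, all layer dimensions, and the use of nn.TransformerEncoder as the shared Transformer are assumptions, and the person detector and 3D backbone are replaced by random feature tensors. It shows only the fusion step: person RoI tokens and global spatiotemporal context tokens enter one shared Transformer, and action logits are read off the person tokens.

# A minimal sketch (not the authors' code) of the two parallel units the
# abstract describes: person RoI tokens from a key frame and global
# spatiotemporal tokens from a 3D backbone are fused by one shared
# Transformer, which then scores actions per person. All layer sizes and
# names here are assumptions.
import torch
import torch.nn as nn

class FMSFSketch(nn.Module):
    def __init__(self, dim=256, num_classes=60, num_layers=4, num_heads=8):
        super().__init__()
        # Stand-ins for the semantic modeling unit's two feature sources:
        # a real system would use a person detector plus RoI pooling, and
        # a 3D video backbone such as SlowFast.
        self.person_proj = nn.Linear(1024, dim)   # RoI-pooled person feats
        self.context_proj = nn.Linear(2048, dim)  # global spatiotemporal feats
        self.type_embed = nn.Embedding(2, dim)    # 0 = person, 1 = context
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                           batch_first=True)
        self.shared_transformer = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, person_feats, context_feats):
        # person_feats:  (B, P, 1024) features for P person proposals
        # context_feats: (B, T, 2048) backbone features over T tokens
        p = self.person_proj(person_feats) + self.type_embed.weight[0]
        c = self.context_proj(context_feats) + self.type_embed.weight[1]
        tokens = torch.cat([p, c], dim=1)      # joint person + context set
        fused = self.shared_transformer(tokens)
        # Read action logits off the person tokens only.
        return self.classifier(fused[:, : person_feats.size(1)])

# Toy usage: 3 person proposals, 8 context tokens.
model = FMSFSketch()
logits = model(torch.randn(2, 3, 1024), torch.randn(2, 8, 2048))
print(logits.shape)  # torch.Size([2, 3, 60])

Tagging each token with a learned type embedding is one simple way to let a single shared encoder distinguish person tokens from context tokens; the paper's actual mechanism may differ.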

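The long-video strategy can likewise be sketched. The abstract does not spell out the weighting formula, so the softmax-over-confidence weighting below is purely an assumption, and aggregate_scores and temperature are hypothetical names. The idea shown is the one the abstract states: per-clip classification scores from multiple key frames and short clips are combined, with weights, into one long-video prediction.

# A minimal sketch (assumptions, not the paper's exact formula) of a
# weighted score aggregation: per-class action scores predicted on
# multiple key frames / short clips of a long video are combined into
# one video-level prediction. Here each clip's weight comes from its
# prediction confidence; the paper may weight differently.
import numpy as np

def aggregate_scores(clip_scores, temperature=1.0):
    """clip_scores: (num_clips, num_classes) per-clip sigmoid scores."""
    clip_scores = np.asarray(clip_scores, dtype=np.float64)
    # Confidence of a clip = its strongest class response.
    confidence = clip_scores.max(axis=1)
    weights = np.exp(confidence / temperature)
    weights /= weights.sum()                   # normalized clip weights
    return (weights[:, None] * clip_scores).sum(axis=0)

scores = [[0.9, 0.1, 0.2],   # clip 1: confident about class 0
          [0.4, 0.3, 0.3],   # clip 2: uncertain
          [0.8, 0.1, 0.6]]   # clip 3
print(aggregate_scores(scores).round(3))

Confidence-weighted averaging down-weights ambiguous clips, which is one plausible reason a weighted scheme can outperform uniform averaging on long videos.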

Cite this article:

王峥, 赵新辉, 王小伟. Motion detection and recognition model based on fine-grained motion & situation fusion[J]. 电子测量技术 (Electronic Measurement Technology), 2025, 48(7): 75-85.

History
  • Online publication date: 2025-05-12