Spatial-temporal convolutional Transformer network for skeleton-based action recognition
Affiliation:

1. Zhengzhou Hengda Intelligent Control Technology Company Limited, Zhengzhou 450000, China; 2. School of Electrical Engineering and Automation, Henan Polytechnic University, Jiaozuo 454003, China; 3. Research Institute for Artificial Intelligence, Beihang University, Beijing 100191, China

CLC Number: TP391.4

    Abstract:

    In skeleton-based action recognition methods built on graph convolution, joint-feature modelling relies heavily on a hand-designed graph topology and lacks the ability to model global dependencies between joints. To address this issue, we propose a spatio-temporal convolutional Transformer network to model spatial and temporal joint features. For spatial joint-feature modelling, we propose a dynamic grouping decoupling Transformer that groups the input skeleton sequence along the channel dimension and dynamically generates a different attention matrix for each group, establishing global dependencies between joints without requiring prior knowledge of the human topology. For temporal joint-feature modelling, multi-scale temporal convolution is used to extract features of target behaviours at different scales. Finally, we propose a spatio-temporal channel joint attention module to further refine the extracted spatio-temporal features. The proposed method achieves Top-1 recognition accuracies of 92.5% and 89.3% under the cross-subject evaluation protocol on the NTU-RGB+D and NTU-RGB+D 120 datasets, respectively, demonstrating its effectiveness.
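    The core idea of the spatial branch described in the abstract can be illustrated with a minimal numpy sketch: channels are split into groups, and each group computes its own joint-to-joint attention matrix, so global dependencies between joints are derived from the data rather than from a fixed skeleton graph. The projection weights, group count, and feature shapes below are illustrative assumptions, not the paper's actual parameterization.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def grouped_joint_attention(x, groups, rng):
        """x: (V, C) features for V joints with C channels.
        Splits channels into `groups` groups and computes a separate
        (V, V) attention matrix per group, so joint dependencies are
        modelled globally and per channel group, with no predefined
        skeleton topology. Random weights stand in for learned ones."""
        V, C = x.shape
        d = C // groups
        outs = []
        for g in range(groups):
            xg = x[:, g * d:(g + 1) * d]                   # (V, d) features of group g
            Wq = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in for a learned query projection
            Wk = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in for a learned key projection
            q, k = xg @ Wq, xg @ Wk
            attn = softmax(q @ k.T / np.sqrt(d))           # (V, V) attention matrix for this group
            outs.append(attn @ xg)                         # aggregate features across all joints
        return np.concatenate(outs, axis=1)                # (V, C) recombined groups

    rng = np.random.default_rng(0)
    x = rng.standard_normal((25, 64))   # 25 joints (NTU skeleton), 64 channels
    y = grouped_joint_attention(x, groups=4, rng=rng)
    print(y.shape)  # (25, 64)
    ```

    Because each group has its own attention matrix, different channel groups can capture different joint-dependency patterns, which is the "decoupling" the abstract refers to.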

History
  • Online: April 24, 2024