Abstract: Accurately localizing and recognizing human motions in both the spatial and temporal dimensions is of significant importance for applications such as intelligent sports analysis. However, existing step-by-step human motion recognition methods are often limited by the fixed receptive field of RoI features, which makes effective modeling and semantic representation difficult in complex scenes. To address this issue, this paper proposes a fine-grained motion & situation fusion (FMSF) network that integrates human representation features and global spatiotemporal situation features through parallel semantic modeling and motion proposal units. The semantic modeling unit employs a human localization model to generate fine-grained human candidate features from key frames and leverages a 3D video backbone network to extract global spatiotemporal features. The motion proposal unit then uses a shared Transformer framework to jointly model these multi-modal features, capturing complex interactions between humans and their surroundings and yielding highly discriminative motion predictions. Furthermore, a weighted score aggregation strategy is introduced that integrates the motion classification results of multiple key frames and short video segments for long-video motion recognition. On the AVA-60 v2.2 dataset, the FMSF model achieves a frame-level mAP of 30.01%, while the long-video FMSF-Prolonged variant reaches 30.74%. On the Charades dataset, the mAP of FMSF rises to 30.68% and that of FMSF-Prolonged to 31.29%.
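To make the weighted score aggregation idea concrete, the following is a minimal sketch in Python. It assumes per-keyframe and per-segment class-score arrays and hypothetical weights `w_frame` and `w_segment`; the actual weighting scheme used by FMSF-Prolonged is defined later in the paper and may differ.

```python
import numpy as np

def aggregate_scores(keyframe_scores, segment_scores, w_frame=0.5, w_segment=0.5):
    """Hypothetical weighted aggregation of motion classification scores.

    keyframe_scores: array of shape (num_keyframes, num_classes)
    segment_scores:  array of shape (num_segments, num_classes)
    Returns a single (num_classes,) score vector for the long video.
    """
    frame_avg = np.mean(keyframe_scores, axis=0)    # average over key frames
    segment_avg = np.mean(segment_scores, axis=0)   # average over short segments
    return w_frame * frame_avg + w_segment * segment_avg

# Usage: 4 key frames and 3 short segments over 60 motion classes (illustrative only)
keyframes = np.random.rand(4, 60)
segments = np.random.rand(3, 60)
video_scores = aggregate_scores(keyframes, segments)
predicted_class = int(np.argmax(video_scores))
```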