Abstract: To address the insufficient spatio-temporal feature fusion and the underuse of rich skeleton information in existing action recognition algorithms, this paper proposes CC-DFARM, a dual-stream fusion action recognition model based on cross-modal synergetic perception. First, a dual-stream fusion model obtains global information from the two modalities by fusing the RGB video stream and the skeleton stream, realizing their complementary advantages. Second, a spatio-temporal interaction and attention enhancement module achieves deep synergy and dynamic complementarity of spatio-temporal features while dynamically enhancing the attention weights of the relevant spatio-temporal regions. Finally, a Multimodal Feature Fusion Module fuses the outputs of the RGB video stream and the skeleton stream, fully exploiting the complementary information between RGB visual appearance and human skeleton motion through adaptive weight assignment and cross-modal interaction, thereby improving action recognition accuracy. Extensive experiments show that CC-DFARM achieves high accuracy on the NTU RGB+D and NTU RGB+D 120 action recognition datasets, obtaining 97.2% and 92.3% respectively and improving on the baseline method MMTM by 3.6% and 3.2%. These results indicate that the model fully extracts and utilizes human skeleton information while thoroughly integrating spatio-temporal features to improve the accuracy of action recognition.
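The adaptive weight assignment mentioned in the fusion module can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of fixed scalar gate parameters, and the feature shapes are all assumptions; in the actual model the gates would be learned jointly with the network.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(rgb_feat, skel_feat, gate_params):
    """Adaptively weight and sum two modality features.

    rgb_feat, skel_feat: 1-D feature vectors of equal length
                         (RGB appearance and skeleton motion).
    gate_params: two scalars (learnable in a real model, fixed here)
                 turned into normalized fusion weights via softmax.
    """
    w = softmax(np.asarray(gate_params, dtype=float))
    return w[0] * rgb_feat + w[1] * skel_feat

rgb = np.array([1.0, 0.0, 2.0])
skel = np.array([0.0, 1.0, 2.0])
fused = fuse(rgb, skel, [0.0, 0.0])  # equal gates -> simple average
```

With equal gate parameters the fusion degenerates to an average of the two streams; as one gate grows, the corresponding modality dominates, which is the behavior an adaptive weighting scheme exploits to favor the more informative modality per sample.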