Abstract: Children with autism exhibit atypical visual attention in their early years, which provides an important distinguishing criterion for early intervention. Given that semantic alignment and dynamic interaction between modalities have received insufficient attention in autism research, this study proposes a multimodal model that fuses saliency-map features with eye-movement trajectory features, providing an objective method for the diagnosis of autism. The method constructs a dual-stream network architecture: a U-Net feature extractor processes the saliency map, while a temporal convolutional network performs temporal modeling of the eye-movement trajectory. A cross-modal attention mechanism is introduced to achieve dynamically weighted fusion of the two modalities. During temporal modeling, eye-movement trajectory prediction is carried out simultaneously, and the prediction error is incorporated as a discriminative feature in the classification stage to enhance the model's classification performance. Comparative experiments verify that the proposed model achieves an accuracy of 98.89% on the early autism screening task.
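To make the fusion step concrete, the sketch below illustrates one plausible form of the cross-modal attention described in the abstract: trajectory-stream features attend over saliency-stream features via scaled dot-product attention, and the attended output is concatenated with the original trajectory features for a downstream classifier. All shapes, names, and the concatenation-based fusion are illustrative assumptions, not the paper's exact implementation; NumPy stands in for a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, keys, values):
    """Scaled dot-product attention: one modality attends over the other."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)   # (T_q, T_k) similarity scores
    weights = softmax(scores, axis=-1)     # dynamic per-step modality weights
    return weights @ values                # (T_q, d) attended features

# Hypothetical feature shapes: 16 saliency patch embeddings (U-Net stream)
# and 32 trajectory time steps (TCN stream), projected to a shared 64-d space.
saliency_feats = rng.standard_normal((16, 64))
traj_feats = rng.standard_normal((32, 64))

# Trajectory features attend to saliency features; the two streams are then
# fused by concatenation before the classification head.
attended = cross_modal_attention(traj_feats, saliency_feats, saliency_feats)
fused = np.concatenate([traj_feats, attended], axis=-1)
print(fused.shape)  # (32, 128)
```

In a full model these projections and the attention weights would be learned end to end; the sketch only shows how attention yields a dynamically weighted combination of the saliency features for each trajectory time step.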