Abstract: Alzheimer's disease (AD) is a neurological disorder that primarily damages brain cells and is the most common form of dementia; because it is irreversible, early diagnosis is critical for slowing its progression. Structural magnetic resonance imaging (sMRI) and fluorodeoxyglucose positron emission tomography (FDG-PET) are two imaging techniques widely used in neurodegenerative disease research, and combining them to assess the brain state can improve diagnostic accuracy. In this paper, we propose a multimodal fusion framework based on the Vision Transformer: a self-attention Vision Transformer extracts features from each unimodal image, while an interactive attention fusion network attends to the similarity between the features of the two modalities, strengthening the independent representation of each modality and improving the interaction between the two modalities. A deep belief network then reduces the redundancy of the extracted features and enhances the complementary information across modalities, and an ensemble classifier produces the final AD classification. We evaluate the classification performance of the proposed network on the ADNI dataset, achieving an accuracy, sensitivity, and specificity of 94.65%, 93.24%, and 95.62%, respectively; the proposed method outperforms current fusion methods on the AD classification task.
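As a rough illustration of the pipeline described in the abstract, the PyTorch sketch below shows one plausible arrangement of the components: per-modality self-attention encoders standing in for the Vision Transformers, a bidirectional cross-attention step standing in for the interactive attention fusion network, and an MLP head standing in for the deep belief network and ensemble classifier. All module names, dimensions, and the use of `torch.nn.MultiheadAttention` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative interactive attention: each modality's tokens attend
    to the other modality's tokens, then the two streams are pooled."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.mri_to_pet = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pet_to_mri = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mri_tokens, pet_tokens):
        # MRI queries attend to PET keys/values, and vice versa.
        mri_fused, _ = self.mri_to_pet(mri_tokens, pet_tokens, pet_tokens)
        pet_fused, _ = self.pet_to_mri(pet_tokens, mri_tokens, mri_tokens)
        # Mean-pool each fused token sequence, then concatenate.
        return torch.cat([mri_fused.mean(dim=1), pet_fused.mean(dim=1)], dim=-1)

class MultimodalViTClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        # Stand-ins for the per-modality ViT encoders; a real implementation
        # would first patch-embed the 3D sMRI and FDG-PET volumes into tokens.
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.mri_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.pet_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.fusion = CrossAttentionFusion(dim)
        # Simple MLP head standing in for the DBN + ensemble classifier stage.
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_classes))

    def forward(self, mri_tokens, pet_tokens):
        return self.head(self.fusion(self.mri_encoder(mri_tokens),
                                     self.pet_encoder(pet_tokens)))

# Usage with dummy token sequences (batch of 2, 64 tokens each, dim 256):
model = MultimodalViTClassifier()
logits = model(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(logits.shape)  # torch.Size([2, 2])
```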