Abstract: Deep vein thrombosis (DVT) can lead to severe complications such as pulmonary embolism, which can be life-threatening. Early prediction of DVT risk is therefore of significant clinical importance. However, current DVT risk prediction methods rely mainly on a single modality, either text data or image data, and few studies integrate the two modalities. To address this gap, this study combines the Mamba state space model with multimodal fusion and proposes, for the first time, a DVT risk prediction method based on Mamba self-attention and multimodal fusion. The method takes the patient's ultrasound images and structured text data, such as medical history and laboratory test indicators, as multimodal input. First, a dual-channel feature encoding framework is constructed, in which a Vision Transformer (ViT) captures the features of the ultrasound images and a deep neural network (DNN) extracts the features of the structured clinical data. Second, a multimodal feature fusion framework based on Mamba self-attention is proposed. This framework concatenates the image and text features into joint features and then applies the original Mamba to the joint features to obtain the multimodal fusion features. Mamba self-attention, a feedforward network, and a CNN are then used to extract and fuse global and local, high-level and low-level features of the multimodal data, thereby preserving the original multimodal features from multiple perspectives. Finally, a multi-level MLP reduces the feature dimension and produces the DVT prediction. Comparative experiments against 13 other combined models were conducted on a clinical dataset. The proposed model outperforms the others, reaching an AUC of 0.912. Compared with models that use only structured data, the AUC improves by 11.97% on average and the F1 score by 13% on average.
Compared with traditional models that use only image data, the AUC improves by 14.7% on average, and accuracy and F1 score both increase by more than 20%. Among the multimodal comparison models, the proposed model outperforms the ResNet and Transformer fusion model (AUC = 0.871) by approximately 6% in accuracy, precision, recall, and F1 score. Compared with a Transformer hybrid model of the same structure, the AUC, the other four evaluation metrics, and the inference speed all improve by more than 20%. These results indicate that the proposed model provides strong support for the early prevention and prediction of DVT and has good application prospects and clinical value.
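The fusion pipeline described above (concatenate the two feature streams, pass the joint features through a state space model, then reduce with an MLP) can be illustrated with a minimal sketch. It is not the authors' implementation: the encoders are omitted and their outputs are replaced by random token vectors, a plain non-selective diagonal state space recurrence stands in for Mamba, concatenation along the sequence axis is an assumption, and all function names, dimensions, and parameter values (`ssm_scan`, `mlp_head`, `init_params`, `predict_risk`) are hypothetical.

```python
import math
import random


def ssm_scan(tokens, a, b, c):
    """Run a diagonal linear state-space recurrence over a token sequence.

    h_t = a * h_{t-1} + b * x_t and y_t = c * h_t, applied per channel.
    This is a plain, non-selective SSM, used here as a stand-in for Mamba.
    """
    dim = len(tokens[0])
    h = [0.0] * dim
    outputs = []
    for x in tokens:
        h = [a[i] * h[i] + b[i] * x[i] for i in range(dim)]
        outputs.append([c[i] * h[i] for i in range(dim)])
    return outputs


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b0
            for row, b0 in zip(weights, bias)]


def mlp_head(x, layers):
    """Multi-level MLP: ReLU between layers, sigmoid on the final logit."""
    for k, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return 1.0 / (1.0 + math.exp(-x[0]))


def init_params(dim, hidden, seed=0):
    """Random illustrative parameters (hypothetical; nothing is trained)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "a": [rng.uniform(0.5, 0.95) for _ in range(dim)],  # decay below 1
        "b": [rng.uniform(-1.0, 1.0) for _ in range(dim)],
        "c": [rng.uniform(-1.0, 1.0) for _ in range(dim)],
        "mlp": [(matrix(hidden, dim), [0.0] * hidden),
                (matrix(1, hidden), [0.0])],
    }


def predict_risk(image_tokens, clinical_tokens, params):
    """Concatenate both modalities, fuse with the SSM scan, then classify."""
    joint = image_tokens + clinical_tokens  # assumed: sequence-axis concat
    fused = ssm_scan(joint, params["a"], params["b"], params["c"])
    pooled = [sum(col) / len(fused) for col in zip(*fused)]  # mean over tokens
    return mlp_head(pooled, params["mlp"])


if __name__ == "__main__":
    rng = random.Random(1)
    dim = 4
    # Stand-ins for ViT image features and DNN clinical features.
    image_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
    clinical_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    risk = predict_risk(image_tokens, clinical_tokens, init_params(dim, hidden=8))
    print(f"predicted DVT risk: {risk:.3f}")
```

The recurrence processes the joint sequence in a single linear pass, which is the property that makes state space models attractive relative to quadratic attention; the selective (input-dependent) parameters of Mamba and the subsequent self-attention, feedforward, and CNN stages are left out of this sketch.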