Abstract: Accurate segmentation of the nasal septum is clinically valuable for disease assessment and surgical planning. However, existing methods based on convolutional neural networks (CNNs) are limited in their ability to represent global features. To address this, this study proposes CTA-Net, a model that learns local and global features jointly through a dual-branch encoder: a CNN branch captures fine anatomical details, a Transformer branch models long-range spatial dependencies, and a feature fusion module exchanges information between the two. In addition, a multi-scale feature attention mechanism in the bottleneck layer uses different receptive fields to strengthen the model's representation of complex anatomical structures. Experiments were conducted on three medical image datasets: a self-annotated clinical nasal septum dataset, ISIC 2018, and Kvasir. On the nasal septum segmentation task, the model achieves an IoU of 90.38% and a Dice coefficient of 94.94%. In cross-dataset experiments, the IoU for gastrointestinal endoscopic image segmentation reaches 76.17%, significantly outperforming other existing models and confirming the model's advantages in feature learning and generalization. By integrating local perception with global modeling, this study offers a new approach to medical image analysis and shows promise for intelligent diagnosis and treatment in otolaryngology.
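
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how IoU and the Dice coefficient are commonly computed for a pair of binary masks. It is an illustration only, not the evaluation code used in this study: the function name `iou_and_dice`, the list-of-lists mask format, and the convention of returning 1.0 when both masks are empty are all assumptions made for the example.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks.

    Both masks are equally sized 2D lists of 0/1 values
    (a hypothetical format chosen for this sketch).
    """
    intersection = 0
    pred_area = 0
    target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p:
                pred_area += 1     # foreground pixel in the prediction
            if t:
                target_area += 1   # foreground pixel in the ground truth

    union = pred_area + target_area - intersection
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0

    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


# Toy masks: 3 foreground pixels each, 2 of them shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Applying that identity to the reported nasal septum IoU of 90.38% gives roughly 94.95%, close to the reported Dice of 94.94%; a small gap of this kind can arise when the metrics are averaged per image, although the averaging scheme is not stated here.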