Abstract: Convolutional neural networks (CNNs) are limited in their ability to model long-range dependencies because their convolution operations are inherently local. In contrast, the vision Transformer explicitly models global dependencies through mechanisms such as self-attention. In surface defect detection tasks, especially in scenarios with complex background textures or diverse defect morphologies, the vision Transformer shows superior performance compared with CNNs. This article provides a comprehensive review of recent domestic and international research progress and challenges in surface defect detection based on the vision Transformer, focusing on two dimensions: technical advantages and application methodologies, and key challenges with their corresponding strategies. Firstly, the fundamental definition of surface defect detection is elucidated, and the technical characteristics and main challenges of this field are summarized. Secondly, the technical advantages and key challenges of the vision Transformer in the context of defect detection are analyzed. Subsequently, leveraging these technical strengths, typical applications in surface defect detection tasks are examined in detail, including handling complex texture background interference, achieving multimodal information fusion, and integrating local and global feature information through a modular design approach. Next, the article discusses the main optimization strategies and solutions adopted by the vision Transformer to address key challenges in surface defect detection, such as scarce sample data, high computational complexity, insufficient real-time performance, low training efficiency, and poor performance in detecting small targets.
Finally, future research directions and development trends of the vision Transformer in the field of surface defect detection are outlined, such as the development of transfer-learning-based pre-trained models and their deeper fusion with multimodal methodologies, among others.