Abstract: The integrity and freshness of sample data directly determine the generalization ability and prediction accuracy of machine learning models. As the core data source in open environments, the network can supply training samples with wide coverage and high timeliness. However, the dynamic nature, complexity, and scale of network data sources leave traditional acquisition methods facing low development efficiency and high maintenance cost. An analysis of current mainstream collection frameworks shows that Selenium offers higher development efficiency and stronger support for dynamic pages than the alternatives. This paper therefore proposes a design method for a lightweight scripting language for network sample data collection. Guided by the Chomsky hierarchy, it constructs a lightweight script syntax based on a Type-3 (regular) grammar and built on top of Selenium. To make the proposed syntax usable in practice, the paper implements a hierarchical syntax parser that supports multi-threaded asynchronous execution and dynamically translates scripts into standardized Selenium code. By abstracting the native Selenium API and binding mechanisms such as dynamic intelligent waiting, the scheme significantly reduces development and maintenance cost and improves developer efficiency. The approach is evaluated on acquisition tasks in two scenarios with different DOM structures. In the satellite orbit parameter (TLE) acquisition task, compared with a traditional Selenium implementation, the proposed scripting language reduces the amount of code by more than 85% and the maintenance cost after a page structure change by more than 70%, while the increase in average acquisition latency is negligible.
This research provides a lightweight solution for efficient data acquisition in highly dynamic network environments. The simplified scripting language is also better suited to future training and inference with large language models (LLMs), which could enable the automatic generation of sample data acquisition tasks.
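
To make the translation idea concrete, the following is a minimal sketch of how statements of a regular (Type-3) script grammar could be mapped to Selenium code with explicit waits. It is not the syntax or parser proposed in this paper: the statement forms `open`, `click`, and `read ... -> name`, the `translate` function, and the 10-second timeout are all hypothetical and chosen only for illustration. Because each statement form is regular, one regular expression per form is enough to recognize it.

```python
import re

# Hypothetical statement forms (assumptions, not the paper's syntax).
# Each rule pairs a regex with a function that emits one line of Selenium code.
RULES = [
    # open "URL"  ->  navigate to the page
    (re.compile(r'^open\s+"([^"]+)"$'),
     lambda m: f'driver.get({m.group(1)!r})'),
    # click "CSS"  ->  wait until the element is clickable, then click it
    (re.compile(r'^click\s+"([^"]+)"$'),
     lambda m: ('WebDriverWait(driver, 10).until('
                f'EC.element_to_be_clickable((By.CSS_SELECTOR, {m.group(1)!r}))).click()')),
    # read "CSS" -> name  ->  wait for the elements, then collect their text
    (re.compile(r'^read\s+"([^"]+)"\s*->\s*([A-Za-z_]\w*)$'),
     lambda m: (f'{m.group(2)} = [e.text for e in WebDriverWait(driver, 10).until('
                f'EC.presence_of_all_elements_located((By.CSS_SELECTOR, {m.group(1)!r})))]')),
]


def translate(script: str) -> list[str]:
    """Translate a script into a list of Selenium (Python) source lines."""
    out = []
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith('#'):  # skip blank lines and comments
            continue
        for pattern, emit in RULES:
            m = pattern.match(line)
            if m:
                out.append(emit(m))
                break
        else:
            # No rule matched: report the offending line instead of guessing.
            raise SyntaxError(f'line {lineno}: unrecognized statement: {line!r}')
    return out


if __name__ == '__main__':
    demo = '''
    # hypothetical TLE collection script
    open "https://example.com/tle"
    click "#load"
    read "pre.tle" -> tle_lines
    '''
    for generated in translate(demo):
        print(generated)
```

The generated lines assume that the surrounding program has already created `driver` and imported `WebDriverWait`, `expected_conditions as EC`, and `By` from Selenium. Binding an explicit wait to every element access in the translator, rather than leaving it to the script author, is one way to realize the "intelligent waiting" idea described above.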