Design and Implementation of a Lightweight Scripting Language for Network Sample Data Collection
DOI:
CSTR:
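Authors:
Affiliation:

School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China

Author biography:

Corresponding author:

CLC number:

TN911.73

Funding:

Supported by the National Key Research and Development Program of China (2021YFB2900800) and the Higher Education Discipline Innovation and Talent Introduction Program (111 Project) (D20031)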


Design and implementation of a lightweight script for network sample data collection
Author:
Affiliation:

School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China

Fund Project:

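    Abstract (Chinese version, translated):

    The completeness and freshness of sample data directly determine the generalization ability and prediction accuracy of machine learning models. As the core data source in open environments, the network can supply model training with samples that have broad coverage and high timeliness. However, the dynamism, complexity, and scale of network data sources mean that traditional collection methods suffer from low development efficiency and high maintenance cost. An analysis of the current mainstream collection frameworks shows that the Selenium framework offers higher development efficiency and stronger support for dynamic content than the others. This paper therefore proposes a design method for a lightweight scripting language aimed at network sample data collection: based on the Chomsky hierarchy, it constructs a lightweight script syntax from a Type-3 (regular) grammar, further optimizing on top of Selenium. So that the proposed syntax can be used in practice, the paper implements a layered syntax parser that supports multi-threaded asynchronous execution and dynamically translates scripts into standardized Selenium code. By abstracting the native Selenium API and binding mechanisms such as dynamic intelligent waiting, the scheme significantly lowers development and maintenance cost and improves developer efficiency. Collection tasks in two scenarios with differing DOM structures verify that, in a satellite orbital parameter (TLE) collection task and compared with a conventional Selenium solution, the proposed scripting language reduces code size by more than 85% and the maintenance cost after a page structure change by more than 70%, while the increase in average collection latency is negligible. This work provides a lightweight solution for efficient data capture in highly dynamic network environments; the simplified scripting language should also be better suited to training and inference with large language models (LLMs), enabling sample data collection tasks to be generated automatically.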

    Abstract:

    The completeness and freshness of sample data directly determine the generalization ability and prediction accuracy of machine learning models. As the core data source in open environments, the network can provide broad-coverage, highly real-time samples for model training. However, the dynamism, complexity, and scale of network data sources leave traditional collection methods facing low development efficiency and high maintenance costs. An analysis of current mainstream collection frameworks shows that Selenium offers higher development efficiency and stronger support for dynamic pages than the alternatives. This paper therefore proposes a design method for a lightweight scripting language for network sample data collection. Drawing on the Chomsky hierarchy, it builds a lightweight script syntax on a Type-3 (regular) grammar and further optimizes on top of Selenium. To make the proposed syntax usable in practice, the paper implements a layered syntax parser that supports multi-threaded asynchronous execution and dynamically translates scripts into standardized Selenium code. By abstracting the native Selenium API and binding mechanisms such as dynamic intelligent waiting, the scheme significantly reduces development and maintenance costs and improves developer efficiency. Collection tasks in two scenarios with different DOM structures verify that, in a satellite orbital parameter (TLE) collection task, the proposed scripting language reduces code size by more than 85% and the maintenance cost after a page structure change by more than 70% compared with a conventional Selenium implementation, while the increase in average collection latency is negligible. This research provides a lightweight solution for efficient data acquisition in highly dynamic network environments. The simplified scripting language should also be better suited to training and inference with large language models (LLMs), enabling the automatic generation of sample data collection tasks.
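
    The translation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual syntax, so the three commands below (`open`, `click`, `extract ... as ...`), the `transpile` function, and the 10-second timeout are all hypothetical. Each command is recognised by a single regular expression, which keeps the toy language within a regular (Type-3) grammar, and each emitted Selenium call is wrapped in an explicit wait, standing in for the "dynamic intelligent waiting" idea. The layered parser and multi-threaded execution mentioned in the abstract are not modelled here.

```python
import re

# Hypothetical command set: the paper's real syntax is not given in the abstract.
# One regular expression per command keeps the toy language regular (Type 3).
RULES = [
    (re.compile(r'^open\s+(?P<url>\S+)$'),
     lambda m: [f"driver.get({m['url']!r})"]),
    (re.compile(r'^click\s+"(?P<css>[^"]+)"$'),
     lambda m: [
         "WebDriverWait(driver, TIMEOUT).until("
         f"EC.element_to_be_clickable((By.CSS_SELECTOR, {m['css']!r}))).click()"
     ]),
    (re.compile(r'^extract\s+"(?P<css>[^"]+)"\s+as\s+(?P<name>[A-Za-z_]\w*)$'),
     lambda m: [
         # Wait for the element first, then read the text of every match.
         "WebDriverWait(driver, TIMEOUT).until("
         f"EC.presence_of_element_located((By.CSS_SELECTOR, {m['css']!r})))",
         f"result[{m['name']!r}] = [e.text for e in "
         f"driver.find_elements(By.CSS_SELECTOR, {m['css']!r})]",
     ]),
]

# Fixed preamble of the generated Selenium program.
HEADER = [
    "from selenium import webdriver",
    "from selenium.webdriver.common.by import By",
    "from selenium.webdriver.support import expected_conditions as EC",
    "from selenium.webdriver.support.ui import WebDriverWait",
    "",
    "TIMEOUT = 10  # seconds; assumed default",
    "driver = webdriver.Chrome()",
    "result = {}",
]


def transpile(script: str) -> str:
    """Translate a script in the toy language into Selenium Python source."""
    out = list(HEADER)
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        for pattern, emit in RULES:
            m = pattern.match(line)
            if m:
                out.extend(emit(m))
                break
        else:
            raise SyntaxError(f"line {lineno}: unrecognised command: {line!r}")
    out.append("driver.quit()")
    return "\n".join(out)


if __name__ == "__main__":
    # example.com and the selector are placeholders, not the paper's TLE source.
    print(transpile('open https://example.com/tle\nextract "pre.tle" as tle'))
```

    In this sketch a two-line script expands into a dozen lines of Selenium code, and a change in page structure only requires editing the selector in the script. That is the kind of saving the abstract reports, although the figures quoted there come from the authors' own implementation, not from this example.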

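Cite this article

CAO Bingyao, ZHANG Jihao. Design and implementation of a lightweight script for network sample data collection[J]. Electronic Measurement Technology, 2025, 48(19): 134-143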

History
  • Received:
  • Last revised:
  • Accepted:
  • Published online: 2025-12-01
  • Published: