Abstract
Automatic methods for extracting logistics vehicle and cargo source information are scarce, hand-built extraction rules are tedious to construct and hard to maintain, and existing logistics information extraction is redundant and inefficient. To address these problems, this paper exploits the structural characteristics of vehicle and cargo source pages: it identifies the main elements of a page by their tag paths and automatically constructs extraction rules from the elements' CSS class selectors, yielding a fully automatic method for building extraction templates based on tag paths and CSS selectors. The method automates, to a certain extent, the collection of logistics vehicle and cargo source information, reduces the cost of manually building wrappers, and preserves the accuracy of the extraction rules. A distributed crawler built on Scrapy-redis extracts the logistics information efficiently and stores the extracted data in a MongoDB database. Experiments show that the automatically generated extraction rules can replace manually constructed extraction templates, and that distributed extraction is markedly more efficient than traditional single-machine extraction.
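The abstract does not spell out the rule-construction algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: the most frequent tag path among text nodes is taken to be the page's main body (a repeated record list), and each distinct chain of class-bearing ancestors on that path becomes one CSS extraction rule. The names `TagPathCollector`, `build_rules`, and `SAMPLE_HTML` are hypothetical, and the standard-library `html.parser` stands in for whatever parser the authors used.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements that never receive a closing tag and so must not be pushed.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Record the tag path and a class-based CSS selector for each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements as (tag, class attribute)
        self.records = []  # (tag_path, css_selector, text) per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed children).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag_path = "/".join(tag for tag, _ in self.stack)
        # Descendant selector built from the ancestors that carry a class.
        parts = [
            tag + "".join("." + name for name in cls.split())
            for tag, cls in self.stack
            if cls
        ]
        if not self.stack[-1][1]:  # always end the selector on the text's element
            parts.append(self.stack[-1][0])
        self.records.append((tag_path, " ".join(parts), text))


def build_rules(html):
    """Return {css_selector: [sample texts]} for the dominant tag path.

    Assumption: the tag path shared by the most text nodes is the main body.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    if not collector.records:
        return {}
    counts = Counter(path for path, _, _ in collector.records)
    main_path = counts.most_common(1)[0][0]
    rules = defaultdict(list)
    for path, selector, text in collector.records:
        if path == main_path:
            rules[selector].append(text)
    return dict(rules)


# Hypothetical cargo-source listing page used only for illustration.
SAMPLE_HTML = """
<html><body>
<div class="nav"><a href="/">Home</a></div>
<ul class="list">
  <li class="item"><span class="from">Zhenjiang</span><span class="to">Shanghai</span><span class="goods">Steel</span></li>
  <li class="item"><span class="from">Nanjing</span><span class="to">Suzhou</span><span class="goods">Grain</span></li>
</ul>
<div class="footer"><p>Contact</p></div>
</body></html>
"""

if __name__ == "__main__":
    for selector, samples in build_rules(SAMPLE_HTML).items():
        print(selector, samples)
```

In a pipeline like the one the abstract describes, selectors of this kind could be handed to a crawler as its extraction template; navigation and footer text drop out here only because their tag paths occur less often than the record path, which is a heuristic rather than a guarantee.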
Authors
MA Han-da (马汉达)
CAO Rui (曹瑞)
XIE Shi-zhen (谢诗帧)
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China
Source
Information Technology (《信息技术》)
2018, No. 10, pp. 40-44 (5 pages)
Funding
Jiangsu University Student Practice and Innovation Training Program, 2017 (2017102-99330W)