Research and implementation of automatic collection system for logistics vehicle and cargo source information (物流车货源信息自动抽取系统研究与实现)

Cited by: 1
Abstract: Automatic methods for extracting logistics vehicle and cargo source information are scarce, hand-built extraction rules are tedious to construct and hard to maintain, and existing extraction of logistics information is redundant and inefficient. To address these problems, this paper uses the structural characteristics of vehicle and cargo source pages: it identifies the main elements of a page by tag path and automatically builds extraction rules from the CLASS selectors of those elements. The result is a method for constructing fully automatic extraction templates based on tag paths and CSS selectors. To a certain extent, the method automates the collection of logistics vehicle and cargo source information, reduces the cost of manually building wrappers, and ensures the accuracy of the extraction rules. A distributed crawler built on Scrapy-redis extracts the logistics information efficiently, and the extracted data is stored in a MongoDB database. Experiments show that the automatically generated extraction rules can replace manually constructed extraction templates, and that distributed extraction is markedly more efficient than traditional single-machine extraction.
Authors: MA Han-da (马汉达); CAO Rui (曹瑞); XIE Shi-zhen (谢诗帧) (School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China)
Source: Information Technology (《信息技术》), 2018, No. 10, pp. 40-44 (5 pages)
Funding: 2017 Jiangsu University Student Practice and Innovation Training Program (2017102-99330W)
Keywords: Web information extraction; distributed crawler; tag path; CSS selector
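
The abstract describes identifying the main page elements by tag path and then deriving extraction rules from their CLASS selectors. The record gives no algorithmic detail, so the following is only a minimal sketch of that general idea, not the authors' implementation. It assumes that the most frequently repeated (tag path, class) pair marks a data record and that repeated classed descendants are its fields. The names `TagPathCollector`, `build_extraction_rule`, `to_selector`, and the `SAMPLE` page are hypothetical, and only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Records a (tag path, class tuple) pair for every start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.records = []  # (path, classes) pairs in document order

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        classes = (dict(attrs).get("class") or "").split()
        self.records.append((path, tuple(classes)))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def to_selector(key):
    """Turn ('html/body/ul/li', ('cargo-item',)) into 'li.cargo-item'."""
    path, classes = key
    tag = path.rsplit("/", 1)[-1]
    return tag + "".join("." + c for c in classes)


def build_extraction_rule(html):
    """Guess a record selector and field selectors from repeated structure.

    Assumption of this sketch: the record is the most frequent classed
    (path, class) pair, with ties going to the shallowest path.
    """
    collector = TagPathCollector()
    collector.feed(html)
    counts = Counter(rec for rec in collector.records if rec[1])
    repeated = {key: n for key, n in counts.items() if n >= 2}
    if not repeated:
        return None
    record_key = max(repeated, key=lambda k: (repeated[k], -k[0].count("/")))
    prefix = record_key[0] + "/"
    fields = [to_selector(k) for k in repeated if k[0].startswith(prefix)]
    return {"record": to_selector(record_key), "fields": fields}


# Hypothetical cargo listing page used only for illustration.
SAMPLE = """
<html><body>
<div class="nav"><a class="link" href="/">Home</a></div>
<ul class="list">
  <li class="cargo-item"><span class="origin">Zhenjiang</span><span class="dest">Shanghai</span></li>
  <li class="cargo-item"><span class="origin">Nanjing</span><span class="dest">Suzhou</span></li>
  <li class="cargo-item"><span class="origin">Wuxi</span><span class="dest">Hangzhou</span></li>
</ul>
</body></html>
"""

if __name__ == "__main__":
    print(build_extraction_rule(SAMPLE))
```

On the sample page this yields `li.cargo-item` as the record selector and `span.origin`, `span.dest` as field selectors. In a pipeline like the one the abstract outlines, such selectors would presumably be handed to the crawler's page parser, with the extracted items then written to the database.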