Research and Implementation of an Automatic Extraction System for Logistics Vehicle and Cargo Source Information (cited by: 1)

Original title: 物流车货源信息自动抽取系统研究与实现

Abstract: To address the lack of automatic extraction methods for logistics vehicle and cargo source information, the tedium and maintenance burden of hand-built extraction rules, and the redundancy and low efficiency of logistics information extraction, this paper exploits the structural characteristics of vehicle and cargo source pages. It identifies the main elements of a page through tag paths and automatically builds extraction rules from the elements' CLASS selectors. On this basis it proposes a method for constructing fully automatic extraction templates based on tag paths and CSS selectors. The method automates the collection of logistics vehicle and cargo source information to a certain extent, reduces the cost of manually building wrappers, and preserves the accuracy of the extraction rules. A distributed crawler built on Scrapy-redis extracts the logistics information efficiently, and the extracted data is stored in a MongoDB database. Experiments show that the automatically generated extraction rules can replace manually constructed extraction templates, and that distributed extraction is markedly more efficient than traditional single-machine extraction.
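The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of combining tag paths with class-based CSS selectors; it is not the authors' method. It assumes that the record container is the most frequently repeated tag path (the shallowest one on ties) and that fields are classed descendants occurring once per record. The names `TagPathCollector`, `infer_rules`, and `SAMPLE_PAGE`, as well as the sample markup, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped entirely.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Count (tag path, tag, first class) for every classed element."""

    def __init__(self):
        super().__init__()
        self.stack = []         # tags that are currently open
        self.paths = Counter()  # (path, tag, class) -> occurrences

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        classes = (dict(attrs).get("class") or "").split()
        if classes:
            self.paths[("/".join(self.stack), tag, classes[0])] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def infer_rules(html):
    """Guess a record selector and field selectors from repeated tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    if not collector.paths:
        return None
    # Most repeated path wins; on ties, prefer the shallowest path.
    (path, tag, cls), count = max(
        collector.paths.items(),
        key=lambda item: (item[1], -item[0][0].count("/")),
    )
    if count < 2:
        return None  # nothing repeats, so no record list was found
    # Fields: classed descendants that appear once per record.
    fields = [
        f"{t}.{c}"
        for (p, t, c), n in collector.paths.items()
        if p.startswith(path + "/") and n == count
    ]
    return {"record": f"{tag}.{cls}", "fields": fields}


# Hypothetical cargo-listing page used only for illustration.
SAMPLE_PAGE = """
<html><body>
<div class="header">Cargo sources</div>
<ul class="list">
  <li class="item"><span class="origin">Zhenjiang</span><span class="dest">Shanghai</span></li>
  <li class="item"><span class="origin">Nanjing</span><span class="dest">Suzhou</span></li>
  <li class="item"><span class="origin">Wuxi</span><span class="dest">Hangzhou</span></li>
</ul>
</body></html>
"""

rules = infer_rules(SAMPLE_PAGE)
```

On the sample page, `rules["record"]` is `li.item` and `rules["fields"]` lists `span.origin` and `span.dest`. In a setup like the one the abstract describes, selectors of this kind would then be handed to the crawler as its extraction template.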

Authors: MA Han-da (马汉达); CAO Rui (曹瑞); XIE Shi-zhen (谢诗帧) (School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu Province, China)

Affiliation: School of Computer Science and Communication Engineering, Jiangsu University

Source: Information Technology (《信息技术》), 2018, No. 10, pp. 40-44 (5 pages)

Funding: 2017 Jiangsu University Student Practice and Innovation Training Program (2017102-99330W)

Keywords: Web information extraction; distributed crawler; tag path; CSS selector

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

