
Design and Implementation of a Web Information Extraction System Based on Heritrix and Jsoup (cited by: 2)
Abstract: Using the open-source tools Heritrix and Jsoup, this paper designs a general-purpose system for extracting product information from the Web, covering both the extraction and the storage of Web information. The system consists of three independent functional modules: a web page crawling module, an information extraction module, and a data storage module. The extraction algorithm was validated on real web pages. The experimental results show that the system achieves good recall and precision.
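The abstract reports recall and precision but this record does not give the formulas or the data. As a minimal sketch, assuming the usual definitions (precision = correct extractions / all extractions, recall = correct extractions / all expected records), the two measures could be computed as follows. The function name `precision_recall` and the sample product records are hypothetical and are not taken from the paper.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) of extracted records against a gold set.

    Assumes the standard definitions; the paper's exact evaluation
    procedure is not described in this record.
    """
    extracted = set(extracted)
    expected = set(expected)
    correct = len(extracted & expected)  # records extracted and also in the gold set
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical (name, price) records, for illustration only.
gold = {("Phone A", "1999"), ("Phone B", "2999"),
        ("Phone C", "3999"), ("Phone D", "999")}
found = {("Phone A", "1999"), ("Phone B", "2999"), ("Phone X", "10")}

p, r = precision_recall(found, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.67 recall=0.50"
```

Comparing whole records as tuples means a record counts as correct only if every field matches; a per-field comparison would be an equally reasonable choice.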
Source: Journal of Shandong Normal University (Natural Science), CAS, 2015, Issue 2, pp. 16-19 (4 pages)
Keywords: Web information extraction; HTML parser; Jsoup; Web crawler; Heritrix

