
Design of Crawler Based on HTML Parser Information Extraction (cited by: 7)
Abstract: Web crawler design is one of the core technologies of both general and vertical search engines. This paper studies in detail the web crawler of a life-themed vertical search engine, using HTMLParser-based information extraction. By analyzing the tree structure of the URLs on life-themed websites, the authors developed a simulated searcher that collects the URLs of seed pages. HTMLParser-based information extraction is then used to extract, from those seed pages, the target URLs related to the life theme. Experimental tests show a crawl precision of 93.552% and a crawl recall of 96.720%, indicating that the crawler is effective and meets the requirements of medium-scale, enterprise-level vertical search applications.
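The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses the HTMLParser library (htmlparser.sourceforge.net), whereas this sketch swaps in the `html.parser` module from Python's standard library. The names `LinkExtractor`, `extract_target_urls`, and `precision_recall`, the keyword-matching rule, and the sample URLs are all assumptions made for illustration. Precision and recall follow the standard set-based definitions, which may differ from the exact measures used in the paper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the seed page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_target_urls(seed_html, base_url, keywords):
    """Return de-duplicated URLs whose anchor text or URL contains a keyword.

    The keyword test is a hypothetical stand-in for the paper's topic filter.
    """
    parser = LinkExtractor(base_url)
    parser.feed(seed_html)
    parser.close()
    seen, targets = set(), []
    for url, text in parser.links:
        if url not in seen and any(k in text or k in url for k in keywords):
            seen.add(url)
            targets.append(url)
    return targets


def precision_recall(crawled, relevant):
    """Standard set-based precision and recall of crawled vs. relevant URLs."""
    crawled, relevant = set(crawled), set(relevant)
    hits = len(crawled & relevant)
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Given the HTML of a seed page, `extract_target_urls(seed_html, seed_url, keywords)` returns the candidate target URLs in page order, and `precision_recall` compares them with a manually labeled set of relevant URLs.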
Authors: 郑力明 (Zheng Liming), 易平 (Yi Ping)
Source: 《微计算机信息》 (Control & Automation), 2009, No. 15, pp. 123-124, 69 (3 pages)
Keywords: web crawler; vertical search engine; HTMLParser


