Design of Web Crawler for Deep Web Based on ID3 Algorithm

Cited by: 4
Abstract: To address the low information coverage of current Web information mining, this paper studies web crawler systems and proposes a configurable method, based on the ID3 classification algorithm, for collecting pages from the deep Web. The method analyzes, processes, and classifies the features of Web pages, extracts the forms that lead to deep Web pages, and submits those forms automatically with sensible queries so that pages can be retrieved at greater depth and breadth. Experiments show that the method effectively reduces the blind spots of existing search engines and improves search results.
Source: New Technology of Library and Information Service (《现代图书情报技术》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 6, pp. 41-45 (5 pages)
Keywords: Web crawler; Deep Web; ID3 algorithm
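
The abstract names ID3 as the page classifier but gives no detail, so the following is only a minimal sketch of the criterion ID3 uses to choose a split: information gain over entropy. The feature names (`has_form`, `has_text_input`, `has_password`), the toy pages, and the labels are hypothetical illustrations, not the paper's actual features or data. A full ID3 implementation would apply this selection recursively to build a decision tree.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """ID3 split rule: pick the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: one dict of page features per crawled page.
pages = [
    {"has_form": True,  "has_text_input": True,  "has_password": False},
    {"has_form": True,  "has_text_input": True,  "has_password": False},
    {"has_form": True,  "has_text_input": False, "has_password": False},
    {"has_form": True,  "has_text_input": True,  "has_password": True},
    {"has_form": False, "has_text_input": False, "has_password": False},
    {"has_form": False, "has_text_input": False, "has_password": False},
]
# Hypothetical labels: does the page expose a searchable deep Web form?
is_deep_entry = [True, True, False, False, False, False]

features = ["has_form", "has_text_input", "has_password"]
root = best_feature(pages, is_deep_entry, features)
```

On this toy data the root split falls on `has_text_input`, since it separates the labeled pages more cleanly than the other two features.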

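References (6)

  • 1 Cohen L. The Deep Web[EB/OL]. [2008-01-18]. http://www.internettutorials.net/deepweb.html.
  • 2 China Internet Network Information Center (CNNIC). Statistical Report on Internet Development in China[EB/OL]. http://www.cnnic.net.cn/download/2004/2004072002.pdf. Ministry of Information Industry of China, 2004.7.
  • 3 Lou Zhuonan, Wu Zhiqiang. An overview of recent foreign research on the invisible Web[J]. Library and Information Service (图书情报工作), 2004, 48(1): 102-104. Cited by: 8
  • 4 UC Berkeley - Teaching Library Internet Workshops. Invisible Web: What it is, Why it exists, How to find it, and Its inherent ambiguity[EB/OL]. [2008-01-18]. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html.
  • 5 Qu Kaishe, Cheng Wenli, Wang Junhong. An improved version of the ID3 algorithm[J]. Computer Engineering and Applications (计算机工程与应用), 2003, 39(25): 104-107. Cited by: 79
  • 6 Ma Yu, Wang Yougang. Research on the application of the ID3 algorithm[J]. Information Technology (信息技术), 2006, 30(12): 84-86. Cited by: 10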
