
Improvement of Focused Crawling Strategy Based on Genetic Algorithm (cited by: 4)
Abstract: To address the "topic drift" problem of focused crawlers and to fetch pages quickly, this paper proposes an improved focused crawling strategy based on a genetic algorithm. Building on existing genetic-algorithm crawling strategies, it introduces the PageRank algorithm and adjusts the way page topic relevance is computed. The genetic factors used during crawling are selected according to each page's computed PageRank and relevance values, and the fitness function is redefined. This ensures that superior genetic factors (pages that are both topic-relevant and important) are inherited preferentially, while reducing topic drift as the factors are passed on, so that both the importance and the topic relevance of the crawled pages improve. Compared with earlier genetic-algorithm-based strategies, the number of pages that are both topic-relevant and important increases by more than 5% without affecting recall.
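Source: Computer Simulation (《计算机仿真》), CSCD, Peking University Core Journal, 2010, No. 10, pp. 87-90, 123 (5 pages in total)
Funding: National Natural Science Foundation of China (60872115); Shanghai Municipal Education Commission Key Discipline Construction Project (J50104)
Keywords: focused crawler; PageRank algorithm; genetic algorithm; Web information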
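
The abstract says that pages are chosen as genetic factors by their PageRank and topic relevance, and that the fitness function is redefined, but this record does not give the formula. The following is only a minimal sketch under stated assumptions: the weighted sum in `fitness`, the weight `alpha`, the term-overlap `relevance` measure, and every function name are hypothetical illustrations, not the authors' method.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iters: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank over a small in-memory link graph."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


def relevance(text: str, topic_terms: List[str]) -> float:
    """Hypothetical relevance: fraction of topic terms present in the text."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for t in topic_terms if t.lower() in words) / len(topic_terms)


def fitness(rel: float, pr: float, max_pr: float, alpha: float = 0.7) -> float:
    """Assumed fitness: weighted sum of relevance and normalized PageRank."""
    pr_norm = pr / max_pr if max_pr > 0 else 0.0
    return alpha * rel + (1.0 - alpha) * pr_norm


def select_parents(pages: Dict[str, str], links: Dict[str, List[str]],
                   topic_terms: List[str], k: int,
                   alpha: float = 0.7) -> List[str]:
    """Return the k URLs with the highest fitness (the 'superior' factors)."""
    ranks = pagerank(links)
    max_pr = max(ranks.values())
    scored = {
        url: fitness(relevance(text, topic_terms), ranks.get(url, 0.0),
                     max_pr, alpha)
        for url, text in pages.items()
    }
    return sorted(scored, key=lambda u: scored[u], reverse=True)[:k]
```

In a crawler built along these lines, the URLs returned by `select_parents` would seed the next generation of the crawl frontier, so pages that are off-topic or unimportant are less likely to pass on their outgoing links. The crossover and mutation steps of the genetic algorithm are omitted because the abstract does not describe them.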
