Research on focused crawler based on topic-related concept and page segmentation (cited by: 9)
Abstract: To address the shortcomings of traditional focused crawlers, this paper proposes a focused crawler based on topic-related concepts and page segmentation. A set of topic-related concepts is first obtained from a topic category tree and then combined with a topic description document to build a topic vector that describes the topic. After a Web page is downloaded, page segmentation is applied so that the crawler can pass through "grey tunnels". The priority of each candidate link is computed with a strategy that combines text content and link structure, and an R-HITS algorithm, built on the HITS algorithm, is proposed to compute the contribution of link structure to that priority. Experimental results show that a focused crawler implemented with this method reaches a precision of 66% and a sum of information of 53%, and that it gives better search results in vertical search engine and public opinion analysis applications.
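Authors: Huang Ren (黄仁), Wang Liangwei (王良伟)
Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 8, pp. 2377-2380, 2409 (5 pages)
Funding: National Natural Science Foundation of China (71102065)
Keywords: focused crawler; topic-related concept; page segmentation; priority computation; R-HITS (relevant hyperlink-induced topic search)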
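
The abstract describes scoring candidate links by combining text content with link structure. The following is a minimal Python sketch of that general idea, not the authors' implementation: the abstract does not specify R-HITS, so plain HITS is used as a stand-in, and the function names, the bag-of-words block vector, and the weighting parameter `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hits(graph, iterations=20):
    """Plain HITS (a stand-in, NOT the paper's R-HITS).

    `graph` maps each page to the list of pages it links to.
    Returns (hub, authority) score dicts with L2-normalized values.
    """
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for u, outs in graph.items():
            for v in outs:
                auth[v] += hub[u]
        # Hub of a page: sum of authority scores of pages it links to.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize both score sets so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


def link_priority(block_text, topic_vector, authority, alpha=0.7):
    """Hypothetical priority: weighted mix of content and structure scores.

    `block_text` is the text of the page block containing the link,
    `authority` is a link-structure score in [0, 1], and `alpha` is an
    assumed weight, not a value taken from the paper.
    """
    block_vector = Counter(block_text.lower().split())
    return alpha * cosine(block_vector, topic_vector) + (1 - alpha) * authority
```

Scoring the block that surrounds a link, rather than the whole page, is what lets a relevant link on a mostly off-topic page still receive a high priority, which is the role the abstract assigns to page segmentation.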