
Expected Residual Energy Based Focused Crawling Method (基于预期剩余能量模型的聚焦爬行方法)
Abstract: Determining the search direction and depth is the core problem of focused crawling. To address it, this paper proposes the concept of a link's expected residual energy, together with a method for computing it. The method uses information from the current page to compute the link's immediate reward energy, and it iteratively updates the link's expected residual energy using the historical reward knowledge contributed by the different historical paths that reach the same link. With the expected residual energy serving as both the link priority and the search-depth limit, the paper designs a focused crawling algorithm based on the expected residual energy model and gives the implementation of its key modules. Experimental results show that the method is better at discovering topic-relevant websites.
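The abstract gives the idea but not the formulas, so the following is only a minimal sketch of how such a crawler might be organized, not the paper's actual algorithm. All names (`LinkState`, `immediate_reward`, `update_energy`, `crawl`), the blend weight `alpha`, the smoothing factor `decay`, the threshold `min_energy`, and the toy graph are hypothetical. The sketch assumes that the immediate reward blends the current page's relevance with the anchor-text relevance, that each path's estimate mixes this reward with the parent's energy, and that estimates from different paths to the same link are combined by a running average.

```python
import heapq
from dataclasses import dataclass


@dataclass
class LinkState:
    energy: float = 0.0  # current estimate of the expected residual energy
    visits: int = 0      # number of historical paths that have reached this link


def immediate_reward(page_relevance, anchor_relevance, alpha=0.5):
    """Assumed immediate reward: blend of current-page and anchor-text relevance."""
    return alpha * page_relevance + (1.0 - alpha) * anchor_relevance


def update_energy(state, parent_energy, reward, decay=0.5):
    """Fold one more path's estimate into the link's running average (assumed rule)."""
    path_estimate = (1.0 - decay) * reward + decay * parent_energy
    state.visits += 1
    state.energy += (path_estimate - state.energy) / state.visits
    return state.energy


def crawl(graph, relevance, seed, min_energy=0.2, max_pages=100):
    """Best-first crawl over an in-memory graph.

    graph: url -> list of (target_url, anchor_relevance)
    relevance: url -> relevance of that page to the topic, in [0, 1]
    """
    states = {seed: LinkState(energy=1.0, visits=1)}
    frontier = [(-1.0, seed)]  # max-heap via negated energy
    seen, visited = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in seen:
            continue  # stale duplicate entry for an already fetched link
        energy = states[url].energy
        if energy < min_energy:
            continue  # depth limit: residual energy is exhausted
        seen.add(url)
        visited.append(url)
        for target, anchor_rel in graph.get(url, []):
            if target in seen:
                continue
            reward = immediate_reward(relevance.get(url, 0.0), anchor_rel)
            state = states.setdefault(target, LinkState())
            new_energy = update_energy(state, energy, reward)
            heapq.heappush(frontier, (-new_energy, target))
    return visited


# Toy data (hypothetical): a relevant branch seed->a->c and an irrelevant one seed->b->d->e.
TOY_GRAPH = {
    "seed": [("a", 0.9), ("b", 0.1)],
    "a": [("c", 0.8)],
    "b": [("d", 0.0)],
    "d": [("e", 0.0)],
}
TOY_RELEVANCE = {"seed": 1.0, "a": 0.9, "b": 0.0, "c": 0.8, "d": 0.0, "e": 0.0}
```

On the toy data, `crawl(TOY_GRAPH, TOY_RELEVANCE, "seed")` fetches the relevant branch first and then follows the irrelevant branch only until its energy drops below `min_energy`, so page `e` is never fetched. Under these assumptions a single number per link plays both roles described in the abstract: ordering the frontier and cutting off unpromising paths.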
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal (北大核心), 2015, No. 24, pp. 120-125, 158 (7 pages)
Keywords: focused crawling; search direction; search depth; topic relevance; expected residual energy

