Expected Residual Energy Based Focused Crawling Method (基于预期剩余能量模型的聚焦爬行方法)

Abstract: Determining the search direction and search depth is the core problem of focused crawling. To address it, this paper proposes the concept of a link's expected residual energy and a method for computing it. The method uses the information on the current page to compute the immediate reward energy of each hyperlink, and then iteratively updates the link's expected residual energy using the historical reward knowledge contributed by the different historical paths that reach the same link. Using the expected residual energy as both the link priority and the search depth limit, the paper designs a focused crawling algorithm based on the expected residual energy model and gives the implementation of its key modules. Experimental results show that the method has a stronger ability to discover topic-relevant websites.
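The abstract does not give the energy formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a link's estimate along one path is a decayed share of the parent's energy plus the link's immediate reward, that estimates from different paths are blended with a fixed learning rate, and that links whose energy falls below a threshold are not followed (which acts as the depth limit). The names `Frontier`, `offer`, `pop` and the constants `DECAY`, `ALPHA`, `MIN_ENERGY` are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field

DECAY = 0.5       # assumed: share of a parent's energy passed on to its links
ALPHA = 0.5       # assumed: learning rate for blending in a new path's estimate
MIN_ENERGY = 0.1  # assumed: links below this energy are never followed


@dataclass
class Frontier:
    energy: dict = field(default_factory=dict)  # url -> expected residual energy
    heap: list = field(default_factory=list)    # (-energy, url); may hold stale entries
    visited: set = field(default_factory=set)

    def offer(self, url, parent_energy, immediate_reward):
        """Record one more path reaching `url` and update its energy estimate."""
        if url in self.visited:
            return
        estimate = DECAY * parent_energy + immediate_reward
        old = self.energy.get(url)
        # First path: take the estimate as is. Later paths: move toward it.
        new = estimate if old is None else old + ALPHA * (estimate - old)
        self.energy[url] = new
        heapq.heappush(self.heap, (-new, url))

    def pop(self):
        """Return (url, energy) for the best unvisited link, or None if none qualifies."""
        while self.heap:
            neg_energy, url = heapq.heappop(self.heap)
            if url in self.visited or -neg_energy != self.energy[url]:
                continue  # stale entry left over from an earlier update
            if -neg_energy < MIN_ENERGY:
                return None  # the best remaining link is too weak, so stop here
            self.visited.add(url)
            return url, -neg_energy
        return None
```

In a crawler loop, each fetched page would call `offer` once per outgoing link, passing the energy returned by `pop` for that page as `parent_energy`. The immediate reward itself (for example a topic relevance score of the anchor text) is left to the caller because the abstract does not specify how it is computed.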

Authors: Yin Wenke (尹文科), Zong Shiqiang (宗士强), Wang Heng (王珩)

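Affiliation: Key Laboratory of Information Systems Engineering, the 28th Research Institute of China Electronics Technology Group Corporation (中国电子科技集团公司第二十八研究所信息系统工程重点实验室)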

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2015, Issue 24, pp. 120-125, 158 (7 pages in total)

Keywords: focused crawling; search direction; search depth; topic relevance; expected residual energy

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
