Research on a Focused Search Technique Based on an Improved BFS (Best-First Search) Algorithm. Cited by: 1

An Improved Best-First Search Algorithm Based Focused Crawling Research


Abstract: This paper refines and reclassifies the feature factors that focused Web crawlers use to predict link priority, introduces two new features, harvest rate and media type, as evidence for judging relevance, and proposes an improved Best-First Search algorithm. The algorithm applies a "fine-grained" policy to filter out irrelevant Web pages and builds its link-priority formula from representative feature factors chosen from several perspectives, so that the topic of each link can be revealed and predicted comprehensively. A small-scale experiment comparing it with three other types of focused search algorithms shows that the improved algorithm performs better on harvest rate and on the average number of links submitted.

Author: Qiao Jianzhong (乔建忠)

Affiliation: Information Management Center, PLA Academy of Arts (解放军艺术学院信息管理中心)

Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, Peking University Core Journal, 2013, No. 7, pp. 28-35 (8 pages)

Keywords: focused crawling; search algorithm; Best-First Search algorithm; focused crawler; characteristic factor

Classification number: G254 [Cultural Science: Library Science]
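
The best-first strategy described in the abstract can be illustrated with a minimal sketch. It is not the paper's formula: the abstract does not give the feature list, the weights, or the filtering rule, so the feature names, the `WEIGHTS` values, and the `fetch` and `is_relevant` callbacks below are all hypothetical. The sketch only shows the general shape of the technique: a priority queue of unvisited links ordered by a weighted combination of feature scores, with the running harvest rate (relevant pages divided by pages fetched) fed back in as one of those features.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical weights; the abstract does not state the paper's actual values.
WEIGHTS: Dict[str, float] = {
    "anchor_text": 0.4,
    "parent_relevance": 0.3,
    "harvest_rate": 0.2,
    "media_type": 0.1,
}


def link_priority(features: Dict[str, float]) -> float:
    """Weighted sum of feature scores, each assumed to lie in [0, 1]."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def harvest_rate(relevant: int, fetched: int) -> float:
    """Fraction of fetched pages judged relevant (0.0 before any fetch)."""
    return relevant / fetched if fetched else 0.0


def best_first_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], List[Tuple[str, Dict[str, float]]]],
    is_relevant: Callable[[str], bool],
    max_pages: int,
) -> Tuple[List[str], float]:
    """Visit the highest-priority link first until max_pages are fetched."""
    seeds = list(seeds)
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant_pages: List[str] = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        out_links = fetch(url)  # [(target_url, feature_scores), ...]
        fetched += 1
        if is_relevant(url):
            relevant_pages.append(url)
        rate = harvest_rate(len(relevant_pages), fetched)
        for target, features in out_links:
            if target in seen:
                continue
            seen.add(target)
            # Assumption: the current harvest rate is used as a link feature.
            scored = dict(features, harvest_rate=rate)
            heapq.heappush(frontier, (-link_priority(scored), target))
    return relevant_pages, harvest_rate(len(relevant_pages), fetched)
```

In a real crawler, `fetch` would download the page and extract its links with their feature scores; here it is left as a callback so that the ordering logic can be exercised on a small in-memory graph.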


