Research on a focused crawler based on topic-related concepts and page segmentation (cited by: 9)


Abstract: To address the shortcomings of traditional focused crawlers, this paper proposes a focused crawler based on topic-related concepts and page segmentation. A set of topic-related concepts is first obtained from a topic category tree and then combined with a topic description document to build a topic vector that describes the topic. After a Web page is downloaded, page segmentation is applied so that the crawler can pass through "grey tunnels". The priority of candidate links is computed from both text content and link structure, and an R-HITS algorithm, built on the HITS algorithm, is proposed to compute the contribution of link structure to that priority. Experimental results show that a focused crawler implemented with this method reaches a precision of 66% and a sum of information of 53%, and that it gives better search results in vertical search engine and public opinion analysis applications.
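The abstract does not give the formulas behind the priority computation, so the following is only a minimal sketch of the general idea of blending a content score with a link-structure score. It uses plain HITS rather than the paper's R-HITS, whose details are not stated here. The function names, the whitespace tokenizer, the linear blend, and the weight `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hits(graph, iterations=20):
    """Plain HITS (not R-HITS) on {page: [linked pages]}.

    Returns (authority, hub) dicts, each L2-normalized per iteration.
    """
    nodes = set(graph) | {d for targets in graph.values() for d in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub: sum of authority scores of the pages this page links to.
        new_hub = {n: sum(auth[d] for d in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return auth, hub


def link_priority(block_text, topic_vector, authority, alpha=0.5):
    """Hypothetical linear blend of content and link-structure scores.

    block_text: text of the page block that contains the candidate link.
    alpha: assumed weight of the content score (not taken from the paper).
    """
    block_vector = Counter(block_text.lower().split())
    content = cosine_similarity(block_vector, topic_vector)
    return alpha * content + (1.0 - alpha) * authority


if __name__ == "__main__":
    topic = {"crawler": 1.0, "focused": 0.8, "topic": 0.6}
    graph = {"a": ["b", "c"], "b": ["c"], "c": []}
    auth, _hub = hits(graph)
    print(link_priority("focused crawler survey", topic, auth["c"]))
```

Scoring the text of the block around a link, rather than the whole page, is what lets a link inside a relevant block of an otherwise off-topic page keep a high priority; a real crawler would replace the whitespace tokenizer with proper word segmentation and term weighting.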

Authors: HUANG Ren, WANG Liang-wei

Affiliation: College of Computer Science, Chongqing University

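Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 8, pp. 2377-2380, 2409 (5 pages)

Funding: National Natural Science Foundation of China (71102065)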

Keywords: focused crawler; topic-related concept; page segmentation; priority computation; R-HITS (relevant hyperlink-induced topic search)

Classification: TP391.3 [Automation and Computer Technology: Computer Application Technology]


