Focused Web Entity Search Using the Linked-Path Prediction Model (cited by: 1)

Abstract: Entity search is a promising research topic because it gives users more detailed Web information. A key problem in entity search is collecting, quickly and completely, the Web pages that contain entities of a specific domain. To address this problem, a website is modeled as a graph over a set of interconnected states, and a link-path prediction learning algorithm, LPC, is proposed. LPC learns the optimal link paths leading from the home page of a large website to the goal pages in which entities are embedded, and so guides the crawler to those pages quickly. The algorithm has two stages. In the first stage, a conditional random field (CRF), an undirected probabilistic graphical model, learns the sequential link patterns leading from the home page to goal pages; the CRF can combine a variety of features extracted from hyperlinks and HTML pages, including state features and transition features. In the second stage, the hyperlinks in the crawling frontier queue are prioritized using reinforcement learning together with the trained CRF: a discounted-reward method from reinforcement learning computes each link's reward value from the CRF model learned in the path classification stage. Experiments on large amounts of real data from several domains show that the CRF-guided LPC crawler clearly outperforms other focused crawlers.
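The second stage described in the abstract, ranking frontier links by a discounted reward, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the reward formula, so `discounted_reward`, the discount factor `gamma`, the `Frontier` class, and the `score_link` callback (a stand-in for the trained CRF) are all hypothetical names and assumptions introduced here for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def discounted_reward(goal_probs: List[float], gamma: float = 0.5) -> float:
    """Assumed reward: sum of gamma**k * p_k, where p_k is an estimated
    probability of being on a goal page k steps after following a link.
    The paper's actual formula is not stated in the abstract."""
    return sum((gamma ** k) * p for k, p in enumerate(goal_probs))


class Frontier:
    """Priority queue of URLs; the highest score is popped first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def push(self, url: str, score: float) -> None:
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> Tuple[str, float]:
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)


def crawl(start: str,
          links: Dict[str, List[str]],
          score_link: Callable[[str, str], float],
          budget: int) -> List[str]:
    """Visit up to `budget` pages, always expanding the best-scored link.
    `links` is a toy in-memory site graph; `score_link` stands in for the
    CRF-based scorer, which is not reproduced here."""
    frontier = Frontier()
    frontier.push(start, 0.0)
    seen = {start}
    visited: List[str] = []
    while frontier and len(visited) < budget:
        url, _ = frontier.pop()
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.push(nxt, score_link(url, nxt))
    return visited
```

With a toy site graph in which the `products` link scores higher than `about`, `crawl` would follow `home`, then `products`, then the item page beneath it before spending any budget on `about`. That ordering is the behavior the abstract attributes to a path-aware focused crawler, although the real scores would come from the trained CRF rather than a fixed table.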

Authors: Huang Jianbin (黄健斌), Sun Heli (孙鹤立)

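Affiliations: National Pilot School of Software, Xidian University; Department of Computer Science and Technology, Xi'an Jiaotong University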

Source: Journal of Computer Research and Development (计算机研究与发展), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, No. 12, pp. 2059-2066 (8 pages)

Funding: Natural Science Basic Research Plan of Shaanxi Province (SJ08-ZT14)

Keywords: entity search; focused Web crawling; linked-path prediction; conditional random field; reinforcement learning

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]


