Focused Web Entity Search Using the Linked-Path Prediction Model (cited by: 1)
Abstract: Entity search is a promising research area because it can give users more detailed Web information. A key problem in entity search is collecting, quickly and completely, the Web pages that contain entities of a specific domain. To address this problem, a website is modeled as a graph of interconnected states, and a link-path prediction learning algorithm, LPC, is proposed. LPC learns the optimal link paths that lead from the home page of a large website to its goal pages, and so guides the crawler quickly to the goal pages in which Web entities are embedded. The algorithm has two stages. In the first stage, it uses a conditional random field (CRF), a probabilistic undirected graphical model, to learn a model of the link paths from the home page to goal pages; the CRF can combine a variety of features extracted from hyperlinks and HTML pages, including state features and transition features. In the second stage, the hyperlinks in the crawl frontier queue are given priority scores using reinforcement learning together with the trained CRF model: a discounted-reward method from reinforcement learning computes the reward value of each link using the CRF model learned in the path classification stage. Experimental results on a large amount of real data from multiple domains show that LPC, guided by the CRF model, clearly outperforms other focused crawling algorithms.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2010, No. 12, pp. 2059-2066 (8 pages)
Funding: Natural Science Basic Research Program of Shaanxi Province (SJ08-ZT14)
Keywords: entity search; focused Web crawling; linked-path prediction; conditional random field; reinforcement learning
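
The second stage described in the abstract, scoring frontier links with a discounted reward, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract gives no formula, so the discount factor `GAMMA`, the helper `discounted_reward`, the `Frontier` class, and the example URLs are all hypothetical. The per-hop probabilities passed in stand for whatever a trained path model (a CRF in the paper) might estimate; here they are hard-coded inputs.

```python
import heapq
from typing import List, Tuple

GAMMA = 0.5  # assumed discount factor; the abstract does not state a value


def discounted_reward(goal_probs: List[float], gamma: float = GAMMA) -> float:
    """Return sum(gamma**k * p_k) over hops k = 0, 1, 2, ...

    p_k is a hypothetical model estimate that a goal page is reached
    k hops after following the link. Nearer goals count for more.
    """
    return sum((gamma ** k) * p for k, p in enumerate(goal_probs))


class Frontier:
    """Crawl frontier that always pops the link with the highest score."""

    def __init__(self) -> None:
        # heapq is a min-heap, so scores are stored negated.
        self._heap: List[Tuple[float, int, str]] = []
        self._count = 0  # insertion counter used as a tie-breaker

    def push(self, url: str, goal_probs: List[float]) -> None:
        score = discounted_reward(goal_probs)
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self) -> Tuple[str, float]:
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    # Hypothetical links with made-up per-hop goal probabilities.
    frontier.push("/news", [0.1, 0.1])
    frontier.push("/departments", [0.0, 0.0, 0.9])
    frontier.push("/people", [0.0, 0.8])
    while len(frontier):
        print(frontier.pop())
```

In this sketch a link whose goal page is likely one hop away (`/people`) is crawled before a link whose goal page is almost certain but two hops away (`/departments`), which is the trade-off a discount factor is meant to express.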