
Entry page search algorithm based on URL-type prior probabilities

Cited by: 1
Abstract: Entry page (home page) retrieval aims to return a single correct document, and queries are usually short page names. Because the task demands higher precision than ordinary retrieval, it is generally considered difficult. Using a unigram language model analysis, the authors identify the content fields of a Web page that are useful for finding Chinese entry pages and use them for baseline retrieval. They then build a retrieval model that combines these content fields with non-content features of Web pages, such as the URL-type prior, which is reported to have the strongest predictive power. Statistics on the URL-type prior probabilities reveal a strong relationship between an entry page and its relevant sub-pages. Based on this relationship, the authors propose PERS (page extracted from relevant sub-page), an algorithm that extracts the entry page from relevant sub-pages and re-ranks the results. Comparative experiments show that PERS substantially improves entry page retrieval performance.
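Source: Journal of Shandong University (Natural Science), indexed in CAS, CSCD, and the Peking University Core Journals list, 2006, No. 3, pp. 63-67 (5 pages)
Funding: Supported by the CNGI project of the National Development and Reform Commission (CNGI-04-12-2A)
Keywords: entry page retrieval; URL-type prior; information retrieval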
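
The abstract does not spell out how URL types are assigned or how the prior is estimated, so the following is only a minimal sketch of the general idea behind a URL-type prior. It assumes the four-way scheme commonly used in the entry page search literature (ROOT, SUBROOT, PATH, FILE); the paper may define its types differently. The names `url_type`, `estimate_priors`, `rerank`, and `INDEX_NAMES`, the "a dot means a file" heuristic, and the multiplicative combination with a content score are all illustrative assumptions. PERS itself is not implemented here, because the abstract gives no details of its steps.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: these file names are treated as the default page of a directory.
INDEX_NAMES = {"index.html", "index.htm"}


def url_type(url):
    """Classify a URL as ROOT, SUBROOT, PATH, or FILE from the shape of its path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[-1].lower() in INDEX_NAMES:
        # A trailing index file counts as the directory that contains it.
        segments.pop()
    elif segments and "." in segments[-1]:
        # Heuristic (assumption): a last segment with a dot is a file name.
        return "FILE"
    if not segments:
        return "ROOT"
    return "SUBROOT" if len(segments) == 1 else "PATH"


def estimate_priors(labeled):
    """Estimate P(entry page | URL type) from (url, is_entry_page) pairs."""
    totals, entries = Counter(), Counter()
    for url, is_entry in labeled:
        kind = url_type(url)
        totals[kind] += 1
        entries[kind] += int(is_entry)
    return {kind: entries[kind] / totals[kind] for kind in totals}


def rerank(candidates, priors):
    """Re-rank (url, content_score) pairs by content_score * URL-type prior.

    Types missing from `priors` get a prior of 0.0 (no smoothing in this sketch).
    """
    def score(item):
        url, content_score = item
        return content_score * priors.get(url_type(url), 0.0)

    return sorted(candidates, key=score, reverse=True)
```

In use, `estimate_priors` would be run on a training set of URLs labeled as entry pages or not, and `rerank` would be applied to the baseline content-field ranking. A real system would likely smooth the priors so that an unseen URL type does not zero out a document's score.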


