Entry page search algorithm based on URL-type prior probabilities (基于URL类型优先级的入口页面查询算法)

Cited by: 1

Abstract: Entry page (home page) retrieval aims to return a single correct document, and queries are usually short Web page names. Because the task demands high precision at the top of the ranking, it is generally considered difficult. Using a unigram language model, the authors identify the content fields of a Web page that are useful for finding Chinese entry pages and use them as the basis for baseline retrieval. They then build a retrieval model that combines these content fields with non-content page features, such as the URL-type prior, which proved to have the strongest predictive power. Statistics on the URL-type prior probabilities reveal a strong relationship between an entry page and its relevant sub-pages. Based on this relationship, the authors propose PERS (page extracted from relevant sub-page), an algorithm that extracts entry pages from relevant sub-pages and re-ranks the results. Comparative experiments show that PERS considerably improves entry page retrieval performance.
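The combination of a content score with a URL-type prior can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the four URL categories (root, subroot, path, file) described by Kraaij, Westerveld, and Hiemstra (SIGIR 2002), and the prior values, the `INDEX_NAMES` set, and the function names `url_type` and `rerank` are hypothetical. The content score is taken to be a log-likelihood from some language model, so the combined score is the log content score plus the log prior. The PERS step itself is not shown.

```python
import math
from urllib.parse import urlparse

# File names treated as directory indexes (an assumption for this sketch).
INDEX_NAMES = {"index.html", "index.htm", "default.html", "default.htm"}

# Hypothetical prior probabilities that a page of each URL type is an
# entry page. Real values would be estimated from labelled training data.
URL_TYPE_PRIOR = {"root": 0.60, "subroot": 0.25, "path": 0.10, "file": 0.05}


def url_type(url: str) -> str:
    """Classify a URL as 'root', 'subroot', 'path', or 'file'."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    is_directory = path == "" or path.endswith("/")
    # An index file is treated as the directory that contains it.
    if segments and segments[-1].lower() in INDEX_NAMES:
        segments.pop()
        is_directory = True
    if not segments:
        return "root"      # host name only
    if not is_directory:
        return "file"      # ends in a file name
    return "subroot" if len(segments) == 1 else "path"


def rerank(results):
    """Re-rank (url, log_content_score) pairs by adding the log URL-type prior."""
    scored = [
        (url, log_score + math.log(URL_TYPE_PRIOR[url_type(url)]))
        for url, log_score in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up baseline scores for two hypothetical pages.
    hits = [
        ("http://www.example.edu/news/2006/item.html", -2.0),
        ("http://www.example.edu/", -2.5),
    ]
    for url, score in rerank(hits):
        print(f"{score:.3f}  {url_type(url):8s} {url}")
```

With these made-up numbers the root page overtakes the deeper file page even though its content score is slightly lower, which is the effect a URL-type prior is meant to produce.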

Authors: Hu Jungang (胡俊刚), Dong Shoubin (董守斌), Chen Xiaozhi (陈晓志), Zhang Yuanfeng (张元丰)

Affiliation: Guangdong Key Laboratory of Computer Network, South China University of Technology (华南理工大学广东省计算机网络重点实验室)

Source: Journal of Shandong University (Natural Science) (《山东大学学报（理学版）》), 2006, No. 3, pp. 63-67 (5 pages). Indexed in CAS, CSCD, and the Peking University core journal list.

Funding: Supported by a CNGI project of the National Development and Reform Commission (CNGI-04-12-2A).

Keywords: entry page retrieval; URL-type prior; information retrieval

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

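Citing literature (1)

1. Chen Xiaozhi, Dong Shoubin, Zhang Ling, Zhang Yuanfeng. An information crawling update algorithm based on URL type and Web page link changes [J]. Journal of Zhengzhou University (Natural Science Edition), 2007, 39(2): 60-64. Cited by: 1.

Second-level citing literature (1)

1. Jiang Bo, Ding Yuewei. Information collection based on constrained tree edit distance and navigation tree [J]. Computer Engineering, 2009, 35(14): 75-77. Cited by: 9.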
