Abstract
Entry page (home page) retrieval aims to return the single correct page, and user queries are usually short page names. Because the task demands high precision, it is generally considered difficult. Based on an analysis with a unigram language model, we identify the page fields that are useful for Chinese entry page retrieval and use them as the content fields for baseline retrieval. Because non-content features of Web pages are also important (e.g. the URL-type prior, reported to have the strongest predictive power), we build a retrieval model that combines content fields with non-content features. Statistics on the URL-type prior probabilities reveal a strong relationship between an entry page and its relevant sub-pages. Based on this relationship, we propose PERS (page extracted from relevant sub-page), an algorithm that extracts the entry page from its relevant sub-pages and re-ranks the results. Comparative experiments show that PERS substantially improves entry page retrieval performance.
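The combination of a content score with a URL-type prior can be illustrated with a minimal sketch. This is not the authors' implementation: the four URL categories (root, subroot, path, file), the Jelinek-Mercer smoothing, the function names, and the prior values in the usage note are all assumptions made for illustration, and the abstract does not specify any of them. Terms are assumed to be already segmented, which Chinese text would require.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def url_type(url):
    """Classify a URL as 'root', 'subroot', 'path', or 'file'.

    Assumed four-way scheme; the abstract only names "URL-type".
    A last segment containing a dot is treated as a file name, and
    index/default pages are treated like their parent directory.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and "." in segments[-1]:
        name = segments[-1].rsplit(".", 1)[0].lower()
        if name not in ("index", "default"):
            return "file"
        segments = segments[:-1]
    if not segments:
        return "root"
    return "subroot" if len(segments) == 1 else "path"


def log_query_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.5):
    """Unigram query log-likelihood with Jelinek-Mercer smoothing (assumed)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    total = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts.get(term, 0) / coll_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in document and collection
        total += math.log(p)
    return total


def score_page(url, query_terms, doc_terms, coll_counts, coll_len, type_prior):
    """Content score plus log prior of the page's URL type.

    type_prior maps each URL type to a probability estimated elsewhere,
    for example from labelled entry pages in training data.
    """
    content = log_query_likelihood(query_terms, doc_terms, coll_counts, coll_len)
    return content + math.log(type_prior[url_type(url)])
```

With a hypothetical prior such as `{"root": 0.6, "subroot": 0.2, "path": 0.1, "file": 0.1}`, two pages with identical content scores are ordered by URL type, so a root URL ranks above a deep file URL. Real prior values would have to be estimated from data.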
Source
《山东大学学报(理学版)》
CAS
CSCD
PKU Core (北大核心)
2006, No. 3, pp. 63-67 (5 pages)
Journal of Shandong University (Natural Science)
Funding
Supported by the CNGI project of the National Development and Reform Commission (CNGI-04-12-2A)