Abstract
The sheer volume of information on the Internet has made efficient retrieval a topic of interest for both academia and industry. To improve search efficiency and the accuracy of search-engine ranking, this paper proposes a method for computing the theoretical degree of user attention to a webpage. Taking news pages from Chinese forums as the object of study and using the user search data provided by Baidu Index, the method computes content-oriented theoretical user attention through steps such as main-text extraction, feature-term extraction, and attention calculation. An experiment and regression analysis were conducted on 150 webpages. The results indicate that the optimal number of extracted feature terms is 3 and that the correlation coefficient between the theoretical degree of user attention and the actual degree of user attention (page views, i.e. click counts) exceeds 0.8, suggesting that the method is reasonably accurate.
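The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual extraction or scoring formulas, so everything below is an assumption made for illustration: feature terms are taken to be the most frequent tokens, the attention score is taken to be the sum of per-term index values, and `index_lookup` is a hypothetical in-memory table standing in for Baidu Index data. The token lists, index values, and page-view counts are invented and do not reproduce the paper's data.

```python
from collections import Counter
from math import sqrt


def top_k_terms(tokens, k=3):
    """Pick the k most frequent tokens (an assumed stand-in for feature-term extraction)."""
    return [term for term, _ in Counter(tokens).most_common(k)]


def theoretical_attention(tokens, index_lookup, k=3):
    """Sum the index values of the top-k terms (assumed aggregation rule).

    Terms missing from the lookup table contribute 0.
    """
    return sum(index_lookup.get(term, 0) for term in top_k_terms(tokens, k))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical search-index values per term (invented, not real Baidu Index data).
index_lookup = {
    "earthquake": 900, "rescue": 400,
    "housing": 700, "policy": 300,
    "football": 200, "match": 100,
}

# Invented pages: (tokens of the extracted main text, observed page views).
pages = [
    (["earthquake", "rescue", "earthquake", "rescue", "site"], 1500),
    (["housing", "policy", "housing", "price"], 1100),
    (["football", "match", "football"], 250),
]

scores = [theoretical_attention(tokens, index_lookup, k=3) for tokens, _ in pages]
views = [page_views for _, page_views in pages]
r = pearson(scores, views)
print(scores, round(r, 3))
```

The parameter `k=3` mirrors the optimal number of feature terms reported in the abstract; the correlation computed here only shows how theoretical scores would be compared against click counts and says nothing about the strength of the relationship in the paper's 150-page sample.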
Source
《情报学报》
CSSCI
Peking University Core Journal
2012, No. 8, pp. 837-845 (9 pages)
Journal of the China Society for Scientific and Technical Information
Funding
National Natural Science Foundation of China (Grant No. 70971099)
Fundamental Research Funds for the Central Universities
Keywords
degree of user attention
Baidu Index
webpage feature terms
regression analysis