Abstract
Removing duplicate Web pages improves search engine performance, is of considerable practical importance, and has been an active research topic in information retrieval in recent years. This paper analyzes existing duplicate Web page detection algorithms and improves the classic DSC (digital syntactic clustering) algorithm: a set of feature vectors is generated for each document and used to select shingles, after which the similarity of the documents is compared. Experiments show that the algorithm achieves good precision and recall in identifying duplicate Web pages.
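The pipeline the abstract describes (shingling, selecting shingles with document features, then comparing similarity) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how the feature vectors are built or how they filter shingles, so the rule in `select_shingles` (keep a shingle if it contains at least one feature term) and all function and parameter names are assumptions made for the example. Similarity is measured here with the standard Jaccard resemblance of two shingle sets.

```python
def shingles(tokens, w=3):
    """Return the set of contiguous w-token shingles of a token list."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def select_shingles(shingle_set, feature_terms):
    """Keep shingles containing at least one feature term.

    Assumed selection rule for illustration only; the paper's actual
    feature-vector filter is not given in the abstract.
    """
    return {s for s in shingle_set if any(t in feature_terms for t in s)}


def resemblance(a, b):
    """Jaccard resemblance |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps".split()
doc_b = "the quick brown dog jumps".split()

sh_a = shingles(doc_a, w=2)
sh_b = shingles(doc_b, w=2)

# Similarity over all shingles, then over the feature-selected subset.
full_score = resemblance(sh_a, sh_b)
features = {"fox", "dog"}  # hypothetical feature terms
selected_score = resemblance(select_shingles(sh_a, features),
                             select_shingles(sh_b, features))
```

In a real system the feature terms would come from some term-weighting step, and two pages would be declared duplicates when the resemblance exceeds a chosen threshold; both choices are left open here because the abstract does not specify them.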
Source
《科学技术与工程》
PKU Core Journal (北大核心)
2013, Issue 8, pp. 2250-2253 (4 pages)
Science Technology and Engineering
Funding
Supported by the National Natural Science Foundation of China (60970022)
Keywords
search engine
duplicate Web page detection
feature item
shingle