Abstract
The large number of duplicate web pages on the Internet seriously degrades the quality of information retrieval. This paper therefore proposes a web page deduplication algorithm based on signature hashing: each page's feature sentence set is hashed by signature (feature code) to narrow the range of pages that must be compared, which improves the accuracy of deduplication. Experiments show that the algorithm achieves a high accuracy rate and good performance. Based on this algorithm, web page deduplication was implemented in a Web-based enterprise competitive intelligence platform.
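The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea (bucket pages by a signature of their feature sentences, then compare only pages that share a bucket). Everything specific here is an assumption made for illustration and is not taken from the paper: choosing the longest sentences as the feature sentence set, using a truncated MD5 digest as the signature, the overlap threshold, and the names `feature_sentences`, `signature`, and `find_duplicates`.

```python
import hashlib
import re
from collections import defaultdict

# Assumed sentence delimiters (Chinese and Western punctuation, newlines).
SENTENCE_SPLIT = re.compile(r"[。!?!?.;;\n]+")


def feature_sentences(text, k=3, min_len=8):
    """Pick up to k of the longest sentences as the feature sentence set.

    The 'longest sentences' rule is an illustrative assumption.
    """
    sentences = {s.strip() for s in SENTENCE_SPLIT.split(text)}
    candidates = [s for s in sentences if len(s) >= min_len]
    # Longest first; ties broken alphabetically so the result is deterministic.
    candidates.sort(key=lambda s: (-len(s), s))
    return frozenset(candidates[:k])


def signature(sentence, bits=16):
    """Short signature of a sentence (assumed here: truncated MD5)."""
    digest = hashlib.md5(sentence.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


def find_duplicates(pages, threshold=0.6):
    """Return pairs (id_a, id_b), id_a < id_b, judged to be duplicates.

    pages maps a page id to its extracted text.
    """
    features = {pid: feature_sentences(text) for pid, text in pages.items()}

    # Hash every feature sentence into a bucket keyed by its signature.
    buckets = defaultdict(set)
    for pid, sents in features.items():
        for s in sents:
            buckets[signature(s)].add(pid)

    # Only pages that share a bucket become candidate pairs.
    candidates = set()
    for pids in buckets.values():
        ordered = sorted(pids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                candidates.add((a, b))

    # Confirm each candidate pair by comparing the feature sentence sets.
    duplicates = set()
    for a, b in candidates:
        fa, fb = features[a], features[b]
        overlap = len(fa & fb) / max(len(fa), len(fb))
        if overlap >= threshold:
            duplicates.add((a, b))
    return duplicates
```

Under these assumptions, the full feature-sentence comparison runs only for pages that land in a common bucket rather than for every pair of pages, which is the narrowing effect the abstract describes.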
Source
Journal of Yunnan Minzu University: Natural Sciences Edition (《云南民族大学学报(自然科学版)》)
CAS
2008, No. 4, pp. 380-382 (3 pages)
Funding
Kunming Technological Innovation Funding Program for Science and Technology-Based Small and Medium-Sized Enterprises (CL2007061)
Keywords
web page deduplication
signature
feature sentence set