
Design and Implementation of a Massive Network Text Deduplication System (cited by: 6)
Abstract: With the rapid development of the Internet and information technology, an enormous volume of text data is generated every day. Inevitably, much of this content is duplicated, so users see many near-identical items when they search with a search engine or browse a website. This degrades the user experience and also forces content providers to spend more resources storing redundant content. Performing basic similarity checks on text and removing duplicates is therefore of significant value. This paper designs and implements a text deduplication system based on simhash. The system computes similarity for the text content newly generated each day; for similar content, it generates only a single unique identifier and stores it in the database, which effectively excludes duplicate texts whose similarity is too high.
Authors: Tang Jianming (汤建明); Kou Xiaoqiang (寇小强) (National Computer System Engineering Research Institute of China, Beijing 100083, China)
Source: Computer Applications and Software (《计算机应用与软件》), Peking University Core Journal, 2018, No. 12, pp. 33-37 (5 pages)
Keywords: Text deduplication; Simhash; Similarity
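
The abstract describes the approach only at a high level: compute a simhash fingerprint for each new text, compare it with fingerprints already stored, and keep a single identifier for texts that are too similar. The following is a minimal sketch of that idea, not the authors' implementation. The 64-bit fingerprint width, the MD5-based token hash, the regex tokenizer, the Hamming-distance threshold of 3, and the names `simhash`, `hamming_distance`, and `Deduplicator` are all assumptions made for illustration; a production system for Chinese text would likely use proper word segmentation, term weighting, and an index rather than a linear scan.

```python
import hashlib
import re

HASH_BITS = 64  # assumed fingerprint width


def tokenize(text):
    # Assumption: simple word tokens; real Chinese text needs word segmentation.
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=HASH_BITS):
    """Return a simhash fingerprint of `text` as an integer of `bits` bits."""
    weights = [0] * bits
    for token in tokenize(text):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the token
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class Deduplicator:
    """Keeps one identifier per group of near-duplicate texts (linear scan)."""

    def __init__(self, max_distance=3):  # threshold is an assumed value
        self.max_distance = max_distance
        self.fingerprints = []  # list index serves as the unique identifier

    def add(self, text):
        """Return (identifier, is_new) for `text`."""
        fp = simhash(text)
        for doc_id, known in enumerate(self.fingerprints):
            if hamming_distance(fp, known) <= self.max_distance:
                return doc_id, False  # near-duplicate of a stored text
        self.fingerprints.append(fp)
        return len(self.fingerprints) - 1, True
```

Calling `Deduplicator().add(text)` on each incoming text returns the identifier of an existing near-duplicate when one is found, and otherwise stores the new fingerprint and returns a fresh identifier.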