A Big Data Deduplication Algorithm Based on Simhash
Original title: 基于Simhash的大数据去重改进算法 (An Improved Big Data Deduplication Algorithm Based on Simhash)
Cited by: 2
Abstract: Data deduplication is one of the most important steps in big data preprocessing. To make deduplication more efficient and to improve how the classic algorithm behaves in unfavorable cases, this paper works with raw Chinese microblog data and modifies the similarity formula of the classic Simhash method so that it takes the text duplication rate into account. In the retrieval step it borrows the idea of bucket sort and allocates threads over multiple rounds and at multiple levels to improve efficiency. The experimental results show that, compared with the classic algorithm, the improved algorithm significantly reduces running time and improves accuracy.
Author: Zhou Chunhui (周春晖)
Source: Computer and Modernization (《计算机与现代化》), 2017, No. 7, pp. 38-41 (4 pages)
Keywords: microblog; big data; deduplication; Simhash; multi-thread
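
For orientation, the classic Simhash baseline that the abstract builds on can be sketched roughly as follows. This is only a minimal sketch of the standard method, not the paper's improved similarity formula or its multi-level thread allocation, neither of which is spelled out in this record. The 64-bit fingerprint width, the character-bigram features (a stand-in for proper Chinese word segmentation), the MD5-based feature hash, and the helper names features, simhash, hamming and band_keys are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumption, not stated in the abstract


def features(text, n=2):
    """Character n-grams with counts (hypothetical stand-in for word segmentation)."""
    text = "".join(text.split())  # drop all whitespace
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simhash(text):
    """Classic Simhash: each feature hash casts a weighted vote on every bit."""
    votes = [0] * BITS
    for feat, weight in features(text).items():
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # take 64 bits of the digest
        for bit in range(BITS):
            votes[bit] += weight if (h >> bit) & 1 else -weight
    # A bit of the fingerprint is 1 where the votes are positive.
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def band_keys(fp, bands=4):
    """Split a fingerprint into equal-width bands usable as bucket keys."""
    width = BITS // bands
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(bands)]
```

With 4 bands, two fingerprints that differ in at most 3 bits must agree on at least one band, so near-duplicate candidates can be found by comparing only fingerprints that share a bucket key. Such buckets are independent of one another, which is one plausible reason a bucket-based retrieval step lends itself to multi-threading.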

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部