
Distributed Index for Near Duplicate Detection (面向文本拷贝检测的分布式索引)

Cited by: 2
Abstract: How to efficiently detect near-duplicate documents in a large corpus has long attracted researchers' attention. Near-duplicate detection algorithms usually rely on an inverted index, so a well-designed index structure is critical to their performance. As corpus size grows, however, a single-machine index can no longer meet the demands of duplicate detection, and a distributed index becomes necessary; to keep pace with ever-growing collections, such an index must offer both high efficiency and good scalability. This paper compares two distributed index structures, the Term-Split index and the Doc-Split index, gives Map-Reduce implementations for building both, and presents two corresponding near-duplicate detection methods, the Term-Split approach and the Doc-Split approach. Experiments on the WT10G corpus show that the Doc-Split approach is more efficient and scales better.
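The two index layouts contrasted in the abstract can be sketched in a few lines. The following is a minimal local simulation of the Map-Reduce flow, not the paper's implementation: function and variable names (`term_split_map`, `doc_split_map`, `NUM_PARTITIONS`, the toy corpus) are illustrative assumptions. Term-Split keys the shuffle by term, so each reducer owns complete posting lists for a slice of the vocabulary; Doc-Split keys by a partition of the document ids, so each reducer builds a full local inverted index over its own subset of documents.

```python
from collections import defaultdict

NUM_PARTITIONS = 2  # illustrative; the paper's cluster configuration is not assumed here

def term_split_map(doc_id, terms):
    # Term-Split: emit (term, doc_id) so the shuffle groups postings by term;
    # each reducer ends up owning the full posting list for its terms.
    for term in set(terms):
        yield term, doc_id

def doc_split_map(doc_id, terms):
    # Doc-Split: emit (partition, document) so each reducer builds a complete
    # local inverted index over its own subset of the corpus.
    yield doc_id % NUM_PARTITIONS, (doc_id, terms)

def shuffle(pairs):
    # Stand-in for the Map-Reduce shuffle phase: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def build_term_split_index(corpus):
    pairs = (kv for doc_id, terms in corpus for kv in term_split_map(doc_id, terms))
    # Reduce: one sorted posting list per term, spread across the vocabulary.
    return {term: sorted(doc_ids) for term, doc_ids in shuffle(pairs).items()}

def build_doc_split_index(corpus):
    pairs = (kv for doc_id, terms in corpus for kv in doc_split_map(doc_id, terms))
    # Reduce: each partition holds a self-contained inverted index.
    local_indexes = {}
    for partition, docs in shuffle(pairs).items():
        index = defaultdict(list)
        for doc_id, terms in docs:
            for term in set(terms):
                index[term].append(doc_id)
        local_indexes[partition] = dict(index)
    return local_indexes

corpus = [(0, ["near", "duplicate"]),
          (1, ["duplicate", "detection"]),
          (2, ["near", "detection"])]
print(build_term_split_index(corpus)["duplicate"])  # → [0, 1]
```

Under this sketch the trade-off the paper measures becomes visible: a Term-Split reducer may receive a very long posting list for a frequent term, whereas Doc-Split spreads documents evenly and lets each node compare candidate pairs against a purely local index.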
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University core journal, 2011, No. 1, pp. 91-97 (7 pages)
Funding: National Natural Science Foundation of China (61073069, 61003092); National High-Tech R&D Program of China (863 Program) (2009AA01A346)
Keywords: near duplicate detection; copy detection; Map-Reduce

