
Distributed Index for Near Duplicate Detection (面向文本拷贝检测的分布式索引)

Cited by: 2
Abstract: How to efficiently detect near-duplicate documents in a large corpus has long attracted researchers' attention. Near-duplicate detection algorithms usually rely on an inverted index, so a well-designed index structure is critical to their performance. As corpus size grows, however, a single-machine index can no longer meet the demands of duplicate detection, and a distributed index becomes necessary; to keep pace with ever-growing collections, such an index must offer both high efficiency and good scalability. This paper compares two distributed index structures, the Term-Split index and the Doc-Split index, gives Map-Reduce implementations for building both, and presents two corresponding near-duplicate detection methods, the Term-Split approach and the Doc-Split approach. Experiments on the WT10G corpus show that the Doc-Split approach is more efficient and scales better.
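The two index layouts contrasted in the abstract can be sketched in a few lines. The following is a minimal local simulation of the Map-Reduce flow, not the paper's implementation: function and variable names (`term_split_map`, `doc_split_map`, `NUM_PARTITIONS`, the toy corpus) are illustrative assumptions. Term-Split keys the shuffle by term, so each reducer owns complete posting lists for a slice of the vocabulary; Doc-Split keys by a partition of the document ids, so each reducer builds a full local inverted index over its own subset of documents.

```python
from collections import defaultdict

NUM_PARTITIONS = 2  # illustrative; the paper's cluster configuration is not assumed here

def term_split_map(doc_id, terms):
    # Term-Split: emit (term, doc_id) so the shuffle groups postings by term;
    # each reducer ends up owning the full posting list for its terms.
    for term in set(terms):
        yield term, doc_id

def doc_split_map(doc_id, terms):
    # Doc-Split: emit (partition, document) so each reducer builds a complete
    # local inverted index over its own subset of the corpus.
    yield doc_id % NUM_PARTITIONS, (doc_id, terms)

def shuffle(pairs):
    # Stand-in for the Map-Reduce shuffle phase: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def build_term_split_index(corpus):
    pairs = (kv for doc_id, terms in corpus for kv in term_split_map(doc_id, terms))
    # Reduce: one sorted posting list per term, spread across the vocabulary.
    return {term: sorted(doc_ids) for term, doc_ids in shuffle(pairs).items()}

def build_doc_split_index(corpus):
    pairs = (kv for doc_id, terms in corpus for kv in doc_split_map(doc_id, terms))
    # Reduce: each partition holds a self-contained inverted index.
    local_indexes = {}
    for partition, docs in shuffle(pairs).items():
        index = defaultdict(list)
        for doc_id, terms in docs:
            for term in set(terms):
                index[term].append(doc_id)
        local_indexes[partition] = dict(index)
    return local_indexes

corpus = [(0, ["near", "duplicate"]),
          (1, ["duplicate", "detection"]),
          (2, ["near", "detection"])]
print(build_term_split_index(corpus)["duplicate"])  # → [0, 1]
```

Under this sketch the trade-off the paper measures becomes visible: a Term-Split reducer may receive a very long posting list for a frequent term, whereas Doc-Split spreads documents evenly and lets each node compare candidate pairs against a purely local index.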
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University core journal, 2011, No. 1, pp. 91-97 (7 pages)
Funding: National Natural Science Foundation of China (61073069, 61003092); National High-Tech R&D Program of China (863 Program) (2009AA01A346)
Keywords: near duplicate detection; copy detection; Map-Reduce

