Pass-Join-K:多分段匹配的相似性连接算法

Pass-Join-K: Similarity Join Method Based on Multi-Match Partition

下载PDF

导出

摘要相似性连接是数据清理工作的基本模型，获得了大量数据库工作者的关注。研究了基于编辑距离的相似性连接问题，即在两个字符串集合中寻找编辑距离小于一个阈值的字符串对，并在Pass-Join算法的基础上，提出了一个新的Pass-Join-K算法。Pass-Join—K算法在长短字符串上都有很好的表现。该算法的主要思想是利用Pass-Join算法的划分原理，以多次匹配的方式，达到更加严格地选取候选配对的目的。实验结果显示，Pass-Join-K算法减少了候选对的数量，在实际数据集上相比元算法在运行时间上有2～5倍的提升。 Similarity join is the basic model of data cleaning in the database research and has attracted lots of attention from the database community. This paper studies the edit distance based similarity join, which finds similar strings from two large sets of strings whose edit distance is less than a given threshold, and proposes an improved Pass-Join algorithm, named Pass-Join-K. Pass-Join-K is efficient both for short strings and long strings. The main idea of Pass-Join-K is to divide the query string into more parts based on the partition strategy of Pass-Join, and filter the candidate string pairs more strictly by multi-match. The experimental results show that Pass-Join-K can decrease the candidate pairs, and run 2-5 times more quickly than the origin algorithm which outperforms state-of-the-art methods on real datasets.

作者余海洋林琛陈珂江弋邹权

机构地区厦门大学信息科学与技术学院厦门大学深圳研究院广东石油化工学院计算机科学与技术系

出处《计算机科学与探索》 CSCD 2013年第10期924-932,共9页 Journal of Frontiers of Computer Science and Technology

基金国家自然科学基金Nos.61102136 61001013 福建省自然科学基金No.2011J05158 深圳市科技创新基础研究No.JCYJ20120618155655087~~

关键词编辑距离相似性连接多次匹配数据清理 Pass—Join—K算法 edit distance similarity join multi-match data cleaning Pass-Join-K

分类号 TP391 [自动化与计算机技术—计算机应用技术]

引文网络
相关文献

参考文献15

1Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning[C]//Proceedings of the 22nd International Conference on Data Engineering (ICDE '06). Washington, DC, USA: IEEE Computer Society, 2006: 5.
2Gravano L, Ipeirotis P G, Jagadish H V, et al. Approximate string joins in a database (almost) for fi'ee[C]//Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01), Rome, Italy, 2001. San Francisco, CA, USA:Morgan Kaufmann Publishers Inc, 2001: 491-500.
3Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins[C]//Proceedings of the 32nd International Conference on Very large data bases (VLDB '06), Seoul, Korea, 2006: 918-929.
4Bayardo R J, Ma Yiming, Srikant R. Scaling up all pairs sim- ilarity search[C]//Proceedings of the 16th International Con- ference on World Wide Web (WWW '07), Alberta, Canada, 2007. New York, NY, USA: ACM, 2007: 131-140.
5Xiao Chuan, Wang Wei, Lin Xuemin, et al. Efficient similarity joins for near duplicate detection[C]//Proceedings of the 17th International Conference on World Wide Web (WWW '08), Beijing, China, 2008. New York, NY, USA: ACM, 2008: 131-140.
6Xiao Chuan, Wang Wei, Lin Xuemin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints[J]. Proceedings of the VLDB Endowment, 2008, 1(1): 933-944.
7Wang Jiannan, Feng Jianhua, Li Guoliang. Tile-join: efficient Trie-based string similarity joins with edit-distance con- straints[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2): 1219-1230.
8Li Guoliang, Deng Dong, Wang Jiannan, et al. Pass-join: a partition-based method for similarity joins[J]. Proceedings of the VLDB Endowment, 2011, 5(3): 253-264.
9Wagner R A, Fischer M J. The string-to-string correctionproblem[J]. Journal of the ACM, 1974, 21(1): 168-173.
10Masek W J, Paterson M. A faster algorithm computing string edit distances[J]. Journal of Computer and System Sciences, 1980, 20(1): 18-31.

1徐媛媛,陈华辉.基于MapReduce的增量式数据集的相似性连接[J].计算机应用研究,2014,31(11):3369-3374. 被引量：2
2陈懿诚,骆吉洲,李建中.Part-Join:基于划分的字符串相似性连接[J].计算机应用研究,2014,31(10):3002-3006.
3周健雯,李聪聪,熊赟,朱扬勇.一种基于R*树的自相似性连接算法[J].计算机应用与软件,2014,31(8):50-53. 被引量：1
4庞俊,于戈,许嘉,谷峪.基于MapReduce框架的海量数据相似性连接研究进展[J].计算机科学,2015,42(1):1-5. 被引量：16
5韩恺,岳丽华,龚育昌.利用关系数据库系统对半结构化数据进行近似查询[J].中国科学技术大学学报,2005,35(5):674-682. 被引量：3
6董明秀,王鹏,汪洋,李秋虹,汪卫.基于索引的内存相似性连接算法[J].计算机工程,2016,42(1):18-24.
7庞俊,谷峪,许嘉,于戈.相似性连接查询技术研究进展[J].计算机科学与探索,2013,7(1):1-13. 被引量：15
8吕春梅,韩帅,胡朝举.等级保护在信息安全中的应用研究[J].计算机光盘软件与应用,2011(13):33-33.
9王伟.浅析云盘技术及存储原理[J].电脑知识与技术,2015,0(4):218-218.
10郝建柏,陈贤富,黄双福,杨俊.一种基于模糊近邻标签传递的半监督分类算法[J].微电子学与计算机,2010,27(2):30-33. 被引量：6

计算机科学与探索

2013年第10期

浏览历史

内容加载中请稍等...

Pass-Join-K:多分段匹配的相似性连接算法

参考文献15

相关作者

相关机构

相关主题

浏览历史