String similarity join with different similarity thresholds based on novel indexing techniques (cited by: 2)

Abstract: String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a single threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate many more typos than shorter ones. Therefore, we proposed a solution for string similarity joins that supports different similarity thresholds in a single operator. In order to support different thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we proposed new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
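
To make the idea of length-dependent thresholds concrete, the following is a minimal Python sketch of a similarity join in which the number of tolerated edits grows with string length. It is not the partition based or similarity aware index described in the abstract; it is a naive nested-loop join with a simple length filter. The function names (`edit_distance`, `length_threshold`, `similarity_join`) and the rule of one allowed edit per five characters are assumptions chosen purely for illustration.

```python
from typing import Callable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def length_threshold(length: int) -> int:
    """Hypothetical rule: tolerate one edit per five characters."""
    return length // 5


def similarity_join(
    r: List[str],
    s: List[str],
    threshold: Callable[[int], int] = length_threshold,
) -> List[Tuple[str, str]]:
    """Naive join: every pair is checked against its own threshold."""
    results = []
    for x in r:
        for y in s:
            # The threshold depends on the shorter string of the pair
            # (an assumption of this sketch).
            tau = threshold(min(len(x), len(y)))
            # Length filter: a length gap above tau already rules the pair out.
            if abs(len(x) - len(y)) > tau:
                continue
            if edit_distance(x, y) <= tau:
                results.append((x, y))
    return results


if __name__ == "__main__":
    left = ["cat", "similarity join"]
    right = ["cut", "similarty joins", "cat"]
    print(similarity_join(left, right))
```

Under this rule, short strings such as "cat" must match exactly, while the 15-character pair is allowed up to three edits, which mirrors the observation that longer string pairs tolerate more typos than shorter ones.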

Authors: Chuitian RONG, Yasin N. SILVA, Chunqing LI

Affiliations: School of Computer Science & Software Engineering; School of Mathematical & Natural Sciences

Source: Frontiers of Computer Science (indexed in SCIE, EI, CSCD), 2017, Issue 2, pp. 307-319 (13 pages). Chinese title: 中国计算机科学前沿（英文版）

Funding: This work was supported by the China Scholarship Council and the National Natural Science Foundation of China (Grant Nos. 61402329 and 51378350).

Keywords: similarity join, similarity aware index, similarity thresholds

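Classification: TP311.132 [Automation and Computer Technology: Computer Software and Theory]; U412.1 [Transportation Engineering: Road and Railway Engineering]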
