String similarity search and join: a survey (cited by: 4)

Abstract: String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.

Authors: Minghe YU, Guoliang LI, Dong DENG, Jianhua FENG

Affiliation: Department of Computer Science

Source: Frontiers of Computer Science (中国计算机科学前沿, English edition), indexed in SCIE, EI, and CSCD. 2016, Issue 3, pp. 399-417 (19 pages).

Funding: This work was partly supported by the National Grand Fundamental Research 973 Program of China (2015CB358700), the National Natural Science Foundation of China (Grant Nos. 61422205, 61472198), Beijing Higher Education Young Elite Teacher Project (YETP0105), Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, "NEXT Research Center", Singapore (WBS: R-252-300-001-490), Huawei, Shenzhou, FDCT/116/2013/A3, MYRG105(Y1-L3)-FST13-GZ, National High-Tech R&D (863) Program of China (2012AA012600), and the Chinese Special Project of Science and Technology (2013zx01039-002-002).

Keywords: string similarity, similarity search, similarity join, top-k

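Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]; TP311.1 [Automation and Computer Technology: Computer Software and Theory]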





