
String similarity search and join: a survey (cited by: 4)

Abstract: String similarity search and join are two important operations in data cleaning and integration. They extend the traditional exact search and exact join operations in databases by tolerating errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the past decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
Source: Frontiers of Computer Science (中国计算机科学前沿, English edition), indexed in SCIE, EI, and CSCD. 2016, Issue 3, pp. 399-417 (19 pages).
Funding: This work was partly supported by the National Grand Fundamental Research 973 Program of China (2015CB358700), the National Natural Science Foundation of China (Grant Nos. 61422205, 61472198), Beijing Higher Education Young Elite Teacher Project (YETP0105), Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, "NEXT Research Center", Singapore (WBS: R-252-300-001-490), Huawei, Shenzhou, FDCT/116/2013/A3, MYRG105(Y1-L3)-FST13-GZ, National High-Tech R&D (863) Program of China (2012AA012600), and the Chinese Special Project of Science and Technology (2013zx01039-002-002).
Keywords: string similarity, similarity search, similarity join, top-k
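To make the two operations named in the abstract concrete, here is a minimal sketch in Python. It is not taken from the survey: the function names, the choice of edit distance and q-gram Jaccard as example similarity functions, and the brute-force search and join are all illustrative assumptions. The algorithms the survey covers exist precisely to avoid this kind of exhaustive comparison.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of all length-q substrings of s (empty if s is shorter than q)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # convention chosen here for two empty gram sets
    return len(ga & gb) / len(ga | gb)


def similarity_search(query: str, strings: list, tau: int) -> list:
    """Naive search: every string within edit distance tau of the query."""
    return [s for s in strings if edit_distance(query, s) <= tau]


def similarity_join(strings: list, tau: int) -> list:
    """Naive self-join: every pair of strings within edit distance tau."""
    return [(s, t) for s, t in combinations(strings, 2)
            if edit_distance(s, t) <= tau]
```

For example, `similarity_search("join", ["join", "joins", "search"], 1)` keeps `"join"` and `"joins"`, where an exact search would keep only `"join"`. The naive join compares every pair of strings, which is quadratic in the collection size.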

