摘要
String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been exten- sively studied in the recent decade, there is no thorough sur- vey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for siring similarity search and join. We also dis- cuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Fi- nally, we provide some open datasets and summarize some research challenges and open problems.
String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been exten- sively studied in the recent decade, there is no thorough sur- vey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for siring similarity search and join. We also dis- cuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Fi- nally, we provide some open datasets and summarize some research challenges and open problems.
基金
This work was partly supported by the National Grand Fundamental Research 973 Program of China (2015CB358700), the National Natural Science Foundation of China (Grant Nos. 61422205, 61472198), Beijing Higher Education Young Elite Teacher Project(YETP0105), Tsinghua-Tencent Joint Laboratory for Internet In- novation Technology, "NEXT Research Center", Singapore (WBS:R-252- 300-001-490), Huawei, Shenzhou, FDCT/ll6/2013/A3, MYRG105(Y1- L3)-FST13-GZ, National High-Tech R&D (863) Program of China (2012AA012600), and the Chinese Special Project of Science and Tech- nology (2013zx01039-002-002).