
Research on incremental grouping based on transferred similarity in entity resolution (cited by: 1)
Abstract: This paper investigates an incremental record grouping method, based on transferred similarity, that is suited to large data sets. The paper first analyzes how the similarity between records can be derived step by step. It then proposes how to construct reference groups based on a sorting key, how to incrementally update the reference groups based on transferred similarity, and how to implement these incremental updates with a union-find structure. Finally, experiments verify the feasibility and efficiency of the proposed method. The experimental results show that the proposed method improves both grouping quality and grouping efficiency more than traditional methods do. The paper does not analyze in detail the data quality problems present in the attribute values themselves, and it does not design a sorting key generation algorithm. The proposed method can help solve the problem of missed record matches in data cleaning, information integration and management, and related techniques. Because it designs its algorithms purely from a data processing perspective, it also offers good scalability, reusability, and independence from any particular domain.
Author: GAO Guangshang (Research Center for Modern Enterprise Management, Guilin University of Technology, Guilin 541004, China; School of Management, Guilin University of Technology, Guilin 541004, China)
Source: Systems Engineering-Theory & Practice (《系统工程理论与实践》; indexed in EI, CSSCI, CSCD, and the Peking University Core Journals list), 2019, No. 5, pp. 1287-1297 (11 pages)
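Funding: National Natural Science Foundation of China (71761008); Fund of the Key Research Base of Humanities and Social Sciences in Guangxi Universities (16YB010)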
Keywords: sorting key; transferred similarity; union-find; entity resolution; data quality
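
The abstract describes incremental grouping in which similarity is transferred between records and the groups are maintained with a union-find structure. The following is a minimal Python sketch of that general idea, not the paper's algorithm: the names `UnionFind`, `default_similar`, and `group_incrementally`, the use of `difflib.SequenceMatcher` as the similarity measure, the 0.8 threshold, and the choice to compare each new record only against the current root of every group are all assumptions made for illustration. Sorting-key construction of the reference groups is not modeled here.

```python
from difflib import SequenceMatcher


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def add(self, x):
        # Register x as its own singleton group if it is new.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1

    def find(self, x):
        # Locate the root, then point every node on the path at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        # Merge the groups of a and b, attaching the smaller to the larger.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


def default_similar(a, b, threshold=0.8):
    # Assumed similarity test: character-level ratio from difflib.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def group_incrementally(records, similar=default_similar):
    """Place records one at a time; return sorted lists of record ids."""
    uf = UnionFind()
    values = {}
    for rid, value in records:
        uf.add(rid)
        # Compare the new record with one representative (root) per group.
        roots = {uf.find(r) for r in values}
        for root in roots:
            if similar(value, values[root]):
                # A match with a representative joins its whole group, so
                # similarity is transferred to every member of that group.
                uf.union(rid, root)
        values[rid] = value
    groups = {}
    for rid in values:
        groups.setdefault(uf.find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())


records = [(1, "John Smith"), (2, "Jon Smith"),
           (3, "Alice Brown"), (4, "John Smyth")]
print(group_incrementally(records))
```

Comparing against one representative per group keeps the number of comparisons proportional to the number of groups rather than the number of records. The trade-off is that a record similar only to a non-representative member can be missed. How the paper handles this case is not stated in the abstract.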
