
Research on Duplicate Record Cleaning Algorithms in Data Warehouses (cited by: 4)

Research of data cleaning algorithm in data warehouse
Abstract: This paper proposes improvements to the "sort, detect, merge" algorithms used in duplicate record cleaning. The improved algorithm makes record sorting more efficient while maintaining the record matching rate. When detecting duplicate records, it takes five factors into account: the number of characters in the matched fields, the frequency with which characters occur in the two fields, the importance (weight) of each field within the record, the semantics of Chinese fields, and the tendency of the semantic focus to fall toward the end of a field. When merging duplicate records, it combines a clustering algorithm with a practical algorithm. Together, these changes improve the accuracy and robustness of duplicate record cleaning in data warehouses.
Source: 《信息化纵横》, 2009, No. 7, pp. 4-6 (3 pages)
Keywords: data cleaning; duplicate elimination; duplicate detection; data warehouse
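
The abstract names the factors used in duplicate detection but gives no formulas, so the following is only a minimal sketch of how a weighted "sort, then detect" pass might look. Everything specific in it is an assumption for illustration: the function names, the `tail_bias` parameter (a stand-in for "semantic focus toward the end"), the sliding window over sorted records, and the 0.8 threshold. It does not cover Chinese semantic analysis or the merge step.

```python
from collections import Counter


def field_similarity(a: str, b: str, tail_bias: float = 0.5) -> float:
    """Character-overlap similarity in [0, 1] that favors later characters.

    Hypothetical scoring, not the paper's formula: the character at 0-based
    index i of an n-character field gets weight 1 + tail_bias * (i + 1) / n.
    """
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0

    def weights(s: str) -> Counter:
        w = Counter()
        for i, ch in enumerate(s):
            w[ch] += 1.0 + tail_bias * (i + 1) / len(s)
        return w

    wa, wb = weights(a), weights(b)
    shared = sum(min(wa[ch], wb[ch]) for ch in wa.keys() & wb.keys())
    return shared / max(sum(wa.values()), sum(wb.values()))


def record_similarity(r1: dict, r2: dict, field_weights: dict) -> float:
    """Weighted average of per-field similarities (weights are assumed inputs)."""
    total = sum(field_weights.values())
    score = sum(
        w * field_similarity(r1.get(f, ""), r2.get(f, ""))
        for f, w in field_weights.items()
    )
    return score / total


def find_duplicates(records, field_weights, sort_field, window=3, threshold=0.8):
    """Sort by one field, then compare each record only with its next
    window - 1 neighbors in sorted order. Returns pairs of original indices."""
    order = sorted(range(len(records)), key=lambda i: records[i].get(sort_field, ""))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if record_similarity(records[i], records[j], field_weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Illustrative data only; the misspelled third record should pair with the first.
records = [
    {"name": "Beijing University", "city": "Beijing"},
    {"name": "Shanghai Library", "city": "Shanghai"},
    {"name": "Beijing Universty", "city": "Beijing"},
]
field_weights = {"name": 0.7, "city": 0.3}
print(find_duplicates(records, field_weights, sort_field="name", window=2))
```

Restricting comparisons to a window over sorted records keeps the number of comparisons roughly linear in the number of records. The cost is that duplicates which sort far apart are missed, which is why the choice of sort field matters for the matching rate.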
