Abstract
This paper proposes improvements to the "sort, detect, merge" algorithm for cleaning duplicate records. The improved algorithm sorts records more efficiently while maintaining the record matching rate. When detecting duplicate records, it considers five factors: the number of characters in the fields being matched, the frequency with which characters occur in the two fields, the importance (weight) of each field within the record, the semantics of Chinese fields, and the tendency of the semantic focus to fall toward the end of a field. When merging duplicate records, it combines a clustering algorithm with a practical algorithm. Together, these changes improve the accuracy and robustness of duplicate record cleaning in a data warehouse.
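As a rough illustration of the sort, detect, merge pipeline, the following is a minimal sketch and not the paper's algorithm. It assumes a sliding-window comparison over the sorted records, uses Python's `difflib.SequenceMatcher` as a stand-in for the character-level similarity, and groups matches with a union-find structure as a simple form of clustering. The field names, weights, window size, and threshold are all hypothetical, and the two semantic factors (Chinese field semantics and semantic focus toward the end) are not modeled.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the abstract only says that fields are
# weighted by their importance within the record.
WEIGHTS = {"name": 0.6, "city": 0.4}


def field_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in measure (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


def find_duplicate_clusters(records, key, window=3, threshold=0.85):
    """Sort by key, compare each record with the next window-1 records,
    and return clusters of record indices judged to be duplicates."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if record_similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Calling `find_duplicate_clusters(records, key=lambda r: r["name"])` on a list of dictionaries with `name` and `city` fields returns lists of record indices, one list per cluster; each cluster with more than one index would then be merged into a single record by whatever merge rule the application requires.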