Abstract
High-quality data is the most important factor in building a data warehouse, and low-quality data can adversely affect decision making. Approximately duplicate records originating from different data sources are one of the main data quality problems in data warehouse construction. Eliminating as many of them as possible before the source data enters the data warehouse can greatly improve data quality. To this end, the authors compare existing algorithms for eliminating approximately duplicate records, improve the SNM algorithm, and compare the traditional SNM method with the improved SNM algorithm experimentally. The results show that the improved SNM algorithm has a clear advantage in eliminating approximately duplicate records.
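The abstract names the SNM algorithm without describing it. As an illustration only, the following is a minimal sketch of a basic sorted-neighborhood pass as it is commonly described: build a sorting key, sort the records, then compare each record only with the records that precede it inside a fixed-size sliding window. It is not the authors' improved algorithm. The function names, the key construction, the window size, the similarity measure (`difflib.SequenceMatcher`), the threshold, and the sample records are all assumptions introduced here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean string similarity over the fields of two records (0.0 to 1.0)."""
    scores = [SequenceMatcher(None, x.lower(), y.lower()).ratio()
              for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def snm_duplicates(records, key, window=3, threshold=0.85):
    """Return index pairs of likely duplicates found by one basic SNM pass.

    `key`, `window` and `threshold` are illustrative parameters, not values
    taken from the paper.
    """
    # Step 1: sort record indices by the sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    # Step 2: slide a fixed-size window over the sorted list and compare
    # each record with the (window - 1) records that precede it.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data: (name, address).
records = [
    ("John Smith", "12 Main Street"),
    ("Mary Jones", "4 Oak Avenue"),
    ("Jon Smith", "12 Main Street"),
    ("Alan Brown", "77 Hill Road"),
]


def make_key(record):
    """Hypothetical key: first three characters of the name and of the address."""
    return (record[0][:3] + record[1][:3]).lower()


print(snm_duplicates(records, key=make_key))
```

In this sketch the two "Smith" records land next to each other after sorting, so they are compared and flagged as the pair `(0, 2)`. Records that sort further apart than the window size are never compared, which is the usual trade-off of the method between speed and recall.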
Source
《重庆理工大学学报(自然科学)》 (Journal of Chongqing University of Technology: Natural Science)
CAS
2016, No. 4, pp. 91-96 (6 pages)
Funding
National Natural Science Foundation of China (Grant No. 71473185)
Keywords
SNM algorithm
improved SNM algorithm
approximately duplicate record elimination