Abstract
To address the detection of approximately duplicate records in data cleaning, a detection method based on a genetic neural network is proposed. The method first computes the similarity between each pair of corresponding fields in two records. It then constructs a neural-network-based detection model and uses a genetic algorithm to optimize the weights of the network. Finally, the genetic neural network, trained on sample record pairs, combines the similarities across multiple fields to classify each record pair as duplicate or non-duplicate. Experimental results on datasets from different domains show that the method improves the accuracy and precision of approximately duplicate record detection over traditional methods.
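The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the field similarity measure, the network architecture, or the genetic operators, so the use of `difflib.SequenceMatcher`, a single hidden layer of three `tanh` units, elitist selection, one-point crossover, and Gaussian mutation are all hypothetical choices, as are the function names and default parameters.

```python
import math
import random
from difflib import SequenceMatcher


def field_similarities(rec_a, rec_b):
    """Similarity in [0, 1] for each pair of corresponding fields.

    Assumption: a character-based ratio stands in for whatever
    field similarity measure the paper actually uses.
    """
    return [SequenceMatcher(None, a.lower(), b.lower()).ratio()
            for a, b in zip(rec_a, rec_b)]


def predict(weights, sims, hidden=3):
    """One-hidden-layer network mapping field similarities to a
    duplicate score in [0, 1]. `weights` is a flat list."""
    n = len(sims)
    pos = 0
    hidden_out = []
    for _ in range(hidden):
        # n input weights followed by one bias per hidden unit
        z = weights[pos + n]
        z += sum(w * s for w, s in zip(weights[pos:pos + n], sims))
        hidden_out.append(math.tanh(z))
        pos += n + 1
    # `hidden` output weights followed by one output bias
    z = weights[pos + hidden]
    z += sum(w * h for w, h in zip(weights[pos:pos + hidden], hidden_out))
    z = max(-60.0, min(60.0, z))  # keep exp() within floating-point range
    return 1.0 / (1.0 + math.exp(-z))


def fitness(weights, samples, hidden=3):
    """Negative mean squared error over labelled similarity vectors."""
    err = sum((predict(weights, s, hidden) - y) ** 2 for s, y in samples)
    return -err / len(samples)


def evolve_weights(samples, hidden=3, pop_size=30, generations=40, seed=0):
    """Search for network weights with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    n_weights = hidden * (len(samples[0][0]) + 1) + hidden + 1
    pop = [[rng.uniform(-1, 1) for _ in range(n_weights)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, samples, hidden), reverse=True)
        elite = pop[:pop_size // 5]  # the best fifth survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_weights)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Gaussian mutation applied to roughly 10% of the genes
            child = [w + rng.gauss(0, 0.3) if rng.random() < 0.1 else w
                     for w in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, samples, hidden))
```

To use the sketch, one would build `samples` as a list of `(field_similarities(a, b), label)` tuples from labelled record pairs, call `evolve_weights(samples)`, and then treat a record pair as a duplicate when `predict` returns a score above a chosen threshold such as 0.5. Because the elite is carried over unchanged, the best fitness in the population never decreases from one generation to the next.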
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core (Peking University Core Journals)
2010, No. 7, pp. 1550-1553 (4 pages)
Funding
National High Technology Research and Development Program of China (863 Program) (2009AAJ127)
Keywords
approximately duplicate record detection
neural network
genetic algorithm
data cleaning
data integration