Abstract
To address the low precision and low efficiency of approximate duplicate record detection algorithms on massive data, an integrated weighting method and a filtering method based on string length were adopted for approximate duplicate record detection in datasets. The integrated weighting method combines user experience with mathematical statistics to compute the weight of each attribute, making the weight calculation more objective. The string-length-based filtering method exploits the length difference between strings to terminate the edit distance computation early, reducing the number of records to be matched during detection. Experimental results show that the weight vector produced by the integrated weighting method reflects the importance of each attribute more comprehensively and accurately, and that the string-length-based filtering method reduces the comparison time between records, effectively solving the problem of approximate duplicate record detection on massive data.
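The two techniques in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: the function names, the similarity formula (1 minus distance over the longer length), and the example weights are all assumptions. The length filter rejects a pair when the length difference alone already exceeds the distance threshold, and the row-minimum check abandons the dynamic-programming table as soon as the threshold can no longer be met; the record-level score is a weighted sum of per-attribute similarities.

```python
def edit_distance_with_cutoff(a, b, max_dist):
    """Levenshtein distance, or None as soon as it must exceed max_dist."""
    # Length filter: the edit distance is at least the length difference,
    # so unequal-enough strings are rejected without filling the DP table.
    if abs(len(a) - len(b)) > max_dist:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        # Early termination: row minima never decrease, so once every cell
        # in a row exceeds the threshold the final distance must as well.
        if min(curr) > max_dist:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None

def field_similarity(a, b, max_dist=2):
    """Similarity in [0, 1]; 0.0 when the pair is filtered out."""
    d = edit_distance_with_cutoff(a, b, max_dist)
    if d is None:
        return 0.0
    longest = max(len(a), len(b)) or 1
    return 1.0 - d / longest

def record_similarity(r1, r2, weights):
    """Weighted sum of per-attribute similarities; weights should sum to 1."""
    return sum(w * field_similarity(x, y) for w, x, y in zip(weights, r1, r2))
```

For example, comparing the records `("john smith", "new york")` and `("jon smith", "new york")` with weights `(0.7, 0.3)` gives a score of 0.93, since the name field differs by one edit. How the weights themselves are derived (combining expert judgment with statistical measures of each attribute) is the paper's integrated weighting method and is not reproduced here.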
Source
Journal of Computer Applications (《计算机应用》)
Indexed in: CSCD; Peking University Core Journals
2013, No. 8, pp. 2208-2211 (4 pages)
Funding
Jiangsu Province Science and Technology Support Program (BE2011156)
Keywords
massive data
approximate duplicate record
integrated weighting method
edit distance