摘要
k近邻(kNN)算法是缺失数据填补的常用算法,但由于需要逐个计算所有记录对之间的相似度,因此其填补耗时较高。为提高算法效率,提出结合局部敏感哈希(LSH)的k NN数据填补算法LSH-k NN。首先,对不存在缺失的完整记录进行局部敏感哈希,为之后查找近似最近邻提供索引;其次,针对枚举型、数值型以及混合型缺失数据分别提出对应的局部敏感哈希方法,对每一条待填补的不完整记录进行局部敏感哈希,按得到的哈希值找到与其疑似相似的候选记录;最后在候选记录中通过逐个计算相似度来找到其中相似程度最高的k条记录,并按照k NN算法对不完整记录进行填补。通过在4个真实数据集上的实验表明,结合局部敏感哈希的k NN填补算法LSH-k NN相对经典的k NN算法能够显著提高填补效率,并且保持准确性基本不变。
k-Nearest Neighbor( k NN) algorithm is commonly used in data imputation. It is of poor efficiency because of the similarity computation between every tow records. To solve the efficiency problem, an improved k NN data imputation algorithm combined with Locality Sensitive Hashing( LSH) named LSH-k NN was proposed. First, all the complete records were indexed in LSH way. Then corresponding LSH ways for nominal, numeric and mixed-type incomplete data were put forward, and LSH values for all the incomplete records were computed in the proposed way to find candidate similar records.Finally, the incomplete records' real distance to candidate similar records were calculated, and the top-k similar records for k NN imputation were found. The experimental results show that the proposed method LSH-k NN has higher efficiency than traditional k NN as well as keeping almost the same accuracy.
出处
《计算机应用》
CSCD
北大核心
2016年第2期397-401,共5页
journal of Computer Applications
基金
国家自然科学基金资助项目(61371196)
中国博士后科学基金特别资助项目(201003797)
解放军理工大学预研基金项目(20110604
41150301)~~
关键词
数据质量
数据完整性
数据填补
K近邻算法
局部敏感哈希
data quality
data integrity
data imputation
k-nearest neighbor(k NN) algorithm
Locality Sensitive Hashing(LSH)