期刊文献+

结合局部敏感哈希的k近邻数据填补算法 被引量:4

k-nearest neighbor data imputation algorithm combined with locality sensitive Hashing
下载PDF
导出
摘要 k近邻(kNN)算法是缺失数据填补的常用算法,但由于需要逐个计算所有记录对之间的相似度,因此其填补耗时较高。为提高算法效率,提出结合局部敏感哈希(LSH)的k NN数据填补算法LSH-k NN。首先,对不存在缺失的完整记录进行局部敏感哈希,为之后查找近似最近邻提供索引;其次,针对枚举型、数值型以及混合型缺失数据分别提出对应的局部敏感哈希方法,对每一条待填补的不完整记录进行局部敏感哈希,按得到的哈希值找到与其疑似相似的候选记录;最后在候选记录中通过逐个计算相似度来找到其中相似程度最高的k条记录,并按照k NN算法对不完整记录进行填补。通过在4个真实数据集上的实验表明,结合局部敏感哈希的k NN填补算法LSH-k NN相对经典的k NN算法能够显著提高填补效率,并且保持准确性基本不变。 k-Nearest Neighbor( k NN) algorithm is commonly used in data imputation. It is of poor efficiency because of the similarity computation between every tow records. To solve the efficiency problem, an improved k NN data imputation algorithm combined with Locality Sensitive Hashing( LSH) named LSH-k NN was proposed. First, all the complete records were indexed in LSH way. Then corresponding LSH ways for nominal, numeric and mixed-type incomplete data were put forward, and LSH values for all the incomplete records were computed in the proposed way to find candidate similar records.Finally, the incomplete records' real distance to candidate similar records were calculated, and the top-k similar records for k NN imputation were found. The experimental results show that the proposed method LSH-k NN has higher efficiency than traditional k NN as well as keeping almost the same accuracy.
出处 《计算机应用》 CSCD 北大核心 2016年第2期397-401,共5页 journal of Computer Applications
基金 国家自然科学基金资助项目(61371196) 中国博士后科学基金特别资助项目(201003797) 解放军理工大学预研基金项目(20110604 41150301)~~
关键词 数据质量 数据完整性 数据填补 K近邻算法 局部敏感哈希 data quality data integrity data imputation k-nearest neighbor(k NN) algorithm Locality Sensitive Hashing(LSH)
  • 相关文献

参考文献13

  • 1GARCIA-LAENCINA P J, SANCHO-GOMEZ J-L, FIGUEIRAS-VIDAL A R, et al. K nearest neighbors with mutual information for simultaneous classification and missing data imputation[J]. Neurocomputing, 2009, 72(7/8/9): 1483-1493.
  • 2WANG H, WANG S. Discovering patterns of missing data in survey databases: An application of rough sets[J]. Expert System with Applications, 2009, 36(3): 6256-6260.
  • 3LITTLE R J A, RUBIN D B. Statistical analysis with missing data[M]. New York: John Wiley & Sons, 2002: 19-20.
  • 4DONDERS A R, VAN DER HEIJDEN G J, STIJNEN T, et al. Review: a gentle introduction to imputation of missing values[J]. Journal of Clinical Epidemiology, 2006, 59(10): 1087-1091.
  • 5AITTOKALLIO T. Dealing with missing values in large-scale studies: microarray data imputation and beyond[J]. Briefings in Bioinformatics, 2010, 11(2): 253-264.
  • 6ANAGNOSTOPOULOS C, TRIANTAFILLOU P. Scaling out big data missing value imputations: pythia vs. godzilla[C]//KDD '14: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2014: 651-660.
  • 7Rajaraman A,Ullman J D.大数据:互联网大规模数据挖掘与分布式处理[M].王斌,译.北京:人民邮电出版社,2012:150-155.
  • 8BRODER A Z, CHARKAR M, FRIEZE A M, et al. Min-wise independent permutations[J]. Journal of Computer and System Sciences, 2000, 60(3): 630-659.
  • 9DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions[C]//SCG '04: Proceedings of the twentieth Annual Symposium on Computational Geometry. New York: ACM, 2004: 253-262.
  • 10ANDONI A, INDYK P. LSH Algorithm and Implementation (E2LSH) [EB/OL]. [2015-06-22]. http://web.mit.edu/andoni/www/LSH.

共引文献6

同被引文献35

引证文献4

二级引证文献19

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部