Abstract
Missing value imputation methods based on k-nearest neighbors typically measure sample similarity by the distance between samples, and they do not weight attributes differently when computing that distance, i.e., all attributes contribute equally. In non-uniformly distributed, imbalanced datasets, however, the heterogeneity of samples is often reflected in attributes that take uncommon values, so the similarity between samples depends on the probability of the attribute values, and the traditional distance formula does not measure it accurately enough. This article therefore proposes an adaptive k-nearest neighbor missing value imputation method (AkNNI) for non-uniformly distributed, imbalanced datasets. First, the probability density of the attributes is introduced to dynamically adjust the importance of each attribute, increasing the contribution of sparse values and reducing the contribution of frequent values in the distance calculation, so as to better express the heterogeneity of samples and capture the similarity between them. Second, because complete samples are scarce at high missing rates, a new k-nearest neighbor selection procedure is designed that jointly considers sample similarity and completeness. The experiments use six non-uniformly distributed datasets, compare the imputation performance of AkNNI with five other classical imputation methods, verify the classification performance of the imputed datasets with a k-nearest neighbor classifier, and explore in depth the relationships among the three evaluation metrics. The results show that AkNNI achieves higher imputation accuracy and classification accuracy: among the six imputation methods, AkNNI obtains the lowest average RMSE, the highest average Pearson correlation coefficient, and the highest average classification accuracy on each dataset. At high missing rates, AkNNI still maintains a low RMSE, a high Pearson correlation coefficient, and a high classification accuracy on each dataset.
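The abstract does not give the AkNNI formulas, so the following is only a minimal sketch of the two ideas it describes, under stated assumptions rather than the authors' implementation. It assumes a histogram estimate of each attribute's density, a pair weight of the form `1 - mean(log p)` so that rare values count more, and a hypothetical `penalty` factor that ranks less complete neighbors lower. All function and parameter names (`bin_probabilities`, `weighted_distance`, `aknn_style_impute`, `n_bins`, `penalty`) are illustrative.

```python
import math


def bin_probabilities(column, n_bins=5):
    """Histogram estimate: returns a function mapping a value to its bin probability.

    Assumes the column has at least one observed (non-None) value.
    """
    observed = [v for v in column if v is not None]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins or 1.0  # constant column: fall back to one wide bin

    def bin_of(v):
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in observed:
        counts[bin_of(v)] += 1
    total = len(observed)
    return lambda v: counts[bin_of(v)] / total


def weighted_distance(a, b, probs):
    """Density-weighted Euclidean distance over attributes observed in both rows."""
    num, shared = 0.0, 0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            continue
        # Assumed weight form: rare values (low bin probability) weigh more.
        rarity = -0.5 * (math.log(probs[j](x)) + math.log(probs[j](y)))
        num += (1.0 + rarity) * (x - y) ** 2
        shared += 1
    return math.sqrt(num / shared) if shared else math.inf


def aknn_style_impute(rows, k=3, n_bins=5, penalty=1.0):
    """Fill each None with the mean of the k best-ranked neighbors' values."""
    n_attr = len(rows[0])
    probs = [bin_probabilities([r[j] for r in rows], n_bins) for j in range(n_attr)]
    filled = [list(r) for r in rows]  # work on a copy; the input is not mutated
    for i, row in enumerate(rows):
        for j in range(n_attr):
            if row[j] is not None:
                continue
            scored = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a neighbor must have the target attribute observed
                d = weighted_distance(row, other, probs)
                if math.isinf(d):
                    continue  # no shared observed attributes
                # Hypothetical completeness term: incomplete rows rank lower.
                missing_frac = sum(v is None for v in other) / n_attr
                scored.append((d * (1.0 + penalty * missing_frac), other[j]))
            scored.sort(key=lambda t: t[0])
            nearest = [v for _, v in scored[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [0.9, 1.9, 9.0],
    [8.0, 9.0, 50.0],
    [1.0, 2.0, None],
]
filled = aknn_style_impute(rows, k=3)
```

In this toy example the last row is filled from the three rows that resemble it, and the distant fourth row is ignored. The histogram could be swapped for a kernel density estimate, and the plain mean for a distance-weighted mean; the abstract does not say which the paper uses.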
Authors
梁路
林俊跃
霍颖翔
LIANG Lu; LIN Junyue; HUO Yingxiang (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China; School of Future Technology, South China University of Technology, Guangzhou 510006, China)
Source
Journal of South China Normal University (Natural Science Edition) (《华南师范大学学报(自然科学版)》)
CAS
PKU Core (北大核心)
2024, Issue 4, pp. 80-90 (11 pages)
Funding
National Natural Science Foundation of China (62072120).
Keywords
Euclidean distance
k-nearest neighbor
missing value imputation
probability density
non-uniform distribution