Adaptive k-Nearest Neighbor Missing Value Imputation Method Based on Probability Density

Abstract: Missing value imputation methods based on k-nearest neighbors typically measure the similarity between samples by their distance and do not weight the attributes when computing it, so every attribute contributes equally to the distance. In non-uniformly distributed, imbalanced datasets, however, the heterogeneity of samples often shows up in attributes with uncommon values: the similarity between samples depends on the probability of the attribute values, and a traditional distance formula does not measure it accurately enough. This article therefore proposes an adaptive k-nearest neighbor imputation method (AkNNI) for non-uniformly distributed, imbalanced datasets. First, the probability density of each attribute is introduced to dynamically adjust attribute importance, amplifying the contribution of sparse values to the distance and shrinking that of frequent values, so as to better express sample heterogeneity and capture the similarity between samples. Second, because complete samples are scarce at high missing rates, a new k-nearest neighbor selection procedure is designed that considers both the similarity and the completeness of samples. Experiments on six non-uniformly distributed datasets compare the imputation performance of AkNNI with that of five classical imputation methods, evaluate the imputed datasets with a k-nearest neighbor classifier, and examine the relationships among the three evaluation metrics. The results show that AkNNI achieves higher imputation accuracy and classification accuracy: among the six imputation methods, AkNNI obtains the lowest average RMSE, the highest average Pearson correlation coefficient, and the highest average classification accuracy on every dataset. At high missing rates, AkNNI still maintains a relatively low RMSE, a relatively high Pearson correlation coefficient, and relatively high classification accuracy on every dataset.
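The abstract describes the two ideas (density-based attribute weighting and neighbor selection that accounts for completeness) only at a high level; it gives neither the weighting formula nor the selection rule. The Python sketch below is a rough illustration under stated assumptions, not the AkNNI definition from the article. The names `density_weights`, `aknn_impute_sketch`, and `rmse` are hypothetical, as are the histogram density estimate, the `1 - relative frequency` weight, and the distance-divided-by-completeness score. It also assumes every attribute has at least one observed value.

```python
import numpy as np


def density_weights(X, bins=10):
    """Per-cell weights so that rare (low-density) values count more.

    Assumption: density is estimated with a per-attribute histogram and the
    weight is 1 - relative frequency of the value's bin.
    """
    n, d = X.shape
    W = np.zeros((n, d))
    for j in range(d):
        obs = ~np.isnan(X[:, j])
        col = X[obs, j]
        counts, edges = np.histogram(col, bins=bins)
        # Interior edges map each observed value to a bin index 0..bins-1.
        idx = np.digitize(col, edges[1:-1])
        freq = counts[idx] / counts.sum()
        W[obs, j] = 1.0 - freq + 1e-6  # small offset keeps weights positive
    return W


def aknn_impute_sketch(X, k=3, bins=10):
    """Fill NaNs with the mean of k neighbors chosen by a weighted distance.

    Assumption: candidates are ranked by (weighted distance / completeness),
    so more complete rows are preferred at equal distance.
    """
    X = np.asarray(X, dtype=float)
    W = density_weights(X, bins)
    observed = ~np.isnan(X)
    completeness = observed.mean(axis=1)  # fraction of observed attributes
    filled = X.copy()
    for i, j in zip(*np.where(~observed)):
        scores = []
        # Only rows that actually observe attribute j can donate a value.
        for c in np.where(observed[:, j])[0]:
            shared = observed[i] & observed[c]
            if not shared.any():
                continue
            w = 0.5 * (W[i, shared] + W[c, shared])
            diff = X[i, shared] - X[c, shared]
            dist = np.sqrt(np.sum(w * diff ** 2) / shared.sum())
            scores.append((dist / completeness[c], c))
        if scores:
            nearest = [c for _, c in sorted(scores)[:k]]
            filled[i, j] = X[nearest, j].mean()
        else:
            # Fallback when no candidate shares an attribute with row i.
            filled[i, j] = np.nanmean(X[:, j])
    return filled


def rmse(true_vals, imputed_vals):
    """Root mean squared error between true and imputed values."""
    a = np.asarray(true_vals, dtype=float)
    b = np.asarray(imputed_vals, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Usage on synthetic skewed data with about 20% of cells masked at random.
rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=(60, 4))
mask = rng.random(data.shape) < 0.2
X_missing = data.copy()
X_missing[mask] = np.nan
filled = aknn_impute_sketch(X_missing, k=5)
print("RMSE on masked cells:", rmse(data[mask], filled[mask]))
```

Averaging the two samples' weights is one simple way to make the distance symmetric; other combinations (for example, taking the larger weight) would emphasize rare values more strongly.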

Authors: LIANG Lu (梁路); LIN Junyue (林俊跃); HUO Yingxiang (霍颖翔) (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China; School of Future Technology, South China University of Technology, Guangzhou 510006, China)

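Affiliations: School of Computer Science and Technology, Guangdong University of Technology; School of Future Technology, South China University of Technology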

Source: Journal of South China Normal University (Natural Science Edition) (《华南师范大学学报（自然科学版）》), 2024, No. 4, pp. 80-90 (11 pages). Indexed in CAS and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (62072120).

Keywords: Euclidean distance; k-nearest neighbor; missing value imputation; probability density; non-uniform distribution

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]

