Abstract
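In machine learning applications, missing value imputation is a preprocessing technique that can effectively improve data usability. However, when missing values are numerous or unevenly distributed, these techniques perform poorly. The active learning setting allows the machine to interact with users to obtain a small amount of critical data and improve classification accuracy. To address the limited amount of data that can be actively acquired, an active learning missing value imputation algorithm based on collaborative filtering weighted prediction is proposed (Collaborative Filtering weighted prediction based Active Learning, CFAL). First, sample-based and attribute-based collaborative filtering methods are used to predict each missing value separately. Then the missing entries are sorted by the difference between the two predictions: a small number of entries with large differences are actively acquired, and a small number with small differences are filled with the average of the two predictions. This process repeats until the number of actively acquired values reaches the given threshold, and the remaining missing values are filled with the mean of the predictions. Experiments on seven commonly used UCI datasets show that, compared with several popular imputation algorithms, CFAL improves data quality more and yields higher classification accuracy when applied with algorithms such as C4.5 and kNN.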
In machine learning applications, missing value imputation is an effective preprocessing technique designed to increase data availability. However, if there are many missing values, or if they are unevenly distributed across attributes, these techniques may not produce satisfactory results. The active learning scenario allows the machine to interact with users (also known as oracles) to obtain a small amount of critical data and improve classification accuracy. Most existing methods focus on obtaining class labels and rarely discuss obtaining missing values. This paper studies the active learning problem in which the number of missing values that can be actively obtained is pre-specified. We propose a missing value imputation algorithm called Collaborative Filtering weighted prediction based Active Learning (CFAL). First, both user-based and item-based collaborative filtering approaches (that is, sample-based and attribute-based) are employed to predict missing values. Second, the missing values are sorted by the deviation between the two predictions. Missing values with high deviation are actively obtained, while those with low deviation are filled with the average prediction. This process repeats until the number of active acquisitions reaches the pre-specified value. The remaining missing values are filled with the average prediction. We compare CFAL with popular missing value imputation algorithms, including EBN (an imputation algorithm for missing values based on EM and Bayesian networks), Mean, NB (Naïve Bayes), and kNN (k Nearest Neighbors), on seven popular UCI (University of California, Irvine) datasets. Results show that, when coupled with classifiers such as C4.5 and kNN, CFAL produces better classification accuracy than its counterparts.
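The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the similarity measure (inverse mean absolute difference over co-observed entries), the `batch` size per round, and the names `cf_predict`, `cfal_impute`, and `oracle` are all assumptions introduced here, and attributes are assumed to be numeric and scaled to a comparable range such as [0, 1].

```python
import numpy as np

def cf_predict(X, mask):
    """Sample-based CF (assumed form): predict each cell as a
    similarity-weighted average of the same column over other rows.
    X is a 2-D float array; mask is True where a value is known."""
    n, m = X.shape
    pred = np.zeros((n, m))
    # Fallback when no similar row has the column: mean of known values.
    col_mean = np.array([X[mask[:, j], j].mean() if mask[:, j].any() else 0.0
                         for j in range(m)])
    for i in range(n):
        sim = np.zeros(n)
        for k in range(n):
            both = mask[i] & mask[k]          # columns known in both rows
            if k != i and both.any():
                sim[k] = 1.0 / (1.0 + np.abs(X[i, both] - X[k, both]).mean())
        for j in range(m):
            w = sim * mask[:, j]              # only rows where column j is known
            vals = np.where(mask[:, j], X[:, j], 0.0)
            pred[i, j] = (w @ vals) / w.sum() if w.sum() > 0 else col_mean[j]
    return pred

def cfal_impute(X, oracle, budget, batch=1):
    """Hypothetical CFAL-style loop: query the cells where the two
    predictors disagree most, fill the cells where they agree most."""
    X = np.array(X, dtype=float)              # copy; NaN marks a missing value
    mask = ~np.isnan(X)
    queried = 0
    while queried < budget and not mask.all():
        p_row = cf_predict(X, mask)           # sample-based prediction
        p_col = cf_predict(X.T, mask.T).T     # attribute-based prediction
        cells = np.argwhere(~mask)
        gap = np.abs(p_row - p_col)[~mask]
        order = np.argsort(-gap, kind="stable")   # largest disagreement first
        k = min(batch, budget - queried, len(cells))
        ask, rest = order[:k], order[k:]
        fill = rest[::-1][:batch]             # smallest disagreement
        for i, j in cells[ask]:
            X[i, j] = oracle(i, j)            # actively acquired value
            mask[i, j] = True
        for i, j in cells[fill]:
            X[i, j] = (p_row[i, j] + p_col[i, j]) / 2
            mask[i, j] = True
        queried += k
    if not mask.all():                        # budget spent: fill the rest
        p_row = cf_predict(X, mask)
        p_col = cf_predict(X.T, mask.T).T
        X[~mask] = ((p_row + p_col) / 2)[~mask]
    return X
```

Here `oracle(i, j)` stands in for the user who supplies the true value of cell `(i, j)`, and `budget` plays the role of the pre-specified number of active acquisitions. The paper's actual similarity measures and weighting scheme are not given in the abstract and may differ.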
Source
《南京大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals (北大核心)
2018, No. 4, pp. 758-765 (8 pages)
Journal of Nanjing University (Natural Science)
Funding
National Natural Science Foundation of China (61379089)
Keywords
missing data
collaborative filtering
predictive imputation
active learning
classification