Abstract
In data mining and machine learning, a data preprocessing step is usually needed to eliminate errors, noise, inconsistent data, and missing values in a dataset. Filling in missing values is a particularly challenging task, because the quality of the filled values strongly affects the subsequent learning or mining algorithms. Existing filling algorithms, such as rough-set-based and nearest-neighbor-based methods, can handle the missing data problem to some extent. Unlike these methods, this paper proposes an extended information gain (IG) based algorithm for filling missing values, which makes full use of the implicit relationships between the attributes of the dataset. Extensive experiments show that the proposed algorithm is effective.
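The abstract does not spell out the extended algorithm, so the following is only a minimal sketch of the general idea of information-gain-guided filling, not the authors' method. It assumes categorical attributes, missing values stored as None, and a simple rule chosen here for illustration: pick the attribute with the highest information gain about the incomplete attribute, then fill each gap with the most frequent value among rows that share that attribute's value. The function names and the toy dataset are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(rows, target, attr):
    """Information gain of `attr` about `target`, over rows where both are known."""
    known = [r for r in rows if r[target] is not None and r[attr] is not None]
    if not known:
        return 0.0
    groups = defaultdict(list)
    for r in known:
        groups[r[attr]].append(r[target])
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in known]) - remainder


def impute(rows, target):
    """Fill missing `target` values using the attribute with the highest IG.

    Illustrative rule (an assumption, not taken from the paper): use the most
    frequent target value among rows with the same value of the best attribute,
    falling back to the overall most frequent target value.
    """
    attrs = [a for a in rows[0] if a != target]
    best = max(attrs, key=lambda a: information_gain(rows, target, a))

    by_value = defaultdict(Counter)  # best-attribute value -> target counts
    overall = Counter()              # target counts over all complete rows
    for r in rows:
        if r[target] is not None:
            overall[r[target]] += 1
            if r[best] is not None:
                by_value[r[best]][r[target]] += 1

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[target] is None:
            counts = by_value.get(r[best]) or overall
            if counts:
                r[target] = counts.most_common(1)[0][0]
        filled.append(r)
    return best, filled


# Hypothetical toy dataset: the last row is missing its "play" value.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": None},
]
best, filled = impute(rows, "play")
print(best, filled[-1]["play"])
```

In this toy data "outlook" fully determines "play" while "windy" carries no information about it, so the sketch selects "outlook" and fills the gap from the "sunny" rows. A more faithful implementation would follow the paper's own extension of the information gain criterion.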
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2006, Issue 24, pp. 4810-4812 (3 pages)
Keywords
machine learning
missing data imputation
information gain
classification accuracy