Abstract
In big data applications, most modeling methods assume a complete data set, but values are easily lost during data acquisition or storage, which can make modeling impossible. To address this, a clustering-based recursive filling method is proposed. Incomplete data are first pre-filled with the mean of their cluster to form an initial complete data set. The completed data are then clustered, and the initial filled values are corrected using the mean of the corresponding cluster. Filling stability is judged from the filling error, and the filled values are corrected through repeated recursive clustering until two consecutive fills are sufficiently stable or the number of iterations exceeds a threshold. Experimental results show that, compared with mean filling, K-nearest-neighbor filling, cluster filling, and rough-set-based incomplete data analysis, the method fills missing values more accurately, so that the final filled values are closer to the real data.
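The loop described in the abstract (pre-fill, cluster, refill from cluster means, repeat until stable) can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the abstract does not name the clustering algorithm, so a small k-means is used here; the pre-fill uses plain column means as a simplification of the cluster-mean pre-fill; and the stopping rule is assumed to be the squared change between two consecutive fills. The names `kmeans`, `recursive_cluster_fill`, `k`, `tol`, and `max_iter` are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=100):
    """Small k-means with deterministic farthest-point initialisation (assumed choice)."""
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest already-chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster becomes empty.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return labels, centers


def recursive_cluster_fill(X, k, tol=1e-6, max_iter=20):
    """Iteratively refill NaN entries with the mean of the cluster each row falls in."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Pre-fill (simplifying assumption): column means over the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for it in range(1, max_iter + 1):
        labels, centers = kmeans(filled, k)
        updated = filled.copy()
        # Only the originally missing entries are overwritten.
        updated[missing] = centers[labels][missing]
        # Assumed stability measure: squared change between consecutive fills.
        change = ((updated - filled) ** 2).sum()
        filled = updated
        if change < tol:
            break
    return filled, it
```

Calling `recursive_cluster_fill(X, k=2)` on an array with `np.nan` entries returns the filled array and the number of iterations used. Observed entries are never modified, and the loop stops either when consecutive fills barely differ or when `max_iter` is reached, mirroring the two stopping conditions stated in the abstract.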
Authors
李国和
杨绍伟
吴卫江
郑艺峰
LI Guohe; YANG Shaowei; WU Weijiang; ZHENG Yifeng (Beijing Key Lab of Petroleum Data Mining, China University of Petroleum (Beijing), Beijing 102249, China; College of Geophysics and Information Engineering, China University of Petroleum (Beijing), Beijing 102249, China; Key Laboratory of Data Science and Intelligence Application, Minnan Normal University, Zhangzhou, Fujian 363000, China; School of Computer Sciences, Minnan Normal University, Zhangzhou, Fujian 363000, China)
Source
《计算机工程》
CAS
CSCD
PKU Core (Peking University Core Journals)
2019, No. 9, pp. 32-39 (8 pages)
Computer Engineering
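Funding
National Natural Science Foundation of China (61701213)
Sub-project of the National Major Oil and Gas Special Project (G-5800-08-ZS-WX)
Research Start-up Fund of China University of Petroleum (Beijing), Karamay Campus (RCYJ2016B-03-001)
Young and Middle-aged Teachers Fund of the Fujian Provincial Department of Education (JA15300)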
Keywords
missing value
prefilling
clustering
recursive filling
square error