
Missing Data Imputation Approach Based on Generalized Centroids Clustering Algorithm (cited by: 11)
Abstract: With the continuing development of information technology, cloud computing, the Internet, and social networks, data volumes are growing explosively. While massive data brings rich information, how to preprocess it efficiently has become a research focus, and handling missing data is one of the major challenges in data preprocessing. Most traditional imputation methods consider only the case in which values in an incomplete dataset are completely missing. In massive datasets, however, human or mechanical factors damage data to varying degrees: some values are completely missing, while others are only partially missing. Traditional methods do not distinguish between these degrees of damage; they treat every damaged value as completely missing and so ignore the influence of partially missing data on the imputation result. This paper therefore proposes an imputation method based on generalized centroids clustering (GCF). It partitions the data into clusters using the idea of generalized centroids clustering, then fills in the missing values using the randomly damaged data together with the clustering results, in order to improve the accuracy of the imputed dataset. Experiments show that GCF achieves good imputation accuracy on dataset samples with different degrees of missingness.
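The abstract does not give GCF's distance measure, its treatment of partially missing values, or its update rules, so the following is only a minimal sketch of the general idea of cluster-based imputation, not the authors' method. It substitutes a plain k-means-style loop: distances use observed coordinates only, and each missing entry is filled from the centroid of the row's cluster. The function name `cluster_impute`, its parameters, and the choice to seed centroids from fully observed rows are all assumptions made for illustration.

```python
import numpy as np

def cluster_impute(X, k, n_iter=20, seed=0):
    """Fill NaN entries from the centroid of each row's cluster.

    Illustrative stand-in (k-means style), NOT the GCF algorithm.
    Assumes every column has at least one observed value and that
    at least k rows are fully observed.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    rng = np.random.default_rng(seed)

    # Start by filling missing entries with column means.
    col_mean = np.nanmean(X, axis=0)
    filled = np.where(observed, X, col_mean)

    # Assumption: seed the centroids from k fully observed rows.
    complete_idx = np.flatnonzero(observed.all(axis=1))
    init = rng.choice(complete_idx, size=k, replace=False)
    centroids = filled[init].copy()

    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared distance to each centroid over observed coordinates only.
        diff = (filled[:, None, :] - centroids[None, :, :]) * observed[:, None, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)

        # Recompute each non-empty cluster's centroid from the filled data.
        for j in range(k):
            members = labels == j
            if members.any():
                centroids[j] = filled[members].mean(axis=0)

        # Overwrite only the missing entries with their cluster centroid.
        filled = np.where(observed, X, centroids[labels])
    return filled, labels
```

Calling `cluster_impute(X, k=2)` on an array whose missing entries are `np.nan` returns the filled array and the cluster label of each row; observed entries are never modified. A faithful GCF implementation would additionally need the paper's handling of partially damaged values, which this sketch treats the same as fully missing ones.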
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 9, pp. 2017-2021 (5 pages)
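Funding: Supported by the National Natural Science Foundation of China (61472169, 61472072); the National Key Technology R&D Program of China (2012BAF13B08); the National "973" Key Basic Research Development Program, Preliminary Research Special Project (2014CB360509); and the Liaoning Province Scientific Public Welfare Research Fund (2015003003)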
Keywords: mass data; incomplete data; generalized centroids clustering; stochastic damage
