Abstract
Dimensionality reduction maps samples from the input space to a low-dimensional space through linear or nonlinear methods, yielding a low-dimensional representation of the original data set; it is an important preprocessing technique for high-dimensional data mining. To meet the requirements of data stream mining and keep the reduced data usable, this paper designs PSPCA, a parallel data stream dimensionality reduction algorithm based on principal component analysis (PCA). The algorithm uses a sliding window mechanism to determine the range of data to process, merges the standardization step of PCA, changes how the correlation coefficient matrix is computed, and parallelizes the relevant computations with MapReduce. The algorithm is implemented on the Storm stream processing platform. Taking the K-means clustering algorithm as an example, experiments compare the clustering results of K-means on the data sets before and after dimensionality reduction. The experimental results show that PSPCA is suitable for data stream dimensionality reduction, that the reduced data retains the information of the original data within a reasonable range, and that the accuracy of subsequent data mining can be ensured.
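The abstract gives no formulas, so the following is only an illustrative sketch of one way the described ideas could fit together, not the paper's PSPCA implementation. It assumes a count-based sliding window, and it assumes that "merging standardization" can be read as computing the correlation matrix directly from additive statistics (count, sums, sums of cross products), which are produced per chunk in a map step and merged in a reduce step. All function names (`map_stats`, `reduce_stats`, `correlation`, `leading_component`) and the sample data are hypothetical, and Storm integration is omitted.

```python
from collections import deque
from functools import reduce
import math

def map_stats(chunk):
    """Map step: per-chunk count, per-feature sums, and sums of cross products."""
    d = len(chunk[0])
    s = [0.0] * d
    ss = [[0.0] * d for _ in range(d)]
    for row in chunk:
        for i in range(d):
            s[i] += row[i]
            for j in range(d):
                ss[i][j] += row[i] * row[j]
    return len(chunk), s, ss

def reduce_stats(a, b):
    """Reduce step: the statistics are additive, so merging is element-wise."""
    n = a[0] + b[0]
    s = [x + y for x, y in zip(a[1], b[1])]
    ss = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, s, ss

def correlation(stats):
    """Correlation matrix from merged statistics; no separate standardization pass.
    Assumes every feature has non-zero variance inside the window."""
    n, s, ss = stats
    d = len(s)
    mean = [v / n for v in s]
    # Centered cross products: sum(x_i * x_j) - n * mean_i * mean_j
    c = [[ss[i][j] - n * mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return [[c[i][j] / math.sqrt(c[i][i] * c[j][j]) for j in range(d)]
            for i in range(d)]

def leading_component(r, iters=200):
    """Power iteration for the largest eigenvalue and its eigenvector."""
    d = len(r)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(r[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    rv = [sum(r[i][j] * v[j] for j in range(d)) for i in range(d)]
    eig = sum(v[i] * rv[i] for i in range(d))  # Rayleigh quotient
    return eig, v

# Hypothetical stream of 3-feature records; the window keeps the latest 6.
stream = [(1.0, 2.0, 0.5), (2.0, 4.1, 0.1), (3.0, 6.2, 0.9), (4.0, 7.9, 0.3),
          (5.0, 10.1, 0.7), (6.0, 12.0, 0.2), (7.0, 14.2, 0.8)]
window = deque(maxlen=6)
for record in stream:
    window.append(record)

rows = list(window)
chunks = [rows[i:i + 2] for i in range(0, len(rows), 2)]  # simulated map inputs
stats = reduce(reduce_stats, map(map_stats, chunks))
r = correlation(stats)
eig, v = leading_component(r)
print("leading eigenvalue share:", eig / len(r))
```

Because the per-chunk statistics are additive, the merged result does not depend on how the window is split, which is what makes this formulation easy to distribute. A full PCA would extract several eigenvectors and project the standardized records onto them; the power iteration here only recovers the first component to keep the sketch short.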
Source
Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition (《南京邮电大学学报(自然科学版)》)
Peking University Core Journal (北大核心)
2015, Issue 5, pp. 99-104 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (61302158, 61571238) and the ZTE Industry-University-Research Fund (KH0040314059)