
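Research on a parallelized data stream dimensionality reduction algorithm based on principal component analysis (original title: 基于主成分分析的并行化数据流降维算法研究) Cited by: 8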

Parallel data stream dimensional reduction algorithm based on principal component analysis
Abstract Dimensionality reduction is the process of mapping samples from the input space to a low-dimensional space by linear or nonlinear methods, so as to obtain a low-dimensional representation of the original data set. It is one of the important preprocessing steps in high-dimensional data mining. With the goals of meeting the requirements of data stream mining and keeping the reduced data usable, this paper designs PSPCA, a parallel data stream dimensionality reduction algorithm based on principal component analysis (PCA). The algorithm uses a sliding-window mechanism to determine the range of data to process, merges the standardization step of PCA into the computation, changes the way the correlation coefficient matrix is calculated, and parallelizes the related calculations with MapReduce. The algorithm is implemented on the stream processing platform Storm. Taking the clustering algorithm K-means as an example, experiments compare the clustering results of K-means on the data set before and after dimensionality reduction. The experimental results show that PSPCA is suitable for reducing the dimensionality of data streams, that the reduced data retains the information of the original data within a reasonable range, and that the accuracy of subsequent data mining can be ensured.
Source Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition (《南京邮电大学学报(自然科学版)》), PKU Core Journal, 2015, No. 5, pp. 99-104 (6 pages)
Funding Supported by the National Natural Science Foundation of China (61302158, 61571238) and the ZTE Industry-University-Research Fund (KH0040314059)
Keywords data stream; PCA; parallelization; Storm
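
The abstract does not give the formulas PSPCA uses, so the following is only a minimal Python sketch of one common way to combine the ideas it names: a sliding window bounds the data, each partition of the window is summarized by additive statistics (the map step), the summaries are added together (the reduce step), and the correlation matrix is derived directly from those sums, so no standardized copy of the data is ever built. It is not the authors' implementation. All function names, the window size, and the sample stream are hypothetical, and plain Python functions stand in for MapReduce and Storm.

```python
from collections import deque
from functools import reduce
from math import sqrt


def map_partition(rows):
    """Map step: summarize one partition as (count, column sums, cross-product sums)."""
    d = len(rows[0])
    sums = [sum(r[i] for r in rows) for i in range(d)]
    cross = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    return len(rows), sums, cross


def reduce_stats(a, b):
    """Reduce step: partial statistics combine by element-wise addition."""
    (na, sa, qa), (nb, sb, qb) = a, b
    d = len(sa)
    sums = [sa[i] + sb[i] for i in range(d)]
    cross = [[qa[i][j] + qb[i][j] for j in range(d)] for i in range(d)]
    return na + nb, sums, cross


def correlation_from_stats(stats):
    """Correlation matrix from the merged sums (assumes no constant column)."""
    n, sums, cross = stats
    d = len(sums)
    mean = [sums[i] / n for i in range(d)]
    # Population covariance computed from the sums alone.
    cov = [[cross[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    std = [sqrt(cov[i][i]) for i in range(d)]
    # Dividing by the standard deviations here folds standardization into this step.
    return [[cov[i][j] / (std[i] * std[j]) for j in range(d)] for i in range(d)]


def window_correlation(window, n_partitions=2):
    """Split the window into partitions, then map, reduce, and finish."""
    rows = list(window)
    size = -(-len(rows) // n_partitions)  # ceiling division
    parts = [rows[k:k + size] for k in range(0, len(rows), size)]
    return correlation_from_stats(reduce(reduce_stats, map(map_partition, parts)))


# Hypothetical stream of 3-dimensional records; the window keeps the latest 4.
window = deque(maxlen=4)
stream = [(1.0, 2.0, 9.0), (2.0, 4.0, 7.0), (3.0, 6.0, 8.0),
          (4.0, 8.0, 3.0), (5.0, 10.0, 1.0)]
for record in stream:
    window.append(record)

R = window_correlation(window)
```

In a full PCA pipeline, the eigenvectors of `R` with the largest eigenvalues would then define the projection applied to the standardized records in the window. Because the partial statistics are simply added, the result does not depend on how the window is partitioned, which is what makes this form of the computation easy to distribute.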

