
Research and Implementation of a Large-Scale Time Series Analysis Framework (大规模时间序列分析框架的研究与实现) Cited by: 9

An Analysis Framework for Large-Scale Time Series
Abstract In the era of the industrial Internet, hundreds of millions of sensors continuously generate time series every day, recording parameters of industrial equipment such as temperature, vibration, pressure, curvature, and tension. How to mine valuable information from these unstructured time series and apply it to condition monitoring, fault diagnosis, and control decisions has attracted wide attention and research. As data volumes keep growing, mainstream single-machine environments that offer fairly complete libraries of analysis algorithms, such as Matlab and R, can no longer cope well with the data-processing demands of large-scale time series analysis. Existing parallel analysis algorithms are few in number and are often bound to a particular platform; switching platforms requires redeveloping the algorithm, so extensibility is poor. This paper aims to design a general-purpose approximate-solution analysis framework that lets third-party algorithms be parallelized quickly and resolves the applicability problems caused by excessive data volume. The framework consists of three steps: task division, conquering, and merging. Task division uses redundancy to preserve the local correlation of the data and produces mutually independent subtasks, which reduces data communication and synchronization overhead among distributed nodes. For the task-division problem, the paper proposes an approximate-solution cost model and derives the optimal division scheme. A prototype system was designed and implemented on the Spark platform. Experimental results show that, while ensuring the accuracy of the analysis results, the system's speedup grows approximately linearly with the degree of parallelism, removing the data-size limit of single-machine algorithms. The system is also easy to integrate and extend, sparing data analysts from redeveloping algorithms.

English abstract In the era of the Industrial Internet, an explosive volume of time series is continuously generated by various sensors that measure quantities such as temperature, vibration, pressure, inclination, and strain. It is crucial to analyze these unstructured time series to extract valuable information for state monitoring, fault diagnosis, and control decisions. In specialized fields, abundant data mining and analysis tools are available for extracting potentially useful knowledge from time series. For example, R and Matlab contain a large number of algorithms for matrix manipulation. However, these single-machine environments become ineffective, or fail outright, when the volume and velocity of the time series are extremely large. Although several commercial and open-source software packages for distributed computing already exist, most of them provide only a limited number of parallel algorithms, and these parallel algorithms are platform-dependent. The same algorithm has to be developed again for each platform; in other words, it is difficult to carry a parallel algorithm from its original platform to other distributed platforms. This paper studies a Large-scale Time Series Analysis Framework (LTSAF), a general-purpose tool for quickly parallelizing third-party algorithms so that existing algorithms can analyze massive time series data efficiently. Following the divide-and-conquer idea, LTSAF produces an approximate solution when the exact solution cannot be obtained within a feasible period. The solution has three steps: division, data-parallel computation, and combination. The analysis task is first divided into a number of subtasks, each containing a segment of the original time series plus the redundancy needed to preserve data locality. Division makes the subtasks data-independent, so the synchronization overhead for intermediate results is greatly reduced. The independent subtasks are small enough to be solved directly by stand-alone algorithms. The subtask solutions are finally combined into an approximate solution to the original problem by removing the dirty parts. On the theoretical side, a time-space cost optimization model is established to determine how to divide the subtasks efficiently, and the optimal segment length is derived to balance time and space efficiency. The paper also presents a prototype system built on Spark to which verified algorithms can be ported. Experiments show that the approximate solution makes stand-alone algorithms applicable to large-scale time series while still producing precise results. The parallel algorithms greatly decrease the processing time of large datasets, and their scaleup is no longer limited by the size of the input data. In addition, because the subtasks are completely data-independent of one another, the speedup of the parallel algorithms increases approximately linearly with the degree of parallelism, which makes it easy to estimate the overall computation time for massive time series datasets in advance. As a cross-language and cross-platform approach, the prototype system integrates easily with third-party libraries, so users can focus on data analysis rather than on redeveloping existing algorithms.
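The three steps described in the abstract (division with redundancy, independent computation, and combination with removal of the dirty parts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `divide` and `analyze`, the parameters `segment_len` and `overlap`, and the moving average used as a stand-in for a third-party stand-alone algorithm. It is not the paper's implementation, it runs sequentially rather than on Spark, and it does not include the paper's cost model for choosing the segment length.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def moving_average(xs: Sequence[float], window: int) -> List[float]:
    """Hypothetical stand-alone algorithm: centered moving average.
    The window is truncated at the edges of the input it is given."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out


def divide(n: int, segment_len: int, overlap: int) -> Iterator[Tuple[int, int, int, int]]:
    """Division step: yield (start, end, core_start, core_end) per subtask.
    [core_start, core_end) is the part the subtask owns;
    [start, end) extends it with redundant points on both sides."""
    for core_start in range(0, n, segment_len):
        core_end = min(n, core_start + segment_len)
        start = max(0, core_start - overlap)
        end = min(n, core_end + overlap)
        yield start, end, core_start, core_end


def analyze(
    xs: Sequence[float],
    algo: Callable[[Sequence[float]], List[float]],
    segment_len: int,
    overlap: int,
) -> List[float]:
    """Run algo on each subtask independently, then combine the results,
    keeping only the owned part and discarding the redundant (dirty) edges."""
    result: List[float] = []
    for start, end, core_start, core_end in divide(len(xs), segment_len, overlap):
        partial = algo(xs[start:end])  # subtasks share no intermediate state
        result.extend(partial[core_start - start:core_end - start])
    return result
```

For an algorithm whose output at each point depends only on a bounded neighborhood, such as this moving average with `overlap` at least half the window, the combined result matches the single-pass result; for algorithms with longer-range dependencies the combined result is only an approximation. Because each loop iteration is independent, the loop body is the part that a data-parallel platform could distribute across nodes.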
Authors TENG Fei (滕飞); HUANG Qi-Chuan (黄齐川); LI Tian-Rui (李天瑞); WANG Chen (王晨); TIAN Chun-Hua (田春华) (School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031; State Key Laboratory of Rail Transit Engineering Informatization, China Railway First Survey and Design Institute Group, Xi'an 710043; National Engineering Laboratory for Big Data Software, Tsinghua University, Beijing 100084)
Source Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2020, No. 7, pp. 1279-1292 (14 pages)
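Funding Supported by the National Key Research and Development Program of China (2018YFB1701502) and the Sichuan Science and Technology Program (2019YJ0214).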
Keywords time series; algorithm parallelization (data parallel); approximate solution; divide and conquer; Spark