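Abstract
In the era of the Industrial Internet, hundreds of millions of sensors continuously generate time series every day, recording parameters of industrial equipment such as temperature, vibration, pressure, curvature and tension. How to mine valuable information from these unstructured time series and apply it to condition monitoring, fault diagnosis and control decisions has attracted wide attention and research. As data volumes keep growing, mainstream single-machine environments that offer fairly complete libraries of data analysis algorithms, such as Matlab and R, can no longer cope well with the data processing demands of large-scale time series analysis. Existing parallel analysis algorithms, meanwhile, are limited in number and often tied to a particular platform; switching platforms requires redeveloping the algorithms, so extensibility is poor. This paper aims to design a general approximate-solution analysis framework that lets third-party algorithms be parallelized quickly, solving the problem of algorithms becoming inapplicable because the data are too large. The framework consists of three main steps: task division, solving and merging. Task division uses redundancy to preserve the local correlation of the data and generates mutually independent subtasks, reducing data communication and synchronization overhead between distributed nodes. For the task division problem, the paper proposes an approximate-solution cost model and derives the optimal division scheme. A prototype system was designed and implemented on the Spark platform. Experimental results show that, while ensuring the accuracy of the analysis results, the system's speedup grows approximately linearly with the degree of parallelism, removing the data-size limit of single-machine algorithms. The system is also easy to integrate and extend, sparing data analysts from redeveloping algorithms.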
In the era of the Industrial Internet, massive volumes of time series are continuously generated by sensors that record parameters such as temperature, vibration, pressure, inclination and strain. It is crucial to analyze these unstructured time series to extract valuable information for state monitoring, fault diagnosis and control decisions. In specialized fields, abundant data mining and analysis tools are available for extracting knowledge from time series. For example, R and Matlab provide a large number of algorithms for matrix manipulation. However, these single-machine environments become ineffective, or even unusable, when the volume and velocity of the time series are extremely large. Although several commercial and open-source software packages for distributed computing already exist, most of them provide only a limited number of parallel algorithms, and these algorithms are platform-dependent. The same algorithm has to be redeveloped for each platform; that is, it is difficult to port a parallel algorithm from its original platform to other distributed platforms. This paper studies a Large-scale Time Series Analysis Framework (LTSAF), a general-purpose tool for quickly parallelizing third-party algorithms so that existing algorithms can analyze massive time series data efficiently. Based on the idea of divide and conquer, LTSAF produces an approximate solution when the exact solution cannot be obtained within a feasible period. The solution includes three steps: division, data-parallel computation and combination. The analysis task is first divided into a number of subtasks, each containing a segment of the original time series plus the redundancy necessary to preserve data locality. Division makes the subtasks data-independent, so the synchronization overhead of intermediate results is greatly reduced. The independent subtasks are small enough to be solved directly by stand-alone algorithms. The solutions of the subtasks are finally combined, after removing the dirty parts, into an approximate solution to the original problem. On the theoretical side, a time-space cost optimization model is established to determine how to divide the task efficiently, and the optimal segment length is derived to balance time and space efficiency. This paper also develops a prototype system on Spark to which verified algorithms can be ported. Experiments show that the approximate solution makes stand-alone algorithms applicable to large-scale time series while still obtaining accurate results. The parallel algorithms greatly decrease the processing time of large datasets, and their scaleup is no longer limited by the size of the input data. In addition, because subtasks are completely data-independent of one another, the speedup of the parallel algorithms increases approximately linearly with the degree of parallelism, which makes it easy to estimate the overall computation time for massive time series datasets in advance. As a cross-language and cross-platform approach, the prototype system readily integrates third-party libraries, so that users can avoid redeveloping existing algorithms and focus on data analysis.
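The three steps described in the abstract (division with redundancy, data-parallel computation, combination after discarding dirty parts) can be illustrated with a minimal Python sketch. This is not the LTSAF implementation or its cost model: the function names `divide`, `run_parallel` and `redundancy_ratio`, the symmetric margin parameter `halo`, and the use of a thread pool in place of Spark are all assumptions made for illustration, and a centered moving average stands in for an arbitrary third-party single-machine algorithm.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

# (read_start, read_end, keep_start, keep_end), all half-open index ranges
Task = Tuple[int, int, int, int]


def divide(n: int, seg_len: int, halo: int) -> List[Task]:
    """Split indices [0, n) into segments, each padded by `halo` redundant points."""
    tasks = []
    for keep_start in range(0, n, seg_len):
        keep_end = min(keep_start + seg_len, n)
        read_start = max(keep_start - halo, 0)  # redundancy on the left
        read_end = min(keep_end + halo, n)      # redundancy on the right
        tasks.append((read_start, read_end, keep_start, keep_end))
    return tasks


def centered_moving_average(xs: Sequence[float], radius: int) -> List[float]:
    """Hypothetical stand-in for a third-party single-machine algorithm."""
    out = []
    for i in range(len(xs)):
        window = xs[max(i - radius, 0):min(i + radius + 1, len(xs))]
        out.append(sum(window) / len(window))
    return out


def run_parallel(series: Sequence[float],
                 algo: Callable[[Sequence[float]], List[float]],
                 seg_len: int, halo: int, workers: int = 4) -> List[float]:
    """Divide, solve each subtask independently, then combine the clean parts."""
    def solve(task: Task) -> List[float]:
        read_start, read_end, keep_start, keep_end = task
        local = algo(series[read_start:read_end])
        # Discard results computed on the redundant margins (the "dirty" parts).
        return local[keep_start - read_start:keep_end - read_start]

    # A thread pool only mimics the structure; a cluster engine would run here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(solve, divide(len(series), seg_len, halo)))
    return [y for part in parts for y in part]


def redundancy_ratio(n: int, seg_len: int, halo: int) -> float:
    """Extra points read because of overlap, relative to the series length."""
    read = sum(r_end - r_start for r_start, r_end, _, _ in divide(n, seg_len, halo))
    return (read - n) / n
```

In this toy setting, choosing `halo` equal to the window radius makes the combined result match the single-machine result, while `halo=0` leaves errors at segment boundaries. Shorter segments give more parallel subtasks but a larger `redundancy_ratio`, which is the kind of time-space trade-off the abstract says the cost model balances.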
Authors
滕飞
黄齐川
李天瑞
王晨
田春华
TENG Fei; HUANG Qi-Chuan; LI Tian-Rui; WANG Chen; TIAN Chun-Hua (School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031; State Key Laboratory of Rail Transit Engineering Informatization, China Railway First Survey and Design Institute Group, Xi’an 710043; National Engineering Laboratory for Big Data Software, Tsinghua University, Beijing 100084)
Source
《计算机学报》
EI
CSCD
PKU Core Journal
2020, Issue 7, pp. 1279-1292 (14 pages)
Chinese Journal of Computers
Funding
National Key Research and Development Program of China (2018YFB1701502)
Sichuan Science and Technology Program (2019YJ0214)
Keywords
time series
algorithm parallelization
data parallel
approximate solution
divide and conquer
Spark