大规模时间序列分析框架的研究与实现 (Research and Implementation of a Large-Scale Time Series Analysis Framework). Cited by: 9

An Analysis Framework for Large-Scale Time Series


Abstract: In the era of the Industrial Internet, hundreds of millions of sensors continuously generate time series every day, recording parameters of industrial equipment such as temperature, vibration, pressure, curvature, and tension. How to mine valuable information from these unstructured time series and apply it to condition monitoring, fault diagnosis, and control decisions has attracted wide attention and research. As data volumes keep growing, mainstream single-machine environments that offer fairly complete libraries of analysis algorithms, such as Matlab and R, can no longer cope well with the data-processing demands of large-scale time series analysis. Existing parallel analysis algorithms, meanwhile, are few in number and usually bound to a specific platform; moving to another platform requires redeveloping the algorithm, so extensibility is poor. This paper aims to design a general-purpose approximate-solution analysis framework that lets third-party algorithms be parallelized quickly, solving the applicability problem caused by excessive data size. The framework consists of three steps: task division, conquering, and combination. Task division preserves the local correlation of the data through redundancy and generates mutually independent subtasks, reducing data communication and synchronization overhead between distributed nodes. For the task division problem, the paper proposes an approximate-solution cost model and derives the optimal division scheme. A prototype system was designed and implemented on the Spark platform. Experimental results show that, while preserving the accuracy of the analysis results, the system's speedup grows approximately linearly with the degree of parallelism, removing the data-size limit of single-machine algorithms. The system is also easy to integrate and extend, sparing data analysts from redeveloping algorithms.

In the era of the Industrial Internet, an explosive volume of time series is continuously generated by various sensors measuring quantities such as temperature, vibration, pressure, inclination, and strain. It is crucial to analyze these unstructured time series to extract valuable information for state monitoring, fault diagnosis, and control decisions. In specialized fields, abundant data mining and analysis tools are available for extracting potentially useful knowledge from time series. For example, R and Matlab contain a large number of algorithms for matrix manipulation. However, these single-machine environments become ineffective, or fail outright, when the volume and velocity of time series are extremely large. Although several commercial and open-source software packages for distributed computing already exist, most of them provide only a limited number of parallel algorithms, and these parallel algorithms are platform-dependent. The same algorithm has to be developed repeatedly on different platforms; that is, it is difficult to carry a parallel algorithm from its original platform over to other distributed platforms. This paper studies a Large-scale Time Series Analysis Framework (LTSAF) as a general-purpose tool for quickly parallelizing third-party algorithms, so that existing algorithms can analyze massive time series data efficiently. Based on the idea of divide and conquer, LTSAF produces an approximate solution when the exact solution cannot be obtained within a feasible period. The solution includes three steps: division, data-parallel computation, and combination. The analysis task is first divided into a number of subtasks, each containing a segment of the original time series plus the redundancy needed to preserve data locality. Division makes the subtasks data-independent, so the synchronization overhead of intermediate results is greatly reduced. The independent subtasks are small enough to be solved directly by stand-alone algorithms. The solutions of the subtasks are finally combined, after removing the dirty parts, into an approximate solution to the original problem. On the theoretical side, a time-space cost optimization model is established to determine how to divide the subtasks efficiently, and the optimal segment length is derived to balance time and space efficiency. The paper also develops a Spark-based prototype system to which verified algorithms can be ported. Experiments show that the approximate solution makes stand-alone algorithms applicable to large-scale time series while still yielding accurate results. The parallel algorithms greatly decrease the processing time of large datasets, and their scaleup is no longer limited by the size of the input data. In addition, because the subtasks are completely data-independent of one another, the speedup of the parallel algorithms increases approximately linearly with the degree of parallelism, which makes it easy to estimate in advance the overall computation time for massive time series datasets. As a cross-language and cross-platform approach, the prototype system integrates third-party libraries easily, so users can avoid redeveloping existing algorithms and focus on data analysis.
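The three steps described in the abstract (division with redundancy, independent computation, and combination after discarding the dirty parts) can be illustrated with a small sketch. This is not the paper's implementation: the centered moving average stands in for an arbitrary stand-alone third-party algorithm, a thread pool stands in for Spark executors, and the names `moving_average`, `divide`, `conquer`, `analyze`, `segment_len`, `overlap`, and `half_window` are all hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def moving_average(xs, half_window):
    """Stand-in for a stand-alone algorithm: centered moving average.

    Points near either end of the input see a truncated window, so their
    values depend on where the input was cut.
    """
    n = len(xs)
    out = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out


def divide(xs, segment_len, overlap):
    """Split xs into independent subtasks.

    Each subtask holds one core segment plus up to `overlap` redundant
    points on each side, so local context survives the cut.
    """
    tasks = []
    for start in range(0, len(xs), segment_len):
        end = min(start + segment_len, len(xs))
        lo = max(0, start - overlap)
        hi = min(len(xs), end + overlap)
        # (data with redundancy, size of left padding, core length)
        tasks.append((xs[lo:hi], start - lo, end - start))
    return tasks


def conquer(task, half_window):
    """Run the stand-alone algorithm on one subtask and keep only the core."""
    data, left_pad, core_len = task
    result = moving_average(data, half_window)
    # Discard results computed on the redundant ("dirty") parts.
    return result[left_pad:left_pad + core_len]


def analyze(xs, segment_len, overlap, half_window, workers=4):
    """Divide, process subtasks in parallel, then concatenate the cores."""
    tasks = divide(xs, segment_len, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda t: conquer(t, half_window), tasks))
    return [value for part in parts for value in part]
```

For this particular stand-in, choosing `overlap` at least as large as `half_window` makes the combined result identical to running `moving_average` on the whole series, while `overlap=0` produces errors at every segment boundary. Algorithms with longer-range dependencies would generally yield only an approximation. The redundant fraction of each interior subtask is roughly `2 * overlap / segment_len`, which hints at the time-space trade-off the abstract's cost model is said to balance; the model itself is not reproduced here.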

Authors: TENG Fei (滕飞); HUANG Qi-Chuan (黄齐川); LI Tian-Rui (李天瑞); WANG Chen (王晨); TIAN Chun-Hua (田春华) (School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031; State Key Laboratory of Rail Transit Engineering Informatization, China Railway First Survey and Design Institute Group, Xi'an 710043; National Engineering Laboratory for Big Data Software, Tsinghua University, Beijing 100084)

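Affiliations: School of Information Science and Technology, Southwest Jiaotong University; State Key Laboratory of Rail Transit Engineering Informatization, China Railway First Survey and Design Institute Group; National Engineering Laboratory for Big Data Software, Tsinghua University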

Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2020, No. 7, pp. 1279-1292 (14 pages)

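Funding: Supported by the National Key Research and Development Program of China (2018YFB1701502) and the Sichuan Science and Technology Program (2019YJ0214).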

Keywords: time series; algorithm parallelization (data parallel); approximate solution; divide and conquer; Spark

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

