
Performance optimization strategy of distributed storage for industrial time series big data based on HBase

Cited by: 8
Abstract: In automated industrial scenarios, the volume of time series log data generated by large numbers of industrial devices is growing explosively, and business scenarios place increasing demands on access to that data. Although HBase, a distributed column-family database, can store industrial time series big data, existing strategies do not consider the correlation between data and access behavior characteristics in specific business scenarios, and so cannot satisfy the specific access requirements of industrial time series data well. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data is proposed; it is built on the distributed storage system HBase and exploits the correlation between data and access behavior characteristics in industrial scenarios. For the load skew caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot and cold data partitioning and access behavior classification is proposed: a Logistic Regression (LR) model classifies the data as hot or cold, and the hot data are distributed across different nodes. In addition, to further reduce cross-node communication overhead in the storage cluster and improve the query efficiency of the high-dimensional index of industrial time series data, a strategy of placing each index in the same Region as its main data is proposed: the index RowKey fields and concatenation rules are designed so that each index entry is stored in the same Region as its corresponding main data. Experimental results on real industrial time series data show that, after the optimization strategy is introduced, the skew of the data load distribution is reduced by 28.5% and query efficiency is improved by 27.7%. These results demonstrate that the proposed strategy can effectively mine the access patterns of specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.
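The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the LR features, the hot-data placement rule, or the actual RowKey fields, so everything named here is an assumption made for illustration. The region start keys, the feature weights, the `\x00idx\x00` separator, and the use of a CRC32 bucket prefix to spread hot rows are all hypothetical. The sketch only shows (1) a logistic score for hot/cold classification with a bucket prefix for hot rows, and (2) an index RowKey prefixed with the start key of the main row's Region, so that both keys sort into the same key range.

```python
import bisect
import math
import zlib

# Hypothetical, sorted Region start keys; Region i covers [starts[i], starts[i+1]).
REGION_STARTS = [b"", b"dev2000", b"dev4000", b"dev6000"]

def hot_probability(features, weights, bias):
    """Logistic regression score: probability that a row is 'hot'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def salted_key(row_key: bytes, n_buckets: int) -> bytes:
    """Assumed placement rule: prefix a hot row with a hash bucket to spread it out."""
    bucket = zlib.crc32(row_key) % n_buckets
    return b"%02d|" % bucket + row_key

def region_start_for(row_key: bytes, starts=REGION_STARTS) -> bytes:
    """Return the start key of the Region whose key range contains row_key."""
    return starts[bisect.bisect_right(starts, row_key) - 1]

def index_row_key(main_key: bytes, column: bytes, value: bytes,
                  starts=REGION_STARTS) -> bytes:
    """Build an index key that sorts into the same Region as main_key.

    Assumes no Region boundary lies between the start key and
    start key + b"\\x00idx\\x00"; a real design would have to guarantee this.
    """
    prefix = region_start_for(main_key, starts)
    return prefix + b"\x00idx\x00" + column + b"\x00" + value + b"\x00" + main_key

main_key = b"dev2500|20230301T120000"
idx_key = index_row_key(main_key, b"temp", b"0075")
same_region = region_start_for(idx_key) == region_start_for(main_key)
```

With the index entry in the same key range as its main row, a query that first reads the index and then fetches the main row can be served by one Region, which is the cross-node saving the abstract describes.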
Authors: YANG Li (杨力); CHEN Jianting (陈建廷); XIANG Yang (向阳) (College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China)
Source: Journal of Computer Applications (《计算机应用》), CSCD, PKU Core; 2023, No. 3, pp. 759-766 (8 pages)
Funding: National Key Research and Development Program of China (2019YFB1704402).
Keywords: distributed storage; time series big data; industrial big data; load balancing; HBase