Abstract
As high energy physics experiments such as the Large High Altitude Air Shower Observatory (LHAASO) grow in scale, petabytes of physics data must be collected, stored, and analyzed every year. Mass storage systems in high energy physics generally use a random data placement strategy, which does not account for data access scenarios or for differences among server nodes and storage devices. To address this, a data placement strategy based on the random forest algorithm is proposed for heterogeneous storage environments. Storage devices are divided into a fast pool and a normal (slow) pool according to their performance. The algorithm predicts and identifies the read/write access scenario a new file will later experience and, taking the current load of the target devices into account, selects the best placement for the file. The effectiveness of the algorithm is verified with data samples collected from the production storage system of the LHAASO experiment.
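The placement step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Device` fields, the pool labels `"fast"` and `"normal"`, the scenario names, the load metric, and the `stub_predictor` function are all hypothetical. In particular, `stub_predictor` is only a stand-in for a trained random forest classifier, which the sketch does not implement.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Device:
    name: str
    pool: str    # "fast" or "normal" (hypothetical pool labels)
    load: float  # current utilisation in [0, 1]; the metric is an assumption


def place_file(
    features: Dict[str, float],
    predict_scenario: Callable[[Dict[str, float]], str],
    scenario_to_pool: Dict[str, str],
    devices: List[Device],
) -> Device:
    """Return the least-loaded device in the pool matching the predicted scenario."""
    scenario = predict_scenario(features)
    pool = scenario_to_pool[scenario]
    candidates = [d for d in devices if d.pool == pool]
    if not candidates:
        # Assumed fallback: consider every device if the target pool is empty.
        candidates = devices
    return min(candidates, key=lambda d: d.load)


def stub_predictor(features: Dict[str, float]) -> str:
    # Stand-in for a trained random forest classifier (not implemented here).
    # The feature name and threshold are invented for illustration.
    return "read_intensive" if features["expected_reads"] > 10 else "write_once"


devices = [
    Device("ssd-1", "fast", 0.7),
    Device("ssd-2", "fast", 0.3),
    Device("hdd-1", "normal", 0.1),
]
mapping = {"read_intensive": "fast", "write_once": "normal"}

chosen = place_file({"expected_reads": 50}, stub_predictor, mapping, devices)
print(chosen.name)
```

Passing the predictor in as a callable keeps the pool-selection and load-balancing logic independent of how the access scenario is predicted, so a real classifier could be substituted without changing `place_file`.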
Authors
程振京
程耀东
陈刚
汪璐
李海波
胡庆宝
CHENG Zhenjing; CHENG Yaodong; CHEN Gang; WANG Lu; LI Haibo; HU Qingbao (Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China; University of Chinese Academy of Sciences, Beijing 100049, China; Tianfu Cosmic Ray Research Center, Institute of High Energy Physics, Chinese Academy of Sciences, Chengdu 610041, China)
Source
《计算机工程与应用》
CSCD
Peking University Core Journals
2020, Issue 21, pp. 60-64 (5 pages)
Computer Engineering and Applications
Funding
National Natural Science Foundation of China (Nos. 11675201, 11575223, 11605223, 11805226).
Keywords
random forest
distributed storage system
heterogeneous storage
storage pool
data placement strategy
access scenario