In a growing number of information processing applications,data takes the form of continuous data streams rather than traditional stored databases.Monitoring systems that seek to provide monitoring services in cloud e...In a growing number of information processing applications,data takes the form of continuous data streams rather than traditional stored databases.Monitoring systems that seek to provide monitoring services in cloud environment must be prepared to deal gracefully with huge data collections without compromising system performance.In this paper,we show that by using a concept of urgent data,our system can shorten the response time for most 'urgent' queries while guarantee lower bandwidth consumption.We argue that monitoring data can be treated differently.Some data capture critical system events;the arrival of these data will significantly influence the monitoring reaction speed which is called urgent data.High speed urgent data collections can help system to react in real time when facing fatal errors.A cloud environment in production,MagicCube,is used as a test bed.Extensive experiments over both real world and synthetic traces show that when using urgent data,monitoring system can lower the response latency compared with existing monitoring approaches.展开更多
With the scaling up of high-performance computing systems in recent years,their reliability has been descending continuously.Therefore,system resilience has been regarded as one of the critical challenges for large-sc...With the scaling up of high-performance computing systems in recent years,their reliability has been descending continuously.Therefore,system resilience has been regarded as one of the critical challenges for large-scale HPC systems.Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs.This paper provides a comprehensive survey of existing software resilience approaches.Firstly,a classification of software resilience approaches is presented;then we introduce major approaches and techniques,including checkpointing,replication,soft error resilience,algorithmbased fault tolerance,fault detection and prediction.In addition,challenges exposed by system-scale and heterogeneous architecture are also discussed.展开更多
Congestion in wireless sensor networks (WSNs) not only causes severe information loss but also leads to excessive energy consumption. To address this problem, a novel scheme for congestion avoidance, detection and all...Congestion in wireless sensor networks (WSNs) not only causes severe information loss but also leads to excessive energy consumption. To address this problem, a novel scheme for congestion avoidance, detection and alleviation (CADA) in WSNs is proposed in this paper. By exploiting data characteristics, a small number of representative nodes are chosen from those in the event area as data sources, so that the source traffic can be suppressed proactively to avoid potential congestion. Once congestion occurs inevitably due to traffic mergence, it will be detected in a timely way by the hotspot node based on a combination of buffer occupancy and channel utilization. Congestion is then alleviated reactively by either dynamic traffic multiplexing or source rate regulation in accordance with the specific hotspot scenarios. Extensive simulation results under typical congestion scenarios are presented to illuminate the distinguished performance of the proposed scheme.展开更多
The data stream processing framework processes the stream data based on event-time to ensure that the request can be responded to in real-time.In reality,streaming data usually arrives out-of-order due to factors such...The data stream processing framework processes the stream data based on event-time to ensure that the request can be responded to in real-time.In reality,streaming data usually arrives out-of-order due to factors such as network delay.The data stream processing framework commonly adopts the watermark mechanism to address the data disorderedness.Watermark is a special kind of data inserted into the data stream with a timestamp,which helps the framework to decide whether the data received is late and thus be discarded.Traditional watermark generation strategies are periodic;they cannot dynamically adjust the watermark distribution to balance the responsiveness and accuracy.This paper proposes an adaptive watermark generation mechanism based on the time series prediction model to address the above limitation.This mechanism dynamically adjusts the frequency and timing of watermark distribution using the disordered data ratio and other lateness properties of the data stream to improve the system responsiveness while ensuring acceptable result accuracy.We implement the proposed mechanism on top of Flink and evaluate it with realworld datasets.The experiment results show that our mechanism is superior to the existing watermark distribution strategies in terms of both system responsiveness and result accuracy.展开更多
The cryo-electron microscopy(cryo-EM)is one of the most powerful technologies available today for structural biology.The RELION(Regularized Likelihood Optimization)implements a Bayesian algorithm for cryo-EM structure...The cryo-electron microscopy(cryo-EM)is one of the most powerful technologies available today for structural biology.The RELION(Regularized Likelihood Optimization)implements a Bayesian algorithm for cryo-EM structure determination,which is one of the most widely used software in this field.Many researchers have devoted effort to improve the performance of RELION to satisfy the analysis for the ever-increasing volume of datasets.In this paper,we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for specific optimizations.We propose several performance optimization strategies to improve the overall performance of RELION,including optimization of expectation step,parallelization of maximization step,accelerating the computation of symmetries,and memory affinity optimization.The experiment results show that our proposed optimizations achieve significant speedups of RELION across representative datasets.In addition,we perform roofline model analysis to understand the effectiveness of our optimizations.展开更多
As the mean-time-between-failures(MTBF)continues to decline with the increasing number of components on large-scale high performance computing(HPC)systems,program failures might occur during the execution period with ...As the mean-time-between-failures(MTBF)continues to decline with the increasing number of components on large-scale high performance computing(HPC)systems,program failures might occur during the execution period with high probability.Ensuring successful execution of the HPC programs has become an issue that the unprivileged users should be concerned.From the user perspective,if the program failure cannot be detected and handled in time,it would waste resources and delay the progress of program execution.Unfortunately,the unprivileged users are unable to perform program state checking due to execution control by the job management system as well as the limited privilege.Currently,automated tools for supporting user-level failure detection and autorecovery of parallel programs in HPC systems are missing.This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs.The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs.In addition,we propose a dual-checker mechanism to improve the robustness of our approach.We implement the proposed method as a tool named automatic re-launcher(ARL)and evaluate it on the Tianhe-2 system.Experiment results show that ARL can detect the execution failures effectively on Tianhe-2 system.In addition,the communication and performance overhead caused by ARL is negligible.The good scalability of ARL makes it applicable for large-scale HPC systems.展开更多
基金supported by the National Key Technology R&D Program(Grant NO. 2012BAH17F01)NSFC-NSF International Cooperation Project(Grant NO. 61361126011)
文摘In a growing number of information processing applications,data takes the form of continuous data streams rather than traditional stored databases.Monitoring systems that seek to provide monitoring services in cloud environment must be prepared to deal gracefully with huge data collections without compromising system performance.In this paper,we show that by using a concept of urgent data,our system can shorten the response time for most 'urgent' queries while guarantee lower bandwidth consumption.We argue that monitoring data can be treated differently.Some data capture critical system events;the arrival of these data will significantly influence the monitoring reaction speed which is called urgent data.High speed urgent data collections can help system to react in real time when facing fatal errors.A cloud environment in production,MagicCube,is used as a test bed.Extensive experiments over both real world and synthetic traces show that when using urgent data,monitoring system can lower the response latency compared with existing monitoring approaches.
基金supported by the GHFund A(No.ghfund202107010337).
文摘With the scaling up of high-performance computing systems in recent years,their reliability has been descending continuously.Therefore,system resilience has been regarded as one of the critical challenges for large-scale HPC systems.Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs.This paper provides a comprehensive survey of existing software resilience approaches.Firstly,a classification of software resilience approaches is presented;then we introduce major approaches and techniques,including checkpointing,replication,soft error resilience,algorithmbased fault tolerance,fault detection and prediction.In addition,challenges exposed by system-scale and heterogeneous architecture are also discussed.
基金Project supported by the National Natural Science Foundation of China (Nos. 60673180, 90412011 and 90612004)the International Science and Technology Cooperative Program of China (No. 2006DFA11080)+1 种基金the Research Program of Federal Ministry of Education and Research of Germany (No. 01BU0680)the Lion Project of Science Foundation of Ireland to Lei Shu (No. SFI/08/CE/ I1380)
文摘Congestion in wireless sensor networks (WSNs) not only causes severe information loss but also leads to excessive energy consumption. To address this problem, a novel scheme for congestion avoidance, detection and alleviation (CADA) in WSNs is proposed in this paper. By exploiting data characteristics, a small number of representative nodes are chosen from those in the event area as data sources, so that the source traffic can be suppressed proactively to avoid potential congestion. Once congestion occurs inevitably due to traffic mergence, it will be detected in a timely way by the hotspot node based on a combination of buffer occupancy and channel utilization. Congestion is then alleviated reactively by either dynamic traffic multiplexing or source rate regulation in accordance with the specific hotspot scenarios. Extensive simulation results under typical congestion scenarios are presented to illuminate the distinguished performance of the proposed scheme.
基金This work was supported by National Key Research and Development Program of China(2020YFB1506703)the National Natural Science Foundation of China(Grant No.62072018).
文摘The data stream processing framework processes the stream data based on event-time to ensure that the request can be responded to in real-time.In reality,streaming data usually arrives out-of-order due to factors such as network delay.The data stream processing framework commonly adopts the watermark mechanism to address the data disorderedness.Watermark is a special kind of data inserted into the data stream with a timestamp,which helps the framework to decide whether the data received is late and thus be discarded.Traditional watermark generation strategies are periodic;they cannot dynamically adjust the watermark distribution to balance the responsiveness and accuracy.This paper proposes an adaptive watermark generation mechanism based on the time series prediction model to address the above limitation.This mechanism dynamically adjusts the frequency and timing of watermark distribution using the disordered data ratio and other lateness properties of the data stream to improve the system responsiveness while ensuring acceptable result accuracy.We implement the proposed mechanism on top of Flink and evaluate it with realworld datasets.The experiment results show that our mechanism is superior to the existing watermark distribution strategies in terms of both system responsiveness and result accuracy.
基金the National Key R&D Program of China(2020YFB1506703)the National Natural Science Foundation of China(Grant No.62072018)the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing(2019A12).
文摘The cryo-electron microscopy(cryo-EM)is one of the most powerful technologies available today for structural biology.The RELION(Regularized Likelihood Optimization)implements a Bayesian algorithm for cryo-EM structure determination,which is one of the most widely used software in this field.Many researchers have devoted effort to improve the performance of RELION to satisfy the analysis for the ever-increasing volume of datasets.In this paper,we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for specific optimizations.We propose several performance optimization strategies to improve the overall performance of RELION,including optimization of expectation step,parallelization of maximization step,accelerating the computation of symmetries,and memory affinity optimization.The experiment results show that our proposed optimizations achieve significant speedups of RELION across representative datasets.In addition,we perform roofline model analysis to understand the effectiveness of our optimizations.
基金This work was supported by National Key R&D Program of China(2020YFB150001)the National Natural Science Foundation of China(Grant No.62072018).
文摘As the mean-time-between-failures(MTBF)continues to decline with the increasing number of components on large-scale high performance computing(HPC)systems,program failures might occur during the execution period with high probability.Ensuring successful execution of the HPC programs has become an issue that the unprivileged users should be concerned.From the user perspective,if the program failure cannot be detected and handled in time,it would waste resources and delay the progress of program execution.Unfortunately,the unprivileged users are unable to perform program state checking due to execution control by the job management system as well as the limited privilege.Currently,automated tools for supporting user-level failure detection and autorecovery of parallel programs in HPC systems are missing.This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs.The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs.In addition,we propose a dual-checker mechanism to improve the robustness of our approach.We implement the proposed method as a tool named automatic re-launcher(ARL)and evaluate it on the Tianhe-2 system.Experiment results show that ARL can detect the execution failures effectively on Tianhe-2 system.In addition,the communication and performance overhead caused by ARL is negligible.The good scalability of ARL makes it applicable for large-scale HPC systems.