6 articles found
1. Differentiating Data Collection for Cloud Environment Monitoring (cited by: 2)
Authors: MENG You, LUAN Zhongzhi, QIAN Depei. China Communications (SCIE, CSCD), 2014, No. 4, pp. 13-24 (12 pages)
In a growing number of information processing applications, data takes the form of continuous data streams rather than traditional stored databases. Monitoring systems that seek to provide monitoring services in a cloud environment must be prepared to deal gracefully with huge data collections without compromising system performance. In this paper, we show that by using the concept of urgent data, our system can shorten the response time for most 'urgent' queries while guaranteeing lower bandwidth consumption. We argue that monitoring data can be treated differently. Some data capture critical system events; the arrival of these data, which we call urgent data, significantly influences the monitoring reaction speed. High-speed collection of urgent data can help the system react in real time when facing fatal errors. A cloud environment in production, MagicCube, is used as a test bed. Extensive experiments over both real-world and synthetic traces show that, when using urgent data, the monitoring system can lower the response latency compared with existing monitoring approaches.
Keywords: cloud computing; cloud monitoring; urgent data; rule engine; CONSTRAINT
2. Software approaches for resilience of high performance computing systems: a survey
Authors: Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2023, No. 4, pp. 43-56 (14 pages)
With the scaling up of high-performance computing systems in recent years, their reliability has been declining continuously. Therefore, system resilience has been regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. Firstly, a classification of software resilience approaches is presented; then we introduce major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, challenges exposed by system scale and heterogeneous architectures are also discussed.
Keywords: RESILIENCE; high-performance computing; fault tolerance; CHALLENGE
3. Congestion avoidance, detection and alleviation in wireless sensor networks (cited by: 1)
Authors: Wei-wei FANG, Ji-ming CHEN, Lei SHU, Tian-shu CHU, De-pei QIAN. Journal of Zhejiang University-Science C (Computers and Electronics) (SCIE, EI), 2010, No. 1, pp. 63-73 (11 pages)
Congestion in wireless sensor networks (WSNs) not only causes severe information loss but also leads to excessive energy consumption. To address this problem, a novel scheme for congestion avoidance, detection and alleviation (CADA) in WSNs is proposed in this paper. By exploiting data characteristics, a small number of representative nodes are chosen from those in the event area as data sources, so that the source traffic can be suppressed proactively to avoid potential congestion. When congestion nevertheless occurs because traffic flows merge, it is detected in a timely way by the hotspot node based on a combination of buffer occupancy and channel utilization. Congestion is then alleviated reactively by either dynamic traffic multiplexing or source rate regulation, depending on the specific hotspot scenario. Extensive simulation results under typical congestion scenarios are presented to demonstrate the performance of the proposed scheme.
Keywords: Wireless sensor network (WSN); Congestion control; CORRELATION; Traffic multiplexing; Rate regulation
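The abstract above says the hotspot node detects congestion from a combination of buffer occupancy and channel utilization, but it does not say how the two are combined. The following is only a minimal sketch of one plausible reading, not the CADA scheme itself: the function name, its parameters, the AND rule, and both threshold values are assumptions made for illustration.

```python
def is_congested(queue_len: int, queue_capacity: int,
                 busy_time: float, window_time: float,
                 buffer_threshold: float = 0.8,
                 channel_threshold: float = 0.7) -> bool:
    """Hypothetical hotspot check; thresholds and the AND rule are assumptions."""
    if queue_capacity <= 0 or window_time <= 0:
        raise ValueError("queue_capacity and window_time must be positive")
    # Fraction of the node's packet buffer currently in use.
    buffer_occupancy = queue_len / queue_capacity
    # Fraction of the sampling window during which the channel was sensed busy.
    channel_utilization = busy_time / window_time
    # Assumed rule: report congestion only when both signals are high.
    return (buffer_occupancy >= buffer_threshold
            and channel_utilization >= channel_threshold)


# A nearly full buffer on a busy channel is flagged; either signal alone is not.
print(is_congested(9, 10, 0.8, 1.0))
print(is_congested(9, 10, 0.2, 1.0))
```

Requiring both signals avoids flagging a short burst that fills the buffer while the channel is still mostly idle; a different combination rule would be equally consistent with the abstract.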
4. Adaptive watermark generation mechanism based on time series prediction for stream processing (cited by: 1)
Authors: Yang SONG, Yunchun LI, Hailong YANG, Jun XU, Zerong LUAN, Wei LI. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, No. 6, pp. 59-73 (15 pages)
A data stream processing framework processes stream data based on event time to ensure that requests can be answered in real time. In reality, streaming data usually arrives out of order due to factors such as network delay. Data stream processing frameworks commonly adopt the watermark mechanism to handle this disorder. A watermark is a special kind of record, carrying a timestamp, that is inserted into the data stream; it helps the framework decide whether a received record is late and should therefore be discarded. Traditional watermark generation strategies are periodic; they cannot dynamically adjust the watermark distribution to balance responsiveness and accuracy. This paper proposes an adaptive watermark generation mechanism based on a time series prediction model to address this limitation. The mechanism dynamically adjusts the frequency and timing of watermark distribution using the disordered data ratio and other lateness properties of the data stream, improving system responsiveness while ensuring acceptable result accuracy. We implement the proposed mechanism on top of Flink and evaluate it with real-world datasets. The experiment results show that our mechanism is superior to existing watermark distribution strategies in terms of both system responsiveness and result accuracy.
Keywords: data stream processing; WATERMARK; time series based prediction; dynamic adjustment
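To make the watermark idea in the abstract above concrete, here is a minimal sketch of a fixed-delay watermark, the kind of static baseline the paper aims to improve on. It is neither the adaptive mechanism proposed in the paper nor Flink's API: the class `FixedDelayWatermarker`, the `Event` type, and the `max_delay` parameter are hypothetical names, and the lateness rule (a record is late if its event time is at or below the current watermark) is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_time: int  # timestamp assigned where the record was produced
    payload: str


class FixedDelayWatermarker:
    """Watermark = largest event time seen so far minus a fixed allowed delay."""

    def __init__(self, max_delay: int):
        self.max_delay = max_delay
        self.max_event_time = None  # no record seen yet

    def current_watermark(self) -> float:
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_delay

    def process(self, event: Event) -> bool:
        """Return True if the record is on time, False if it is late."""
        if event.event_time <= self.current_watermark():
            return False  # late: a real framework would drop or side-output it
        if self.max_event_time is None or event.event_time > self.max_event_time:
            self.max_event_time = event.event_time  # advances the watermark
        return True


w = FixedDelayWatermarker(max_delay=5)
arrivals = [10, 12, 8, 20, 14, 16]  # event times in arrival order
print([w.process(Event(t, "x")) for t in arrivals])
# → [True, True, True, True, False, True]
```

With a delay of 5, the record with event time 14 arrives after the watermark has advanced to 15 and is treated as late. A larger `max_delay` would accept it at the cost of slower results, which is the responsiveness versus accuracy trade-off the abstract describes.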
5. Accelerating the cryo-EM structure determination in RELION on GPU cluster
Authors: Xin YOU, Hailong YANG, Zhongzhi LUAN, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2022, No. 3, pp. 21-39 (19 pages)
Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization) implements a Bayesian algorithm for cryo-EM structure determination and is one of the most widely used software packages in this field. Many researchers have devoted effort to improving the performance of RELION to keep up with the ever-increasing volume of datasets. In this paper, we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for specific optimizations. We propose several performance optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, acceleration of the computation of symmetries, and memory affinity optimization. The experiment results show that our proposed optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline model analysis to understand the effectiveness of our optimizations.
Keywords: cryo-EM structure determination; performance optimization; GPU acceleration; RELION
6. User-level failure detection and auto-recovery of parallel programs in HPC systems
Authors: Guozhen ZHANG, Yi LIU, Hailong YANG, Jun XU, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, No. 6, pp. 31-42 (12 pages)
As the mean time between failures (MTBF) continues to decline with the increasing number of components in large-scale high performance computing (HPC) systems, program failures are highly likely to occur during execution. Ensuring successful execution of HPC programs has become an issue that unprivileged users must be concerned with. From the user's perspective, if a program failure cannot be detected and handled in time, it wastes resources and delays the progress of program execution. Unfortunately, unprivileged users are unable to check program state, because execution is controlled by the job management system and their privileges are limited. Currently, automated tools supporting user-level failure detection and auto-recovery of parallel programs in HPC systems are missing. This paper proposes an innovative method that lets the unprivileged user detect failures of job execution and automatically resubmit failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experiment results show that ARL can detect execution failures effectively on the Tianhe-2 system. In addition, the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.
Keywords: high performance computing; parallel program; failure detection; failure auto-recovery