In Internet service fault management based on active probing, uncertainty and noise affect fault management. To reduce their impact, this paper analyzes the challenges of Internet service fault management. A bipartite Bayesian network is chosen to model the dependency relationship between faults and probes, a binary symmetric channel is chosen to model noise, and a service fault management approach using active probing is proposed for such an environment. The approach is composed of two phases: fault detection and fault diagnosis. In the first phase, we propose a greedy approximation probe selection algorithm (GAPSA), which selects a minimal set of probes while retaining a high probability of fault detection. In the second phase, we propose a fault diagnosis probe selection algorithm (FDPSA), which selects probes to obtain more system information based on the symptoms observed in the previous phase. To deal with the dynamic fault set caused by the fault recovery mechanism, we propose a hypothesis inference algorithm based on the fault persistent time statistic (FPTS). Simulation results demonstrate the validity and efficiency of our approach.
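The probe-selection idea in the first phase can be illustrated with a plain greedy set cover. This is only a sketch, not GAPSA itself: it ignores the Bayesian network and the noise model, and the `coverage` table, the probe names `p1`..`p4`, and the fault names `f1`..`f4` are hypothetical.

```python
def greedy_probe_selection(coverage, faults):
    """Pick probes greedily until every fault is covered by some probe.

    coverage: dict mapping a probe name to the set of faults it can detect
              (a hypothetical, noise-free dependency table).
    faults:   iterable of all faults that should be detectable.
    Returns (selected probes in pick order, faults no probe can detect).
    """
    uncovered = set(faults)
    selected = []
    while uncovered:
        # Choose the probe that detects the most still-uncovered faults;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda p: len(coverage[p] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            break  # the remaining faults are not detectable by any probe
        selected.append(best)
        uncovered -= gain
    return selected, uncovered


# Hypothetical dependency table: which probe passes through which fault.
coverage = {
    "p1": {"f1", "f2"},
    "p2": {"f2", "f3"},
    "p3": {"f3", "f4"},
    "p4": {"f1"},
}
selected, missed = greedy_probe_selection(coverage, {"f1", "f2", "f3", "f4"})
print(selected, missed)
```

Greedy selection does not guarantee the smallest possible probe set, which is consistent with the abstract calling its algorithm an approximation. A noise-aware version would score probes by detection probability rather than by a yes/no coverage count.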
In this paper, a method of intelligent fault-tolerant management of electromechanical equipment is presented. It is based on condition monitoring of the equipment and is realized through condition prediction and condition control. An example is introduced and analyzed.
Condition monitoring is increasingly used to anticipate and detect failures of industrial machines. Machine failures can cause high maintenance or replacement costs and, if neglected, may result in catastrophic accidents that reduce production. A potential failure negatively affects the profitability of the company through production shutdown, cost of spare parts, cost of labor, damage to reputation, and risk of injury to people and the environment. In recent years, condition-based maintenance (CBM) and prognostics and health management (PHM) have been developed and have formed strong connections among science, engineering, computing, reliability, communication, management, etc. Computerized maintenance management systems (CMMS) store a large amount of data on fault diagnosis and life prediction of machinery. It is necessary to uncover useful knowledge from this large volume of data, and it is vital to find ways to obtain useful and concise information from it. This information can strongly influence managers' decision making. This article is a review of intelligent approaches to machinery fault diagnosis and prediction based on PHM and CBM.
The difference-similitude matrix (DSM) is effective in reducing an information system, with a higher reduction rate and higher validity. We use the DSM method to analyze the fault data of computer networks and obtain fault diagnosis rules. By discretizing the relative values of the fault data, we obtain the information system of the fault data. The DSM method then reduces the information system and yields the diagnosis rules. A simulation with an actual scenario shows that fault diagnosis based on DSM can obtain a small number of effective rules. Key words: computer networks; data reduction; fault management; difference-similitude matrix. CLC number: TP 393. Foundation item: Supported by the National Natural Science Foundation of China (90204008). Biography: Jiang Hao (1976-), male, Ph.D. candidate, research direction: computer networks, data mining.
Fault management is crucial to providing quality of service guarantees for future networks, and fault identification is an essential part of it. A novel fault identification algorithm is proposed in this paper, which focuses on anomaly detection in network traffic. Since fault identification is achieved using statistical information in the management information base, the algorithm is compatible with the existing simple network management protocol framework. The network traffic time series is verified to be non-stationary. By fitting an adaptive autoregressive model, the series is transformed into a multidimensional vector. The training samples and identifiers are acquired from network simulation. After being trained, a k-nearest neighbor classifier identifies the system faults. The experimental results are consistent with the given fault scenarios, which supports the accuracy of the algorithm. The identification errors are discussed to illustrate that the novel fault identification algorithm is adaptive in fault scenarios with changing network traffic.
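The pipeline described above (fit an autoregressive model, use its coefficients as a feature vector, classify with k-nearest neighbors) can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: a batch least-squares AR(2) fit stands in for the adaptive autoregressive model, the traffic is synthetic AR(1) data, and the labels "normal" and "fault" and the coefficients 0.2 and 0.9 are hypothetical.

```python
import math
import random
from collections import Counter


def simulate_ar1(phi, n, rng):
    """Synthetic stand-in for a traffic series: x[t] = phi * x[t-1] + noise."""
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


def ar2_features(x):
    """Least-squares AR(2) fit; returns (phi1, phi2) as a 2-D feature vector."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for t in range(2, len(x)):
        a11 += x[t - 1] * x[t - 1]
        a12 += x[t - 1] * x[t - 2]
        a22 += x[t - 2] * x[t - 2]
        b1 += x[t - 1] * x[t]
        b2 += x[t - 2] * x[t]
    # Solve the 2x2 normal equations with Cramer's rule.
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


rng = random.Random(0)
train = []
for _ in range(5):
    train.append((ar2_features(simulate_ar1(0.2, 400, rng)), "normal"))
    train.append((ar2_features(simulate_ar1(0.9, 400, rng)), "fault"))

query = ar2_features(simulate_ar1(0.9, 400, rng))
prediction = knn_predict(train, query)
print(query, prediction)
```

Because the abstract states that the traffic series is non-stationary, a closer sketch would update the coefficients recursively over time instead of fitting them once per series.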
Fault detection in optical burst switching (OBS) networks will be a challenging task in the future. A novel mechanism based on the probe burst (PB), together with a new key concept, is proposed to detect faults in OBS networks by sampling the health of data channels. It overcomes the difficulty of optical monitoring schemes while keeping the data network transparent to Internet protocol (IP) packets. It takes full advantage of the characteristics of OBS, including its architecture and signalling scheme, and brings the excellent performance of the single-hop test used in electrical communication networks into the OBS environment, while avoiding the shortcoming that every optical burst must undergo an optical-electric-optical (OEO) conversion. A well-designed PB can provide an exact criterion for judging whether protection/restoration should be executed, according to hard or soft fault identification.
Fault tolerance (FT) schemes are intended to work on a minimized and static amount of physical resources. When a host failure occurs, conventional FT typically continues execution on the remaining working hosts. This methodology saves the execution state and allows applications to complete without disruption. However, the dynamic nature of open cloud resources is not considered when making scheduling decisions. Existing optimization techniques are designed for resource scheduling and are used to distribute incoming tasks to the VMs, but dynamic scheduling in this procedure does not achieve the objective of fault tolerance. The scheme assigns the jobs in the task list with the longest execution time to resources that can execute them in a shorter time, but it suffers from higher makespan, poor resource usage and load imbalance. To overcome these issues, a Fault Aware Dynamic Resource Manager (FADRM) is proposed, which extends the mechanism to a Multi-stage Resilience Manager (MRM) as an application-level FT arrangement. The proposed FADRM method provides FT through the MRM in the client and application layers, and simultaneously decreases overhead and degradation. It additionally safeguards application execution by considering client, application and framework requirements. In experimental evaluations, compared with existing methodologies, the proposed FADRM method achieves a makespan (MS) of 157.5, a fault rate (FR) of 0.38, a failure delay (FD) of 0.25 and a performance improvement ratio (PIR) of 5.5 for 25, 50, 75 and 100 tasks, and a makespan of 475, an FR of 0.40, an FD of 1.30 and a PIR of 6.75 for 100, 200, 300 and 500 tasks.
As system scale increases, the inherent reliability of supercomputers becomes lower and lower. The cost of fault handling and task recovery increases so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall", which is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, for the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers. With autonomous management, the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers is reduced from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
It is well established that wireless sensor networks (WSNs) are highly power-constrained networks; in addition, they are inherently more fault-prone than any other type of wireless network, and their protocol design is very application specific. Major reasons for the faults are the unpredictable wireless communication channel, battery depletion, and the fragility and mobility of the nodes. Furthermore, as traditional protocol design methods have proved inadequate, the cross-layer design (CLD) approach, which allows interactions between different layers and provides more flexible and energy-efficient functionality, has emerged as a viable solution for WSNs. In this study we define a fault tolerance management module suited to the requirements, limitations, and specifics of WSNs, encompassing methods for fault detection, fault prevention, fault management, and recovery. The suggested solution is in line with the CLD approach, which is an important factor in increasing network performance. Through simulations, the functionality of the network is evaluated based on packet loss, delay, and energy consumption, and is compared with a similar solution that does not include fault management. The results support the idea that introducing a unified approach to fault management improves the network performance as a whole.
The integrated modular avionics (IMA) architecture is an open standard in the avionics industry, in which the number of functionalities implemented by software is greater than ever before. In the IMA architecture, the reliability of the avionics system is highly affected by the software applications. To enhance fault tolerance with regard to software application failures, many industrial standards propose a layered health monitoring/fault management (HM/FM) scheme that periodically checks the health status of software application processes and recovers the malfunctioning software process whenever an error is located. In this paper, we make an analytical study of the HM/FM system for avionics application software. We use stochastic Petri nets (SPN) to build a formal model of each component and present a method to combine the components into a complete system model with respect to three interlayer query strategies. We further investigate the effectiveness of these strategies in an illustrative system.
Smart grid is the flag under which the US DoE has been mobilizing efforts to modernize the grid. Electronictization is the first step towards a smart modern grid. It is a process that transforms the grid from electrical and electromechanical (EE) to electronic, electrical and electromechanical (EEE), laying down the very basic foundation for the modern grid. All things grid connected (ATGC) has five groups of essential hardware: 1) grid interface (smart) inverters; 2) hardware for flexible AC transmissions; 3) intelligent electronic power transformers (grid scale); 4) solid-state circuit breakers, current limiters, smart fuses and sensors; and 5) multi-port bidirectional power & control units. Development and deployment of ATGC will be a grassroots drive to transform the grid from an old passive technology to a new active technology based on electronic power transmission, distribution, processing and protection. Grid modernization represents a win-win-win situation for the environment (Government), consumers, and grid owners/operators.
This paper presents a behavior analysis of the modular multilevel converter (MMC) under a DC pole-to-pole short-circuit fault, which is an important issue in fault management, electrical system design, and MMC-based power system protection and control. Firstly, the transient behavior is analyzed and the conduction overlapping angle γ is defined. Secondly, seven possible short-circuit current paths induced by different γ values are identified, and the corresponding engineering short-circuit current calculation methods for both the AC and DC sides are proposed. Then, the influences of the impedance distribution factor κ and the equivalent short-circuit resistance Rsc on the short-circuit currents are elaborated. Finally, a case study is used to verify the effectiveness of the proposed analysis methods.
Funding (Internet service fault management with active probing): the National Basic Research Program of China (973 Program) (Grant No. 2003CB314806), the National High-Tech Research & Development Program of China (863 Program) (Grant Nos. 2007AA12Z321 and 2007AA01Z206), and the National Natural Science Foundation of China (Grant Nos. 60603060, 60502037 and 90604019).
Funding (review of intelligent machinery fault diagnosis and prediction based on PHM and CBM): Fundamental Research Funds for the Central Universities, China (No. DUT17GF214).
Funding (probe-burst fault detection in OBS networks): Supported by the National Natural Science Foundation of China (No. 90304004), the High Technology Research and Development Program of China (No. 2005AA122310), and the Projects of the Science and Technology Council of Chongqing (2005BB2062, 2005AC2089).
Funding (Iaso autonomous fault-tolerant system for MilkyWay-2): This work was partially supported by the National High-tech R&D Program of China (863 Program) (2012AA01A301, 2012AA010901), by the Program for New Century Excellent Talents in University, and by the National Natural Science Foundation of China (Grant Nos. 61272142, 61103082, 61170261, and 61103193).
Funding (HM/FM analysis for integrated modular avionics): Supported by the National Grand Fundamental Research Program of China (Nos. 2010CB328105, 2009CB320504), the Tsinghua University Initiative Scientific Research Program, and the National Natural Science Foundation of China (Nos. 61070182, 60973107, 60973144, 61173008, 61070021).