Dynamically reconfigurable Field Programmable Gate Array(dr-FPGA) based electronic systems on board mission-critical systems are highly susceptible to radiation induced hazards that may lead to faults in the logic or ...Dynamically reconfigurable Field Programmable Gate Array(dr-FPGA) based electronic systems on board mission-critical systems are highly susceptible to radiation induced hazards that may lead to faults in the logic or in the configuration memory. The aim of our research is to characterize self-test and repair processes in Fault Tolerant(FT) dr-FPGA systems in the presence of environmental faults and explore their interrelationships. We develop a Continuous Time Markov Chain(CTMC) model that captures the high level fail-repair processes on a dr-FPGA with periodic online Built-In Self-Test(BIST) and scrubbing to detect and repair faults with minimum latency. Simulation results reveal that given an average fault interval of 36 s, an optimum self-test interval of 48.3 s drives the system to spend 13% of its time in self-tests, remain in safe working states for 76% of its time and face risky fault-prone states for only 7% of its time. Further, we demonstrate that a well-tuned repair strategy boosts overall system availability, minimizes the occurrence of unsafe states, and accommodates a larger range of fault rates within which the system availability remains stable within 10% of its maximum level.展开更多
文摘Dynamically reconfigurable Field Programmable Gate Array(dr-FPGA) based electronic systems on board mission-critical systems are highly susceptible to radiation induced hazards that may lead to faults in the logic or in the configuration memory. The aim of our research is to characterize self-test and repair processes in Fault Tolerant(FT) dr-FPGA systems in the presence of environmental faults and explore their interrelationships. We develop a Continuous Time Markov Chain(CTMC) model that captures the high level fail-repair processes on a dr-FPGA with periodic online Built-In Self-Test(BIST) and scrubbing to detect and repair faults with minimum latency. Simulation results reveal that given an average fault interval of 36 s, an optimum self-test interval of 48.3 s drives the system to spend 13% of its time in self-tests, remain in safe working states for 76% of its time and face risky fault-prone states for only 7% of its time. Further, we demonstrate that a well-tuned repair strategy boosts overall system availability, minimizes the occurrence of unsafe states, and accommodates a larger range of fault rates within which the system availability remains stable within 10% of its maximum level.