In shared-memory bus-based multiprocessors, when the number of processors grows, the processors spend an increasing amount of time waiting for access to the bus (and shared memory). This contention reduces the perform...In shared-memory bus-based multiprocessors, when the number of processors grows, the processors spend an increasing amount of time waiting for access to the bus (and shared memory). This contention reduces the performance of processors and imposes a limitation of the number of processors that can be used efficiently in bus-based systems. Since the multi-processor’s performance depends upon many parameters which affect the performance in different ways, timed Petri nets are used to model shared-memory bus-based multiprocessors at the instruction execution level, and the developed models are used to study how the performance of processors changes with the number of processors in the system. The results illustrate very well the restriction on the number of processors imposed by the shared bus. All performance characteristics presented in this paper are obtained by discrete-event simulation of Petri net models.展开更多
The Fork-Join program consisting of K parallel tasks is a useful model for a large number of computing applications. When the parallel processor has multi-channels, later tasks may finish execution earlier than their ...The Fork-Join program consisting of K parallel tasks is a useful model for a large number of computing applications. When the parallel processor has multi-channels, later tasks may finish execution earlier than their earlier tasks and may join with tasks from other programs. This phenomenon is called exchangeable join (EJ), which introduces correlation to the task’s service time. In this work, we investigate the response time of multiprocessor systems with EJ with a new approach. We analyze two aspects of this kind of systems: exchangeable join (EJ) and the capacity constraint (CC). We prove that the system response time can be effectively reduced by EJ, while the reduced amount is constrained by the capacity of the multiprocessor. An upper bound model is constructed based on this analysis and a quick estimation algorithm is proposed. The approximation formula is verified by extensive simulation results, which show that the relative error of approximation is less than 5%.展开更多
The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared memory MIMD machines In this paper, the emphasis is put on the methods for design of the efficient parallel algori...The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared memory MIMD machines In this paper, the emphasis is put on the methods for design of the efficient parallel algorithms. The design of efficient parallel algorithms should be based on the following considerationst algorithm parallelism and the hardware-parallelism; granularity of the parallel algorithm, algorithm optimization according to the underling parallel machine. In this paper , these principles are applied to solve a model problem of the PDE. The speedup of the new method is high. The results were tested and evaluated on a shared memory MIMD machine. The practical results were agree with the predicted performance.展开更多
Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that m...Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to the overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs). However, the quest for high performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., PEs. Moreover, many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology, but also the resource requirement of each router. Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network, The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.展开更多
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order ...Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 7170, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.展开更多
Declarative Programming Languages (DPLs) apply a process model of Horn claun es such as PARLOG[8] or a reduction model of A-calculus such as SML[7] and are) in principle, well suited to multiprocessor implemelltation....Declarative Programming Languages (DPLs) apply a process model of Horn claun es such as PARLOG[8] or a reduction model of A-calculus such as SML[7] and are) in principle, well suited to multiprocessor implemelltation. However, the performance of a parallel declarative program can be impaired by a mismatch between the parallelism available in an application and the parallelism available in the architecture. A particularly attractive solution is to automatically match the parallelism of the program to the parallelism of the target hardware as a compilation step. In this paper) we present an optimizillg compilation technique called granularity analysis which identi fies and removes excess parallelism that would degrade performance. The main steps are: an analysis of the flow of data to form an attributed call graph between function (or predicate) arguments; and an asymptotic estimation of granularity of a function (or predicate) to generate approximate grain size. Compiled procedure calls can be annotated with grain size and a task scheduler can make scheduling decisions with the classilication scheme of grains to control parallelism at runtime. The resulting granularity analysis scheme is suitable for exploiting adaptive parallelism of declarative programming languages on multiprocessors.展开更多
The highlevel Compiler Intermediate Language CIL is a generalpurpose descripotion language of parallel graph rewriting computational model intended for parallelimplementatioll of declarative languages on multiprocesso...The highlevel Compiler Intermediate Language CIL is a generalpurpose descripotion language of parallel graph rewriting computational model intended for parallelimplementatioll of declarative languages on multiprocessor systems. In this paper, wefirst outline a new Hybrid Execution Model(HEM) and corresponding parallel abstract machine PAM/TGR based on Extended parallel Graph Rewriting ComputationalModel EGRCM for implementing CIL language on distributed memory multiprocessorsystems. Then we focus on the compiling CIL language with various optindzing techniques such as pattern matching, rule indexing, node ordering and compile-time partialscheduling. The experimental results on a 16-node Thansputer Array demonstrates the effectiveness of our model and strategies.展开更多
In this paper,we focus on the compiling implementation of parallel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel G...In this paper,we focus on the compiling implementation of parallel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model(HPGREM)is presented firstly.Then based on HPGREM,a parallel abstract machine PAM/TGR is described.Furthermore,several optimizing compilation schemes for executing declarative programs on transputer array are proposed. The performance statistics on a transputer array demonstrate the effectiveness of our model,parallel ab- stract machine,optimizing compilation strategies and compiler.展开更多
Earlier performance studies of multiple-bus multiprocessor systetns assume a ran-dom selection of competing requests for bus assignment and ignore the effects of realistic bus arbitration schemes on the performance of...Earlier performance studies of multiple-bus multiprocessor systetns assume a ran-dom selection of competing requests for bus assignment and ignore the effects of realistic bus arbitration schemes on the performance of such systetns. In this paper, we present performance analysis of the multiple-bus systems with different arbitration protocols.The priority protocols considered are random selection, fixed priority, rotating priority, roundrobin and FIFO. Analytica1 models are developed for each of these five dmerent priority protocoIs. Each of our analyses modeIs exactly the behavior of the correspond-ing priority protocol with little computation cost. The analytical models are validated through extensive simu1ations and are then used to carry out performance analysis and comparison of different priority protocols. Numerical results obtained from our models show that the round-robin protocol performs the best among the five protocols in the system with a few buses.展开更多
A method of providing redundancy to a class of bus-based multiprocessor arrays is discussed.The reconfiguration is hierarchical,providing global spare replacement at the array level and local reconfiguration within th...A method of providing redundancy to a class of bus-based multiprocessor arrays is discussed.The reconfiguration is hierarchical,providing global spare replacement at the array level and local reconfiguration within the spare block.Results of yield analysis performed on a 32 processor array are pres- ented.展开更多
It is relatively clear how to map regular, repetitive or grid oriented computations onto SIMD architectures. It is not so clear, however, how to do this for irregular computations even though there may be significant ...It is relatively clear how to map regular, repetitive or grid oriented computations onto SIMD architectures. It is not so clear, however, how to do this for irregular computations even though there may be significant amounts of intrinsic parallelism in branch free code. We study compilation techniques for this type of code when targeted to SIMD computers and illustrate their use on a simple model architecture.In this paper, we present one of the compilation techniques, global mpister allocation,we have developed for SIMD computers, and demonstrate that it can effectively allocate registers for parallelizing irregular computations in branch free code. This technique is an extension and a modification of the register allocation via graph coloring approach used by sequential compilers. Our performance results validate our method.展开更多
As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocess...As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and shared L2 caches from the chore of maintaining coherency, thereby reduces design complexity of CMPs. In this way, CCNoC also improves the performance of cache coherence protocol through reducing directory access latency and enhances scalability by avoiding massive directories overhead in shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of network on chip to maintain cache coherence states for blocks in L1 caches and manage directory information for recently accessed blocks in L2 caches respectively. CCNoC provides a scalable CMP framework to tackle cache coherency which is the foundation of CMP. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over the conventional chip multiprocessor and by 10% at best, while reduces storage overhead by 1.8% and saves directory storage by 88%, showing good scalability.展开更多
Increasing the life span and efficiency of Multiprocessor System on Chip(MPSoC)by reducing power and energy utilization has become a critical chip design challenge for multiprocessor systems.With the advancement of te...Increasing the life span and efficiency of Multiprocessor System on Chip(MPSoC)by reducing power and energy utilization has become a critical chip design challenge for multiprocessor systems.With the advancement of technology,the performance management of central processing unit(CPU)is changing.Power densities and thermal effects are quickly increasing in multi-core embedded technologies due to shrinking of chip size.When energy consumption reaches a threshold that creates a delay in complementary metal oxide semiconductor(CMOS)circuits and reduces the speed by 10%–15%because excessive on-chip temperature shortens the chip’s life cycle.In this paper,we address the scheduling&energy utilization problem by introducing and evaluating an optimal energy-aware earliest deadline first scheduling(EA-EDF)based technique formultiprocessor environments with task migration that enhances the performance and efficiency in multiprocessor systemon-chip while lowering energy and power consumption.The selection of core andmigration of tasks prevents the system from reaching itsmaximumenergy utilization while effectively using the dynamic power management(DPM)policy.Increase in the execution of tasks the temperature and utilization factor(u_(i))on-chip increases that dissipate more power.The proposed approach migrates such tasks to the core that produces less heat and consumes less power by distributing the load on other cores to lower the temperature and optimizes the duration of idle and sleep times across multiple CPUs.The performance of the EA-EDF algorithm was evaluated by an extensive set of experiments,where excellent results were reported when compared to other current techniques,the efficacy of the proposed methodology reduces the power and energy consumption by 4.3%–4.7%on a utilization of 6%,36%&46%at 520&624 MHz operating frequency when particularly in comparison to other energy-aware methods for MPSoCs.Tasks are running and accurately scheduled to make an energy-efficient processor by controlling and managing the thermal effects on-chip and optimizing the energy consumption of MPSoCs.展开更多
Suspicious mass traffic constantly evolves,making network behaviour tracing and structure more complex.Neural networks yield promising results by considering a sufficient number of processing elements with strong inte...Suspicious mass traffic constantly evolves,making network behaviour tracing and structure more complex.Neural networks yield promising results by considering a sufficient number of processing elements with strong interconnections between them.They offer efficient computational Hopfield neural networks models and optimization constraints used by undergoing a good amount of parallelism to yield optimal results.Artificial neural network(ANN)offers optimal solutions in classifying and clustering the various reels of data,and the results obtained purely depend on identifying a problem.In this research work,the design of optimized applications is presented in an organized manner.In addition,this research work examines theoretical approaches to achieving optimized results using ANN.It mainly focuses on designing rules.The optimizing design approach of neural networks analyzes the internal process of the neural networks.Practices in developing the network are based on the interconnections among the hidden nodes and their learning parameters.The methodology is proven best for nonlinear resource allocation problems with a suitable design and complex issues.The ANN proposed here considers more or less 46k nodes hidden inside 49 million connections employed on full-fledged parallel processors.The proposed ANN offered optimal results in real-world application problems,and the results were obtained using MATLAB.展开更多
Minimizing the energy consumption to increase the life span and performance of multiprocessor system on chip(MPSoC)has become an integral chip design issue for multiprocessor systems.The performance measurement of com...Minimizing the energy consumption to increase the life span and performance of multiprocessor system on chip(MPSoC)has become an integral chip design issue for multiprocessor systems.The performance measurement of computational systems is changing with the advancement in technology.Due to shrinking and smaller chip size power densities onchip are increasing rapidly that increasing chip temperature in multi-core embedded technologies.The operating speed of the device decreases when power consumption reaches a threshold that causes a delay in complementary metal oxide semiconductor(CMOS)circuits because high on-chip temperature adversely affects the life span of the chip.In this paper an energy-aware dynamic power management technique based on energy aware earliest deadline first(EA-EDF)scheduling is proposed for improving the performance and reliability by reducing energy and power consumption in the system on chip(SOC).Dynamic power management(DPM)enables MPSOC to reduce power and energy consumption by adopting a suitable core configuration for task migration.Task migration avoids peak temperature values in the multicore system.High utilization factor(ui)on central processing unit(CPU)core consumes more energy and increases the temperature on-chip.Our technique switches the core bymigrating such task to a core that has less temperature and is in a low power state.The proposed EA-EDF scheduling technique migrates load on different cores to attain stability in temperature among multiple cores of the CPU and optimized the duration of the idle and sleep periods to enable the low-temperature core.The effectiveness of the EA-EDF approach reduces the utilization and energy consumption compared to other existing methods and works.The simulation results show the improvement in performance by optimizing 4.8%on u_(i) 9%,16%,23%and 25%at 520 MHz operating frequency as compared to other energy-aware techniques for MPSoCs when the least number of tasks is in running state and can schedule more tasks to make an energy-efficient processor by controlling and managing the energy consumption of MPSoC.展开更多
This paper proposes an object oriented model scheduling for parallel computing in media MultiProcessors System on Chip(MPSoC).Firstly,the Coarse Grain Data Flow Graph(CGDFG) parallel programming model is used in this ...This paper proposes an object oriented model scheduling for parallel computing in media MultiProcessors System on Chip(MPSoC).Firstly,the Coarse Grain Data Flow Graph(CGDFG) parallel programming model is used in this approach.Secondly,this approach has the feature of unified abstraction for software objects implementing in processor and hardware objects implementing in ASICs,easy for mapping CGDFG programming on MPSoC.This approach cuts down the kernel overhead and reduces the code size effectively.The principle of the oriented object model,the method of scheduling,and how to map a parallel programming through CGDFG to the MPSoC are analyzed in this approach.This approach also compares the code size and execution cycles with conventional control flow scheduling,and presents respective management overhead for one application in me-dia-SoC.展开更多
In the context of real-time fault-tolerant scheduling in multiprocessor systems, Primary-backup scheme plays an important role. A backup copy is always preferred to be executed as passive backup copy whenever possible...In the context of real-time fault-tolerant scheduling in multiprocessor systems, Primary-backup scheme plays an important role. A backup copy is always preferred to be executed as passive backup copy whenever possible because it can take the advantages of backup copy de-allocation technique and overloading technique to improve schedulability. In this paper, we propose a novel efficient fault-tolerant ratemonotonic best-fit algorithm efficient fault-tolerant rate-monotonic best-fit (ERMBF) based on multiprocessors systems to enhance the schedulability. Unlike existing scheduling algorithms that start scheduling tasks with only one processor. ERMBF pre-allocates a certain amount of processors before starting scheduling tasks, which enlarge the searching spaces for tasks. Besides, when a new processor is allocated, we reassign the task copies that have already been assigned to the existing processors in order to find a superior tasks assignment configuration. These two strategies are all aiming at making as many backup copies as possible to be executed as passive status. As a result, ERMBF can use fewer processors to schedule a set of tasks without losing real-time and fault-tolerant capabilities of the system. Simulation results reveal that ERMBF significantly improves the schedulability over existing, comparable algorithms in literature.展开更多
Compared with accurate diagnosis, the system’s selfdiagnosing capability can be greatly increased through the t/kdiagnosis strategy at most k vertexes to be mistakenly identified as faulty under the comparison model,...Compared with accurate diagnosis, the system’s selfdiagnosing capability can be greatly increased through the t/kdiagnosis strategy at most k vertexes to be mistakenly identified as faulty under the comparison model, where k is typically a small number. Based on the Preparata, Metze, and Chien(PMC)model, the n-dimensional hypercube network is proved to be t/kdiagnosable. In this paper, based on the Maeng and Malek(MM)*model, a novel t/k-fault diagnosis(1≤k≤4) algorithm of ndimensional hypercube, called t/k-MM*-DIAG, is proposed to isolate all faulty processors within the set of nodes, among which the number of fault-free nodes identified wrongly as faulty is at most k. The time complexity in our algorithm is only O(2~n n~2).展开更多
文摘In shared-memory bus-based multiprocessors, when the number of processors grows, the processors spend an increasing amount of time waiting for access to the bus (and shared memory). This contention reduces the performance of processors and imposes a limitation of the number of processors that can be used efficiently in bus-based systems. Since the multi-processor’s performance depends upon many parameters which affect the performance in different ways, timed Petri nets are used to model shared-memory bus-based multiprocessors at the instruction execution level, and the developed models are used to study how the performance of processors changes with the number of processors in the system. The results illustrate very well the restriction on the number of processors imposed by the shared bus. All performance characteristics presented in this paper are obtained by discrete-event simulation of Petri net models.
基金Project supported by the National Natural Science Foundation of0 China (Nos. 60274011 and 60574067), and the Program for NewCentury Excellent Talents in University (No. NCET-04-0094), China
文摘The Fork-Join program consisting of K parallel tasks is a useful model for a large number of computing applications. When the parallel processor has multi-channels, later tasks may finish execution earlier than their earlier tasks and may join with tasks from other programs. This phenomenon is called exchangeable join (EJ), which introduces correlation to the task’s service time. In this work, we investigate the response time of multiprocessor systems with EJ with a new approach. We analyze two aspects of this kind of systems: exchangeable join (EJ) and the capacity constraint (CC). We prove that the system response time can be effectively reduced by EJ, while the reduced amount is constrained by the capacity of the multiprocessor. An upper bound model is constructed based on this analysis and a quick estimation algorithm is proposed. The approximation formula is verified by extensive simulation results, which show that the relative error of approximation is less than 5%.
文摘The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared memory MIMD machines In this paper, the emphasis is put on the methods for design of the efficient parallel algorithms. The design of efficient parallel algorithms should be based on the following considerationst algorithm parallelism and the hardware-parallelism; granularity of the parallel algorithm, algorithm optimization according to the underling parallel machine. In this paper , these principles are applied to solve a model problem of the PDE. The speedup of the new method is high. The results were tested and evaluated on a shared memory MIMD machine. The practical results were agree with the predicted performance.
文摘Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to the overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs). However, the quest for high performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., PEs. Moreover, many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology, but also the resource requirement of each router. Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network, The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.
基金supported in part by Spanish Government and European ERDF under Grant Nos. TIN2007-66423, TIN2010-21291-C02-01 and TIN2007-60625gaZ:T48 research group (Arag'on Government and European ESF)+1 种基金Consolider CSD2007-00050 (Spanish Government)HiPEAC-2 NoE (European FP7/ICT 217068)
文摘Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 7170, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
文摘Declarative Programming Languages (DPLs) apply a process model of Horn claun es such as PARLOG[8] or a reduction model of A-calculus such as SML[7] and are) in principle, well suited to multiprocessor implemelltation. However, the performance of a parallel declarative program can be impaired by a mismatch between the parallelism available in an application and the parallelism available in the architecture. A particularly attractive solution is to automatically match the parallelism of the program to the parallelism of the target hardware as a compilation step. In this paper) we present an optimizillg compilation technique called granularity analysis which identi fies and removes excess parallelism that would degrade performance. The main steps are: an analysis of the flow of data to form an attributed call graph between function (or predicate) arguments; and an asymptotic estimation of granularity of a function (or predicate) to generate approximate grain size. Compiled procedure calls can be annotated with grain size and a task scheduler can make scheduling decisions with the classilication scheme of grains to control parallelism at runtime. The resulting granularity analysis scheme is suitable for exploiting adaptive parallelism of declarative programming languages on multiprocessors.
文摘The highlevel Compiler Intermediate Language CIL is a generalpurpose descripotion language of parallel graph rewriting computational model intended for parallelimplementatioll of declarative languages on multiprocessor systems. In this paper, wefirst outline a new Hybrid Execution Model(HEM) and corresponding parallel abstract machine PAM/TGR based on Extended parallel Graph Rewriting ComputationalModel EGRCM for implementing CIL language on distributed memory multiprocessorsystems. Then we focus on the compiling CIL language with various optindzing techniques such as pattern matching, rule indexing, node ordering and compile-time partialscheduling. The experimental results on a 16-node Thansputer Array demonstrates the effectiveness of our model and strategies.
基金This work was partially supported by the National 863 High Technical Grant 863-306-101the National Doctoral Subject Foundation Grant 0249136.
文摘In this paper,we focus on the compiling implementation of parallel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model(HPGREM)is presented firstly.Then based on HPGREM,a parallel abstract machine PAM/TGR is described.Furthermore,several optimizing compilation schemes for executing declarative programs on transputer array are proposed. The performance statistics on a transputer array demonstrate the effectiveness of our model,parallel ab- stract machine,optimizing compilation strategies and compiler.
文摘Earlier performance studies of multiple-bus multiprocessor systetns assume a ran-dom selection of competing requests for bus assignment and ignore the effects of realistic bus arbitration schemes on the performance of such systetns. In this paper, we present performance analysis of the multiple-bus systems with different arbitration protocols.The priority protocols considered are random selection, fixed priority, rotating priority, roundrobin and FIFO. Analytica1 models are developed for each of these five dmerent priority protocoIs. Each of our analyses modeIs exactly the behavior of the correspond-ing priority protocol with little computation cost. The analytical models are validated through extensive simu1ations and are then used to carry out performance analysis and comparison of different priority protocols. Numerical results obtained from our models show that the round-robin protocol performs the best among the five protocols in the system with a few buses.
文摘A method of providing redundancy to a class of bus-based multiprocessor arrays is discussed.The reconfiguration is hierarchical,providing global spare replacement at the array level and local reconfiguration within the spare block.Results of yield analysis performed on a 32 processor array are pres- ented.
文摘It is relatively clear how to map regular, repetitive or grid oriented computations onto SIMD architectures. It is not so clear, however, how to do this for irregular computations even though there may be significant amounts of intrinsic parallelism in branch free code. We study compilation techniques for this type of code when targeted to SIMD computers and illustrate their use on a simple model architecture.In this paper, we present one of the compilation techniques, global mpister allocation,we have developed for SIMD computers, and demonstrate that it can effectively allocate registers for parallelizing irregular computations in branch free code. This technique is an extension and a modification of the register allocation via graph coloring approach used by sequential compilers. Our performance results validate our method.
基金supported by the National Natural Science Foundation of China under Grant Nos.60970002,60833004,60773146, and 60673145.
文摘As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and shared L2 caches from the chore of maintaining coherency, thereby reduces design complexity of CMPs. In this way, CCNoC also improves the performance of cache coherence protocol through reducing directory access latency and enhances scalability by avoiding massive directories overhead in shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of network on chip to maintain cache coherence states for blocks in L1 caches and manage directory information for recently accessed blocks in L2 caches respectively. CCNoC provides a scalable CMP framework to tackle cache coherency which is the foundation of CMP. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over the conventional chip multiprocessor and by 10% at best, while reduces storage overhead by 1.8% and saves directory storage by 88%, showing good scalability.
文摘Increasing the life span and efficiency of Multiprocessor System on Chip(MPSoC)by reducing power and energy utilization has become a critical chip design challenge for multiprocessor systems.With the advancement of technology,the performance management of central processing unit(CPU)is changing.Power densities and thermal effects are quickly increasing in multi-core embedded technologies due to shrinking of chip size.When energy consumption reaches a threshold that creates a delay in complementary metal oxide semiconductor(CMOS)circuits and reduces the speed by 10%–15%because excessive on-chip temperature shortens the chip’s life cycle.In this paper,we address the scheduling&energy utilization problem by introducing and evaluating an optimal energy-aware earliest deadline first scheduling(EA-EDF)based technique formultiprocessor environments with task migration that enhances the performance and efficiency in multiprocessor systemon-chip while lowering energy and power consumption.The selection of core andmigration of tasks prevents the system from reaching itsmaximumenergy utilization while effectively using the dynamic power management(DPM)policy.Increase in the execution of tasks the temperature and utilization factor(u_(i))on-chip increases that dissipate more power.The proposed approach migrates such tasks to the core that produces less heat and consumes less power by distributing the load on other cores to lower the temperature and optimizes the duration of idle and sleep times across multiple CPUs.The performance of the EA-EDF algorithm was evaluated by an extensive set of experiments,where excellent results were reported when compared to other current techniques,the efficacy of the proposed methodology reduces the power and energy consumption by 4.3%–4.7%on a utilization of 6%,36%&46%at 520&624 MHz operating frequency when particularly in comparison to other energy-aware methods for MPSoCs.Tasks are running and accurately scheduled to make an energy-efficient processor by controlling and managing the thermal effects on-chip and optimizing the energy consumption of MPSoCs.
基金This research is funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2022R 151)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.
文摘Suspicious mass traffic constantly evolves,making network behaviour tracing and structure more complex.Neural networks yield promising results by considering a sufficient number of processing elements with strong interconnections between them.They offer efficient computational Hopfield neural networks models and optimization constraints used by undergoing a good amount of parallelism to yield optimal results.Artificial neural network(ANN)offers optimal solutions in classifying and clustering the various reels of data,and the results obtained purely depend on identifying a problem.In this research work,the design of optimized applications is presented in an organized manner.In addition,this research work examines theoretical approaches to achieving optimized results using ANN.It mainly focuses on designing rules.The optimizing design approach of neural networks analyzes the internal process of the neural networks.Practices in developing the network are based on the interconnections among the hidden nodes and their learning parameters.The methodology is proven best for nonlinear resource allocation problems with a suitable design and complex issues.The ANN proposed here considers more or less 46k nodes hidden inside 49 million connections employed on full-fledged parallel processors.The proposed ANN offered optimal results in real-world application problems,and the results were obtained using MATLAB.
文摘Minimizing the energy consumption to increase the life span and performance of multiprocessor system on chip(MPSoC)has become an integral chip design issue for multiprocessor systems.The performance measurement of computational systems is changing with the advancement in technology.Due to shrinking and smaller chip size power densities onchip are increasing rapidly that increasing chip temperature in multi-core embedded technologies.The operating speed of the device decreases when power consumption reaches a threshold that causes a delay in complementary metal oxide semiconductor(CMOS)circuits because high on-chip temperature adversely affects the life span of the chip.In this paper an energy-aware dynamic power management technique based on energy aware earliest deadline first(EA-EDF)scheduling is proposed for improving the performance and reliability by reducing energy and power consumption in the system on chip(SOC).Dynamic power management(DPM)enables MPSOC to reduce power and energy consumption by adopting a suitable core configuration for task migration.Task migration avoids peak temperature values in the multicore system.High utilization factor(ui)on central processing unit(CPU)core consumes more energy and increases the temperature on-chip.Our technique switches the core bymigrating such task to a core that has less temperature and is in a low power state.The proposed EA-EDF scheduling technique migrates load on different cores to attain stability in temperature among multiple cores of the CPU and optimized the duration of the idle and sleep periods to enable the low-temperature core.The effectiveness of the EA-EDF approach reduces the utilization and energy consumption compared to other existing methods and works.The simulation results show the improvement in performance by optimizing 4.8%on u_(i) 9%,16%,23%and 25%at 520 MHz operating frequency as compared to other energy-aware techniques for MPSoCs when the least number of tasks is in running state and can schedule more tasks to make an energy-efficient processor by controlling and managing the energy consumption of MPSoC.
基金Supported by National Natural Science Foundation ofChina (No.60873112)
文摘This paper proposes an object oriented model scheduling for parallel computing in media MultiProcessors System on Chip(MPSoC).Firstly,the Coarse Grain Data Flow Graph(CGDFG) parallel programming model is used in this approach.Secondly,this approach has the feature of unified abstraction for software objects implementing in processor and hardware objects implementing in ASICs,easy for mapping CGDFG programming on MPSoC.This approach cuts down the kernel overhead and reduces the code size effectively.The principle of the oriented object model,the method of scheduling,and how to map a parallel programming through CGDFG to the MPSoC are analyzed in this approach.This approach also compares the code size and execution cycles with conventional control flow scheduling,and presents respective management overhead for one application in me-dia-SoC.
基金Supported by the National Basic Reseach Program of China (973 Program 2004 CB318200)
文摘In the context of real-time fault-tolerant scheduling in multiprocessor systems, Primary-backup scheme plays an important role. A backup copy is always preferred to be executed as passive backup copy whenever possible because it can take the advantages of backup copy de-allocation technique and overloading technique to improve schedulability. In this paper, we propose a novel efficient fault-tolerant ratemonotonic best-fit algorithm efficient fault-tolerant rate-monotonic best-fit (ERMBF) based on multiprocessors systems to enhance the schedulability. Unlike existing scheduling algorithms that start scheduling tasks with only one processor. ERMBF pre-allocates a certain amount of processors before starting scheduling tasks, which enlarge the searching spaces for tasks. Besides, when a new processor is allocated, we reassign the task copies that have already been assigned to the existing processors in order to find a superior tasks assignment configuration. These two strategies are all aiming at making as many backup copies as possible to be executed as passive status. As a result, ERMBF can use fewer processors to schedule a set of tasks without losing real-time and fault-tolerant capabilities of the system. Simulation results reveal that ERMBF significantly improves the schedulability over existing, comparable algorithms in literature.
基金supported by the National Natural Science Foundation of China(61363002)
文摘Compared with accurate diagnosis, the system’s selfdiagnosing capability can be greatly increased through the t/kdiagnosis strategy at most k vertexes to be mistakenly identified as faulty under the comparison model, where k is typically a small number. Based on the Preparata, Metze, and Chien(PMC)model, the n-dimensional hypercube network is proved to be t/kdiagnosable. In this paper, based on the Maeng and Malek(MM)*model, a novel t/k-fault diagnosis(1≤k≤4) algorithm of ndimensional hypercube, called t/k-MM*-DIAG, is proposed to isolate all faulty processors within the set of nodes, among which the number of fault-free nodes identified wrongly as faulty is at most k. The time complexity in our algorithm is only O(2~n n~2).