This article presents a method for dynamic, real-time communication between a serial port and peripheral equipment under Windows. It uses multithreading and builds on the basic principles of serial communication and the Windows multitasking mechanism. The method effectively solves the problem of real-time response in serial communication, reduces the data loss rate, and improves system reliability. It is a practical, general-purpose approach to serial communication.
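A minimal Python sketch of the reader-thread pattern that this abstract describes: a dedicated thread drains the port and hands data to the main thread through a queue, so slow processing does not cause received bytes to be dropped. The abstract gives no implementation details, so `FakePort`, `reader`, and `receive_all` are hypothetical names, and the simulated port stands in for a real Windows serial handle.

```python
import queue
import threading


class FakePort:
    """Hypothetical stand-in for a serial port: returns preset chunks, then b''."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def read(self):
        return self._chunks.pop(0) if self._chunks else b""


def reader(port, inbox):
    """Runs on its own thread: move data from the port into the queue."""
    while True:
        data = port.read()
        if not data:        # simulated end of stream
            break
        inbox.put(data)
    inbox.put(None)         # sentinel telling the consumer that no more data follows


def receive_all(port):
    """Main-thread consumer: collect everything the reader thread delivers."""
    inbox = queue.Queue()
    worker = threading.Thread(target=reader, args=(port, inbox), daemon=True)
    worker.start()
    received = []
    while True:
        item = inbox.get()  # blocks until the reader thread supplies data
        if item is None:
            break
        received.append(item)
    worker.join()
    return b"".join(received)
```

With a real device, the `read` call would block on the port instead of returning preset chunks; the queue hand-off stays the same.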
A transient fault detection mechanism is added to a simultaneous multithreading architecture. By exploiting both instruction-level parallelism (ILP) and thread-level parallelism (TLP), a simultaneous multithreading (SMT) fault-tolerant processor can be expected to achieve a better tradeoff between performance and hardware cost than traditional fault-tolerant processors. Detailed simulations of three SPEC95 benchmarks show that executing two redundant programs on the fault-tolerant microarchitecture takes only 40%–61% longer than running a single version of the program. The new instruction fetch algorithm improves performance by 0.4%–1% on most of the randomly chosen benchmarks.
To counter the ever-increasing susceptibility of processors to transient faults, various redundant multithreading (RMT) architectures have been proposed, and RMT is becoming one of the most effective approaches for detecting and recovering from transient faults. This paper surveys a wide range of RMT architectures, from the original AR-SMT (A-stream R-stream Simultaneous Multithreading) to the more recent SD-SRT (Slack-Decode Simultaneous Redundant Threading), analyzes and compares them, and thereby shows how RMT has evolved and where it is heading. Finally, some directions and suggestions for further RMT research and development are put forward.
To eliminate the energy wasted by traditional static hardware multithreaded processors when real-time embedded systems run under low workload, the energy efficiency of hardware multithreading is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves energy by removing idle threads without modifying the original architecture, and it implements a seamless switching mechanism that protects active threads and avoids pipeline stalls during power-mode switching. A synthesis-tool report for a dynamic multithreaded processor implemented in a 45 nm process indicates that, to achieve dynamic power dissipation, the dynamic multithreaded architecture needs only 2.27% more area than the static one, and that it consumes 1.3% more power at the same peak performance.
Thread partitioning plays an important role in speculative multithreading (SpMT) for the automatic parallelization of irregular programs. Applying the same partition parameter values to all applications means that not every application obtains its optimal partition scheme. In this paper, five parameters affecting thread partitioning are extracted from heuristic rules: the dependence threshold (DT), the lower limit of thread size (TSL), the upper limit of thread size (TSU), the lower limit of spawning distance (SDL), and the upper limit of spawning distance (SDU). Their ranges are determined in accordance with the heuristic rules, and their step sizes are set empirically. With speedup as the objective function, all combinations of the five threshold values form the solution space, and the aim is to find the combination that yields the best thread granularity, thread dependence, and spawning distance, so that every application has its best partition scheme. The task can be formulated as a single-objective optimization problem, and the artificial immune algorithm (AIA) is used to search for the optimal solution. The process is carried out with the Olden benchmarks on Prophet, a generic SpMT processor for evaluating the performance of multithreaded programs. Experiments show that the optimal parameter values can be obtained for every benchmark, and that Olden benchmarks partitioned with the optimized parameter values deliver a performance improvement of 3.00% on a 4-core platform compared with a machine-learning-based approach, and 8.92% compared with a heuristics-based approach.
The superiority of hypothetical quantum computers is due not to faster calculations but to different calculation schemes running on special hardware. The core of quantum computing follows from the way the state of a quantum system is defined when basic things interact with each other. In the conventional approach, this is implemented through the tensor product of qubits. In the geometric algebra formalism, the simultaneous availability of all results for non-measured observables is based on the definition of states as points on a three-dimensional sphere.
We used a Raspberry Pi 4B to develop a microbial monitoring system that simplifies the microbial image-capturing process and facilitates the informatization of microbial observation results. The Raspberry Pi 4B firmware, developed in Python on the Linux platform, provides sum verification of serial data, file upload over TCP, control of the sequence light source and light valve, real-time self-test based on multithreading, and an experiment-oriented file management method. The system demonstrated improved code logic, scheduling, exception handling, and code readability.
Slack-Decode Simultaneously and Redundantly Threaded (SD-SRT) is proposed for detecting transient faults in processors. SD-SRT boosts the performance of the previously proposed SRT by eliminating redundant instruction fetches. First, the fetch stage is moved out of the Sphere of Replication (SoR), and a unified instruction fetch queue (IFQ) is shared by the leading and trailing threads. Second, a scheme called slack-decode cooperates with the unified IFQ to coordinate the progress of the two threads. Simulations show that SD-SRT outperforms the original SRT by 15% in terms of IPC and decreases I-cache accesses by 42%. SD-SRT also reduces the size and complexity of hardware structures such as the load-value queue and the store buffer.
Programs exhibit changing behavior at runtime in a simultaneous multithreading (SMT) environment. How reasonably shared resources are distributed among the threads largely determines the throughput and fairness of SMT processors. Existing resource distribution methods either rely mainly on the front-end fetch policy or make distribution decisions based on limited information from the pipeline, so it is difficult for them to track the varying resource requirements of the threads efficiently. This work presents a spatially triggered dissipative resource distribution (SDRD) policy for SMT processors. Its two parts, a self-organization mechanism driven by real-time instructions-per-cycle (IPC) performance and the introduction of chaos to control the diversity of trial resource distributions, work together to sustain resource distribution optimization as program behavior changes. Simulation results show that SDRD with fine-grained diversity control is more effective than SDRD with coarse-grained control, and that SDRD benefits greatly from its two well-coordinated parts, providing potential fairness gains as well as good throughput gains. The meanings and settings of important SDRD parameters are also discussed.
Based on Simultaneous Multithreading (SMT), we propose a fault-tolerant scheme called the Tri-modular Redundantly and Simultaneously Threaded processor with Recovery (TRSTR). TRSTR has the following features. First, we introduce an arbitrator context into the conventional SRT (Simultaneous and Redundantly Threaded) processor; it acts as an arbitrator when results from the other two contexts disagree and otherwise runs as an ordinary thread, thus making full use of SMT's parallelism. Second, we add a reconfigurable feature to the sphere of replication in SRT, making it more flexible for changing demands and situations. Third, TRSTR has two working modes, Tri-Simultaneous with Voting (TSV) and Dual-Simultaneous with Arbitrator (DSA), and can switch between them at will. Finally, in addition to transient-fault coverage, TRSTR has on-line self-checking and self-recovery abilities, so it can mask some permanent faults and reconfigure itself without stopping the crucial job, improving its reliability and availability.
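The two working modes named in this abstract can be illustrated at the software level with a short Python sketch. It is only an analogy under stated assumptions: the real mechanism operates on hardware thread contexts, and the function names `vote` and `dual_with_arbitrator` are hypothetical.

```python
from collections import Counter


def vote(a, b, c):
    """TSV-style check: return the value at least two of three copies agree on, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


def dual_with_arbitrator(a, b, arbitrate):
    """DSA-style check: compare two copies and run a third only when they disagree.

    `arbitrate` is a zero-argument callable that recomputes the result.
    """
    if a == b:
        return a            # common case: the arbitrator stays free for ordinary work
    return vote(a, b, arbitrate())
```

The second function shows why the arbitrator context can run as an ordinary thread most of the time: it is consulted only on a mismatch.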
This paper presents data-processing middleware for a high-power neutral beam injection (NBI) control system. The middleware, based on TCP/IP and multithreading technologies, focuses mainly on data processing and transmission. It separates data processing and compression from data acquisition and storage, and it provides universal transmission interfaces for different software environments, such as WinCC, LabVIEW, and other measurement systems. Experimental data acquired on Windows, QNX, and Linux platforms are processed by the middleware and sent to the monitoring applications. There are three middleware deployment models: serial processing, parallel processing, and alternate serial processing. With these models, the middleware solves real-time data-processing problems in heterogeneous environments where the acquisition hardware runs different operating systems and data applications.
To ensure that data are unique and identifiable, and to make the data of all subsystems of the neutral beam injector (NBI) easy to analyze and process, all subsystems must share a unified system time. This paper presents timing synchronization software that combines several technologies, including shared memory, multithreading, and TCP. Shared memory lets the server store client information and the system time, and multithreading lets the server handle different clients on different threads. The server runs on Linux, and the clients run on Linux and Windows. With this design, all subsystems can be synchronized to within one second, which is accurate enough for the NBI system and thus ensures the reliability of the data.
For remote control of a neutral beam injection (NBI) system, software called NBIcsw has been developed to run on the control server. It meets the requirements for data transmission and operation control between the NBI measurement and control layer (MCL) and the remote monitoring layer (RML). NBIcsw runs on Linux and was developed using the client/server (C/S) model and multithreading. Application results show that the software is efficient.
Scalability is one of the most important nonfunctional requirements of server applications, because it maintains effective performance under large, fluctuating, and sometimes unpredictable workloads. To achieve scalability, the thread pool system (TPS) has been used extensively as a middleware service in server applications. The size of the thread pool is the most significant factor affecting overall server performance, and determining the optimal size dynamically at runtime is a challenging problem. The simplest and most widely used way to tackle this problem is to keep the size of the thread pool equal to the request rate, i.e., the frequency-oriented thread pool (FOTP). FOTPs are the most widely used TPSs in industry because of their simple implementation, negligible overhead, and applicability to any system. However, frequency-based schemes focus on only one aspect of load change, namely fluctuations in the request rate, and the request rate alone is an imperfect knob for scaling a thread pool. This paper therefore presents a workload-profiling-based FOTP that uses request size (the service time of a request) in addition to the request rate as a knob for scaling the thread pool at runtime, because we argue that the combination of both truly represents load fluctuation in server-side applications. We evaluated the proposed system against the state-of-the-art TPS of Oracle Corporation using a client-server-based simulator and concluded that our system outperformed it in terms of both response time and throughput.
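The difference between the two sizing knobs can be sketched in a few lines of Python. The abstract does not give the paper's actual sizing formula, so the second function is an assumption: it uses Little's law (threads needed ≈ request rate × mean service time) as one plausible way to combine the two signals. All names and the `max_threads` cap are hypothetical.

```python
import math


def pool_size_rate_only(request_rate):
    """Frequency-oriented sizing: one thread per request arriving each second."""
    return max(1, math.ceil(request_rate))


def pool_size_rate_and_service(request_rate, mean_service_time_s, max_threads=512):
    """Assumed sizing via Little's law: concurrency = arrival rate x time in service."""
    needed = math.ceil(request_rate * mean_service_time_s)
    return max(1, min(max_threads, needed))
```

For example, at 40 requests per second with a mean service time of 0.25 s, the rate-only rule allocates 40 threads, while the combined rule allocates 10, since about 10 requests are in service at any moment.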
Developing a high-performance public key cryptosystem is crucial for numerous modern security applications. The Elliptic Curve Cryptosystem (ECC) has performance and resource-saving advantages over other types of asymmetric ciphers. However, sequential implementations of ECC do not satisfy the performance requirements of current applications. Several factors should therefore be considered to boost cryptosystem performance, including the coordinate system, the scalar multiplication algorithm, and the elliptic curve form. The tripling-oriented (3DIK) form is implemented in this work because of its minimal computational complexity compared with other elliptic curve forms. This experimental study explores the factors that play an important role in ECC performance to determine the combination that leads to a high-speed ECC. The proposed cryptosystem uses a parallel software implementation to speed up ECC performance. To our knowledge, previous studies have no similar software implementation for 3DIK ECC. Supported by a parallel design, projective coordinates, and a fast scalar multiplication algorithm, the proposed 3DIK ECC improved the speed of the encryption process compared with its counterparts and with the usual sequential implementation. The highest performance for 3DIK ECC was achieved when it was implemented using the Non-Adjacent Form algorithm and homogeneous projection. Compared with costly hardware implementations, the proposed software implementation is cost-effective and can easily be adapted to other environments. In addition, the power consumption of the proposed ECC is analyzed and compared with that of other known cryptosystems. Thus, the current study presents a detailed overview of the design and implementation of 3DIK ECC.
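The Non-Adjacent Form (NAF) mentioned in this abstract is a standard signed-digit recoding of the scalar in which no two adjacent digits are nonzero, which reduces the number of point additions. The Python sketch below shows the textbook binary NAF and a generic double-and-add loop. It is not the paper's implementation: the group operations are passed in as parameters, and the 3DIK tripling formulas and the parallel design are not reproduced.

```python
def naf(k):
    """Return the NAF digits of k >= 0, least significant first, each in {-1, 0, 1}."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # k is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mult(k, point, add, neg, identity):
    """Compute k * point with signed double-and-add over the NAF digits of k.

    `add`, `neg`, and `identity` describe the group; for a curve they would be
    point addition, point negation, and the point at infinity.
    """
    result = identity
    for d in reversed(naf(k)):
        result = add(result, result)          # doubling step
        if d == 1:
            result = add(result, point)
        elif d == -1:
            result = add(result, neg(point))  # subtraction costs one addition on a curve
    return result
```

As a quick check with integers under addition standing in for curve points, `scalar_mult(7, 5, lambda a, b: a + b, lambda a: -a, 0)` walks the digits of 7 = 8 - 1 and returns 35.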
With the development of satellite remote sensing technology, increasing demands are being placed on the timeliness and stability of satellite weather service systems. The FY satellite rainfall estimate day knock off product algorithm takes a long time to run, about 20 minutes, which delays the generation of the estimated rainfall product. Developing parallel optimization algorithms that meet the needs of satellite meteorological services, and demonstrating their effectiveness in practice, is a necessary way to improve the performance and availability of those services. To address this problem, we studied parallel algorithms based on an analysis of the precipitation estimation algorithm. First, we explained the steps of the precipitation estimate day knock off product algorithm. Second, we analyzed the amount of computation in its four main calculation modules. Third, we designed a multithreaded parallel algorithm and an MPI parallelization. Finally, we implemented both. Experimental results show that both the multithreaded and the MPI parallelization greatly improve overall computational efficiency, and that the MPI mode is the more efficient of the two. The performance of parallel processing is closely related to the architecture of the computer. From the perspective of service scheduling and product algorithms, the MPI parallelization approach is adopted to improve service quality.
Big data analytics is emerging as one of the most important kinds of workload in modern data centers. It is therefore of great interest to identify how to achieve the best performance for big data analytics workloads running on state-of-the-art simultaneous multithreading (SMT) processors, which requires a comprehensive understanding of workload characteristics. This paper chooses Spark workloads as representative big data analytics workloads and performs comprehensive measurements on the POWER8 platform, which supports a wide range of multithreading. The research finds that the thread assignment policy and cache contention have significant impacts on application performance. To identify potential optimization methods from the experimental results, this study performs microarchitecture-level characterization using hardware performance counters and draws implications accordingly.
The convolutional neural network (CNN) is an essential model for achieving high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the important issues in accelerating CNNs with high energy efficiency and processing performance is efficient data reuse that exploits the inherent data locality. In this paper, we propose a novel CGRA (Coarse-Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. Multithreading on each processing element enables input data to be reused across multiple computation periods. This paper presents the accelerator design and a performance analysis of the proposed architecture. We examine the structure of the memory subsystems, as well as the architecture of the computing array, to supply the required data with minimal performance overhead. We explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wider (in the earlier layers of many CNNs), while the input data locality can be exploited maximally when the number of output channels is larger (in the later layers).
The general m-machine permutation flowshop problem with the total flow-time objective is known to be NP-hard for m ≥ 2. The only practical methods for finding optimal solutions have been branch-and-bound algorithms. In this paper, we present an improved sequential algorithm based on a strict alternation of Generation and Exploration execution modes and on Depth-First/Best-First hybrid strategies. The experimental results show that the proposed scheme performs better than the algorithm in [1]. More importantly, our method can easily be extended and implemented with lightweight threads to reduce execution times. Good speedups can be obtained on shared-memory multicore systems.
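To make the objective concrete, the Python sketch below computes the total flow time of a job permutation and runs a plain depth-first branch-and-bound over all permutations. It is a simplified illustration, not the algorithm of the paper: the Generation/Exploration alternation, the Depth-First/Best-First hybrid, and the threading are omitted, and the bound used here (the flow time of the partial sequence) is a deliberately weak assumption chosen for brevity. All function names are hypothetical.

```python
from itertools import permutations


def total_flow_time(order, p):
    """Sum of job completion times on the last machine; p[job][machine] is a processing time."""
    ready = [0] * len(p[0])       # time at which each machine becomes free
    total = 0
    for job in order:
        t = 0                     # completion time of this job on the previous machine
        for k, duration in enumerate(p[job]):
            t = max(t, ready[k]) + duration
            ready[k] = t
        total += t
    return total


def branch_and_bound(p):
    """Depth-first search over job prefixes, pruning prefixes that cannot beat the incumbent."""
    best = [float("inf"), None]

    def dfs(prefix, remaining):
        cost = total_flow_time(prefix, p)
        if cost >= best[0]:
            return                # the remaining jobs can only add to the flow time
        if not remaining:
            best[0], best[1] = cost, tuple(prefix)
            return
        for job in sorted(remaining):
            dfs(prefix + [job], remaining - {job})

    dfs([], frozenset(range(len(p))))
    return best[0], best[1]


def brute_force(p):
    """Reference optimum for tiny instances, used only to check the search."""
    return min(total_flow_time(order, p) for order in permutations(range(len(p))))
```

The subtrees under different first jobs are independent apart from the shared incumbent, which is what makes this kind of search a natural fit for the lightweight-thread extension the abstract mentions.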
Funding (abstract on transient fault detection in an SMT fault-tolerant processor): Supported by the National Natural Science Foundation of China (60103002).
Funding (abstract surveying RMT architectures): Supported by the National Natural Science Foundation of China (60503015).
Funding (abstract on the dynamic multithreaded architecture): Supported partially by the National High Technical Research and Development Program of China (863 Program) under Grants No. 2011AA040101 and No. 2008AA01Z134; the National Natural Science Foundation of China under Grants No. 61003251, No. 61172049, and No. 61173150; the Doctoral Fund of Ministry of Education of China under Grant No. 20100006110015; the Beijing Municipal Natural Science Foundation under Grant No. Z111100054011078; and the 2012 Ladder Plan Project of Beijing Key Laboratory of Knowledge Engineering for Materials Science under Grant No. Z121101002812005.
Funding (abstract on SpMT thread partitioning): Supported by the National Natural Science Foundation of China (No. 61173040) and the Doctoral Fund of Ministry of Education of China (No. 2013021110012).
Funding (abstract on the SDRD policy for SMT processors): Supported by the Hi-Tech Research and Development Program (863) of China (No. 2006AA01Z431) and the Key Science and Technology Program of Zhejiang Province, China (Nos. 2007C11068 and 2007C11088).
Funding (abstract on TRSTR): Supported by the 10th Five-Year National Defence Pre-Research Project (41316.1.2).
Funding: Supported by the National Natural Science Foundation of China (No. 10875146)
Abstract: A set of data-processing middleware for a high-powered neutral beam injection (NBI) control system is presented in this paper. The middleware, based on TCP/IP and multi-threading technologies, focuses mainly on data processing and transmission. It separates data processing and compression from data acquisition and storage. It provides universal transmission interfaces for different software environments, such as WinCC, LabVIEW and other measurement systems. The experimental data acquired on Windows, QNX and Linux platforms are processed by the middleware and sent to the monitoring applications. There are three middleware deployment models: serial processing, parallel processing and alternate serial processing. By using these models, the middleware solves real-time data-processing problems across heterogeneous acquisition hardware with different operating systems and data applications.
Funding: Supported by the National Natural Science Foundation of China (No. 11075183) and the Knowledge Innovation Program of the Chinese Academy of Sciences (the study of neutral beam steady-state operation of the key technical and physical problems)
Abstract: To ensure the uniqueness and recognizability of data and to make it easy to analyze and process the data of all subsystems of the neutral beam injector (NBI), all subsystems must share a unified system time. This paper presents timing synchronization software that combines several technologies, including shared memory, multithreading and the TCP protocol. Shared memory lets the server store client information and the system time, and multithreading lets the server handle different clients in different threads. The server runs under the Linux operating system, while clients run under either Linux or Windows. With this design, all subsystems can be synchronized to within one second, which is accurate enough for the NBI system and thus ensures the reliability of the data.
Funding: Supported by the National Natural Science Foundation of China (No. 10875146)
Abstract: For the remote control of a neutral beam injection (NBI) system, a software package named NBIcsw has been developed to run on the control server. It meets the requirements of data transmission and operation control between the NBI measurement and control layer (MCL) and the remote monitoring layer (RML). NBIcsw runs on a Linux system and is developed with a client/server (C/S) model and multithreading technology. Its application shows that the software works with good efficiency.
Abstract: Scalability is one of the most important nonfunctional requirements of server applications, because it maintains effective performance under large, fluctuating and sometimes unpredictable workloads. To achieve scalability, the thread pool system (TPS) has been used extensively as a middleware service in server applications. The size of the thread pool is the most significant factor affecting the overall performance of servers. Determining the optimal thread pool size dynamically at runtime is a challenging problem. The simplest and most widely used method to tackle this problem is to keep the size of the thread pool equal to the request rate, i.e., the frequency-oriented thread pool (FOTP). FOTPs are the most widely used TPSs in industry because of their implementation simplicity, negligible overhead and applicability to any system. However, frequency-based schemes focus on only one aspect of load change, namely fluctuations in the request rate. The request rate alone is an imperfect knob for scaling the thread pool. Thus, this paper presents a workload-profiling-based FOTP that uses request size (the service time of a request) in addition to the request rate as a knob to scale the thread pool at runtime, because we argue that the combination of both truly represents the load fluctuation in server-side applications. We evaluated the proposed system against the state-of-the-art TPS of Oracle Corporation (using a client-server-based simulator) and concluded that our system outperformed it in terms of both response time and throughput.
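To make the contrast between the two sizing knobs concrete, here is a minimal sketch. It is not the paper's algorithm: the abstract does not give a formula, so the second function assumes a Little's-law style estimate (requests in service equals arrival rate times mean service time). The names `fotp_size` and `profiled_size` and the `max_threads` cap are hypothetical.

```python
import math


def fotp_size(request_rate):
    """Frequency-oriented sizing: one thread per request per second."""
    return max(1, math.ceil(request_rate))


def profiled_size(request_rate, mean_service_time_s, max_threads=512):
    """Rate-and-size sizing (assumed Little's-law estimate).

    Concurrency needed is roughly request_rate * mean_service_time_s,
    clamped to [1, max_threads].
    """
    needed = math.ceil(request_rate * mean_service_time_s)
    return max(1, min(max_threads, needed))


# Same request rate, different request sizes.
print(fotp_size(200), profiled_size(200, 0.25), profiled_size(200, 2.0))
```

With a fixed rate of 200 requests per second, the rate-only rule always asks for 200 threads, while the rate-and-size estimate asks for far fewer threads when requests are short and more when they are long, which is the kind of load fluctuation the abstract argues a rate-only knob cannot see.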
Abstract: Developing a high-performance public key cryptosystem is crucial for numerous modern security applications. The Elliptic Curve Cryptosystem (ECC) has performance and resource-saving advantages compared to other types of asymmetric ciphers. However, the sequential design implementation for ECC does not satisfy the performance requirements of current applications. Therefore, several factors should be considered to boost cryptosystem performance, including the coordinate system, the scalar multiplication algorithm, and the elliptic curve form. The tripling-oriented (3DIK) form is implemented in this work due to its minimal computational complexity compared to other elliptic curve forms. This experimental study explores the factors playing an important role in ECC performance to determine the best combination for developing a high-speed ECC. The proposed cryptosystem uses a parallel software implementation to speed up ECC performance. To our knowledge, previous studies have no similar software implementation for 3DIK ECC. Supported by a parallel design, projective coordinates, and a fast scalar multiplication algorithm, the proposed 3DIK ECC improved the speed of the encryption process compared with its counterparts and the usual sequential implementation. The highest performance level for 3DIK ECC was achieved when it was implemented using the Non-Adjacent Form algorithm and homogeneous projection. Compared to costly hardware implementations, the proposed software implementation is cost-effective and can be easily adapted to other environments. In addition, the power consumption of the proposed ECC is analyzed and compared with other known cryptosystems. Thus, the current study presents a detailed overview of the design and implementation of 3DIK ECC.
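The Non-Adjacent Form (NAF) mentioned above can be sketched briefly. NAF rewrites a scalar with digits in {-1, 0, 1} so that no two adjacent digits are nonzero, which reduces the number of point additions in double-and-add scalar multiplication. The code below is a generic illustration, not the paper's parallel 3DIK implementation: `scalar_mul` takes the group operations as arguments, and the usage line substitutes integer addition as a stand-in group so the result can be checked without any curve arithmetic.

```python
def naf(k):
    """Return the NAF digits of a non-negative integer k, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 when k % 4 == 1, -1 when k % 4 == 3
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mul(k, point, add, neg, identity):
    """Compute k * point by signed double-and-add over the NAF digits.

    `add`, `neg` and `identity` describe the group; for an elliptic curve
    they would be point addition, point negation and the point at infinity.
    """
    result = identity
    for d in reversed(naf(k)):
        result = add(result, result)  # doubling step
        if d == 1:
            result = add(result, point)
        elif d == -1:
            result = add(result, neg(point))
    return result


# Stand-in group: integers under addition, so k * P is ordinary multiplication.
print(naf(7), scalar_mul(29, 5, lambda a, b: a + b, lambda a: -a, 0))
```

For example, 7 is 111 in binary (three nonzero digits) but its NAF is 8 - 1, which has only two nonzero digits.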
Abstract: With the development of satellite remote sensing technology, increasing demands are placed on the timeliness and stability of the satellite weather service system. The FY satellite rainfall estimate day knock off product algorithm has a long running time, about 20 minutes, which affects the timeliness of the generated rainfall estimate product. Researching and developing parallel optimization algorithms based on the needs of satellite meteorological services, and verifying their effectiveness in practical applications, are necessary to enhance the performance and availability of those services. To address this problem, we carried out parallel algorithm research based on an analysis of the precipitation estimation algorithm. Firstly, we explained the steps of the precipitation estimated date knock off product algorithm; secondly, we analyzed the computational load of the four main calculation modules; thirdly, we designed a multithreaded parallel algorithm and an MPI parallelization. Finally, both the multithreaded and the MPI parallelizations were implemented. Experimental results show that both could greatly improve overall computational efficiency, and that the MPI parallelization mode has higher operating efficiency. The performance of parallel processing is closely related to the architecture of the computer. From the perspective of service scheduling and product algorithms, the MPI parallelization approach is adopted to improve service quality.
Funding: Supported by the National High Technology Research and Development Program of China (No. 2015AA015308) and the State Key Development Program for Basic Research of China (No. 2014CB340402)
Abstract: Big data analytics is emerging as one of the most important kinds of workloads in modern data centers. Hence, it is of great interest to identify how to achieve the best performance for big data analytics workloads running on state-of-the-art SMT (simultaneous multithreading) processors, which requires a comprehensive understanding of workload characteristics. This paper chooses Spark workloads as representative big data analytics workloads and performs comprehensive measurements on the POWER8 platform, which supports a wide range of multithreading modes. The research finds that the thread assignment policy and cache contention have significant impacts on application performance. To identify potential optimization methods from the experimental results, this study performs micro-architecture-level characterizations by means of hardware performance counters and draws implications accordingly.
Abstract: The convolutional neural network (CNN) is an essential model for achieving high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the important issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse by exploiting the inherent data locality. In this paper, we propose a novel CGRA (Coarse Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. The multithreading on each processing element enables input data to be reused across multiple computation periods. This paper presents the accelerator design and a performance analysis of the proposed architecture. We examine the structure of the memory subsystems, as well as the architecture of the computing array, to supply the required data with minimal performance overhead. We explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wider (in earlier layers of many CNNs), while the input data locality can be utilized maximally when the number of output channels is larger (in later layers).
Abstract: The general m-machine permutation flowshop problem with the total flow-time objective is known to be NP-hard for m ≥ 2. The only practical methods for finding optimal solutions have been branch-and-bound algorithms. In this paper, we present an improved sequential algorithm which is based on a strict alternation of Generation and Exploration execution modes as well as Depth-First/Best-First hybrid strategies. The experimental results show that the proposed scheme exhibits improved performance compared with the algorithm in [1]. More importantly, our method can be easily extended and implemented with lightweight threads to reduce execution time. Good speedups can be obtained on shared-memory multicore systems.
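The objective being minimized can be made concrete with a short sketch. This is not the branch-and-bound algorithm from the paper: it only evaluates the total flow time of a given job order and, for tiny instances, finds an optimum by exhaustive enumeration, which is the baseline that branch-and-bound prunes. The function names and the sample processing times are hypothetical.

```python
from itertools import permutations


def total_flow_time(order, proc):
    """Sum of job completion times on the last machine.

    proc[j][i] is the processing time of job j on machine i; every job
    visits the machines in the same order (permutation flowshop).
    """
    machines = len(proc[0])
    finish = [0] * machines  # when each machine finished its previous job
    total = 0
    for j in order:
        prev = 0  # completion time of job j on the previous machine
        for i in range(machines):
            start = max(finish[i], prev)
            prev = start + proc[j][i]
            finish[i] = prev
        total += prev
    return total


def best_order(proc):
    """Exhaustive search over all job orders; feasible only for small n."""
    jobs = range(len(proc))
    return min(permutations(jobs), key=lambda o: total_flow_time(o, proc))


proc = [[3, 2], [1, 4], [2, 1]]  # 3 jobs, 2 machines (made-up data)
order = best_order(proc)
print(order, total_flow_time(order, proc))
```

Enumeration costs n! evaluations, which is why bounding and hybrid depth-first/best-first exploration matter for anything beyond a handful of jobs.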