In China, many students are unable to do experiments in computer architecture courses, even though such experiments are very important in helping them understand many key points. The reason is that the required hardware costs too much. Besides, it is very difficult to carry out research in hardware experiments. In our course, we adopted an alternative approach: we use software simulators and have designed a set of virtual experiments based on them, which are described in detail in this paper.
Introduction: Research on computer architecture and systems is typically driven by technology and applications. Big data has emerged as an important application domain with a huge impact on scientific research, business, and society. Big data is known for its large volume, high velocity, and variety of formats. The collection, storage, retrieval, processing, and visualization of big data pose many challenges to computer architecture and systems. This special section is an effort to encourage and promote research that addresses big data challenges from the computer architecture and systems perspectives.
The flexibility of traditional image processing systems is limited because those systems are designed for specific applications. In this paper, a new TMS320C64x-based multi-DSP parallel computing architecture is presented. It has many promising characteristics, such as powerful computing capability, broad I/O bandwidth, topology flexibility, and expansibility. The performance of the parallel system is evaluated by practical experiment.
A parallel architecture for efficient hardware implementation of Rivest-Shamir-Adleman (RSA) cryptography is proposed. A residue number system (RNS) is introduced to realize high parallelism: all the elements under the same base are independent of each other and can be computed in parallel. Moreover, a simple and fast base transformation is used to achieve an RNS Montgomery modular multiplication algorithm, which facilitates hardware implementation. Based on transport triggered architecture (TTA), the proposed architecture is designed to evaluate the performance and feasibility of the algorithm. With these optimizations, a decryption rate of 106 kbps can be achieved for 1024-bit RSA at a frequency of 100 MHz.
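The independence of RNS channels mentioned in this abstract can be illustrated with a minimal Python sketch. It shows only plain RNS multiplication with Chinese Remainder Theorem reconstruction, not the paper's Montgomery multiplication or base transformation; the moduli in `BASE` and all function names are hypothetical choices made for illustration.

```python
from math import prod

# Hypothetical, tiny pairwise-coprime moduli chosen for readability.
# A 1024-bit design would need many much wider moduli.
BASE = (13, 17, 19, 23)


def to_rns(x, base=BASE):
    """Represent x by its residue in every channel of the base."""
    return tuple(x % m for m in base)


def rns_mul(a, b, base=BASE):
    """Multiply channel by channel; no channel depends on another,
    so in hardware each product could be computed in parallel."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, base))


def from_rns(residues, base=BASE):
    """Reconstruct the integer with the Chinese Remainder Theorem.
    The result is only unique modulo the product of the moduli."""
    big_m = prod(base)
    total = 0
    for r, m in zip(residues, base):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m
```

For example, `from_rns(rns_mul(to_rns(123), to_rns(456)))` recovers `123 * 456` because the product is smaller than the dynamic range `13 * 17 * 19 * 23`.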
A single-chip multiprocessor (CMP) combined with fault-tolerant (FT) techniques offers an ideal architecture for achieving high availability while sustaining high computing performance. The FT design of a single-chip multiprocessor is described, including techniques ranging from hardware redundancy to software support and firmware strategy. The design aims at masking the influence of errors and automatically correcting the system states.
Ray tracing is a computer graphics method that renders images realistically. As the name suggests, this technique primarily traces the path of light rays interacting with objects in a scene [1], permitting the calculation of lighting and reflection effects [2]. Because ray tracing is a time-consuming process, the need for parallelization arises. One downside of this solution is the existence of race conditions. In this work, we explore and experiment with different well-known solutions to this race condition. The introduction and background sections give a brief overview of the topic, followed by a detailed account of how race conditions may occur in the ray tracing algorithm. In the methods and results sections, we use OpenMP to parallelize the ray tracing algorithm with the compiler directives critical and atomic and the data-sharing clause firstprivate. We conclude that neither critical nor atomic is an efficient solution for producing a good-quality picture, whereas firstprivate succeeded in producing a high-quality picture.
A high performance computer (HPC) is a huge, complex system whose architecture design faces increasing difficulties and risks. Traditional methods, such as theoretical analysis, component-level simulation, and sequential simulation, are not applicable to system-level simulation of HPC systems. Even parallel simulation on large-scale parallel machines has many difficulties in scalability, reliability, generality, and efficiency. To meet the current needs of HPC architecture design, this paper proposes a system-level parallel simulation platform, ArchSim. We first introduce the architecture of the ArchSim simulation platform, which is composed of a global server (GS), local server agents (LSA), and entities. Second, we emphasize some key techniques of ArchSim, including the synchronization protocol, the communication mechanism, and the distributed checkpointing/restart mechanism. We then make a synthesized test of the main performance indices of ArchSim with the phold benchmark and analyze the extra overhead generated by ArchSim. Finally, based on ArchSim, we construct a parallel event-driven interconnection network simulator and a system-level simulator for a small-scale HPC system with 256 processors. The results of the performance test and the HPC system simulations demonstrate that ArchSim achieves a high speedup ratio and high scalability on a parallel host machine and supports system-level simulation for the architecture design of HPC systems.
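To make the event-driven workload behind a phold-style test concrete, here is a minimal sequential sketch in Python. It is not ArchSim's kernel and has no synchronization protocol, communication mechanism, or checkpointing; the function name, parameters, and the exponential delay distribution are assumptions chosen for illustration. Each processed event schedules exactly one new event at a randomly chosen logical process, so the event population stays constant.

```python
import heapq
import random


def run_phold(num_lps=4, events_per_lp=2, end_time=50.0, seed=0):
    """Sequential PHOLD-like loop: a fixed population of events hops
    between logical processes (LPs) until simulated time end_time."""
    rng = random.Random(seed)
    queue = []  # min-heap of (timestamp, sequence number, target LP)
    seq = 0     # tie-breaker so equal timestamps pop in insertion order
    for lp in range(num_lps):
        for _ in range(events_per_lp):
            heapq.heappush(queue, (rng.expovariate(1.0), seq, lp))
            seq += 1

    processed = [0] * num_lps
    now = 0.0
    while queue and queue[0][0] <= end_time:
        now, _, lp = heapq.heappop(queue)
        processed[lp] += 1
        # Handling an event schedules one new event at a random LP.
        target = rng.randrange(num_lps)
        heapq.heappush(queue, (now + rng.expovariate(1.0), seq, target))
        seq += 1
    return processed, len(queue), now
```

In a parallel simulator the LPs would be spread over host processors, and the interesting cost is keeping their local clocks consistent; this sketch only shows the workload shape that such a platform has to execute.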
The paper presents the conceptual and operational basis for creating an intelligent decision support system (IDSS), based on our recent research experience. In this paper, an IDSS is defined as any interactive system that is specially designed to improve the decision making of its user by extending the user's cognitive decision-making abilities. This view of a man-machine joint cognitive system stresses the need to use computational technology to aid the user in the decision-making process, while the human's role is to achieve the total system's objectives. The paper outlines the design procedure in successive steps. First, the decision maker's cognitive needs for decision support are identified. Second, the computationally realizable support functions that the IDSS could provide are defined. Then, the specific techniques that would best fill the decision needs are discussed. Finally, the modern computational technology infrastructure for system implementation is emphasized.
A new type of embedded signal processing system based on a packet-switched network is realized. According to the application field and the characteristics of the signal processing system, the RapidIO protocol is used to provide high-speed interconnection among multiple digital signal processors (DSPs). Based on this protocol, a crossbar switch module that interconnects the DSPs in the system is introduced. A routing strategy, together with flow control and error control rules that adapt to different RapidIO network topologies, is also introduced. Crossbar switch performance is analyzed in detail with a probability model. The study of the crossbar switch technique and the analysis of system performance are significant for building general-purpose signal processing systems.
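The abstract does not state which probability model is used for the crossbar analysis. As a hedged illustration only, the following Python sketch uses a common textbook model: an N x N crossbar in which each input independently issues a request with probability p to a uniformly chosen output, and each output serves at most one request per cycle. The function names are hypothetical and the model is an assumption, not the paper's method.

```python
def crossbar_throughput(num_ports, request_prob):
    """Expected number of outputs served per cycle in an N x N crossbar
    under independent, uniformly distributed requests (textbook model)."""
    if num_ports < 1 or not 0.0 <= request_prob <= 1.0:
        raise ValueError("need num_ports >= 1 and 0 <= request_prob <= 1")
    # Probability that a given output is requested by at least one input.
    output_busy = 1.0 - (1.0 - request_prob / num_ports) ** num_ports
    return num_ports * output_busy


def acceptance_probability(num_ports, request_prob):
    """Fraction of issued requests that win arbitration in the same cycle."""
    if request_prob == 0.0:
        return 1.0  # no requests are issued, so none are rejected
    offered = num_ports * request_prob
    return crossbar_throughput(num_ports, request_prob) / offered
```

Under this model a fully loaded 2 x 2 switch serves 1.5 requests per cycle on average, so a quarter of the requests are blocked by output contention.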
The CORBA technique integrates object-oriented concepts with distributed computing. It can make applications in distributed heterogeneous environments reusable, portable, and interoperable. In this paper, the distributed simulation objects (DSO) are identified and the synchronization mechanism among them is discussed; the architecture of CORBA-based discrete event simulation systems is then presented and the DSO interface is defined.
There has been increasing interest in integrating decision support systems (DSS) and expert systems (ES) to provide decision makers with a more accessible, productive, and domain-independent information and computing environment. This paper aims at designing a multiple expert systems integrated decision support system (MESIDSS) to enhance decision makers' ability in more complex cases. The basic framework, the management system of multiple ESs, and the functions of MESIDSS are presented. The applications of MESIDSS in large-scale decision-making processes are discussed from the following aspects: problem decomposition, dynamic combination of multiple ESs, linking of multiple bases, and decision coordination. Finally, a summary and some ideas for the future are presented.
Component-based software reuse (CBSR) has been widely used in software development practice and has an even brighter future with the rapid expansion of the Internet, because the World Wide Web (WWW) makes large-scale component resources from different vendors available to software developers. In this paper, an abstract component model suitable for representing components on the WWW is proposed, which plays an important role in achieving interoperability both among components and among reusable component libraries (RCLs). Some necessary changes that the WWW brings to many aspects of component management are also discussed, such as the classification of components, the corresponding search methods, and the certification of components.
Vision systems that enable collision avoidance, localization and navigation in complex and uncertain environments are common in biology, but are extremely challenging to mimic in artificial electronic systems, in particular when size and power limitations apply. The development of neuromorphic electronic systems implementing models of biological sensory-motor systems in silicon is one promising approach to addressing these challenges. Concept learning is a central part of animal cognition that enables appropriate motor response in novel situations by generalization of former experience, possibly from a few examples. These aspects make concept learning a challenging and important problem. Learning methods in computer vision are typically inspired by mammals, but recent studies of insects motivate an interesting complementary research direction. There are several remarkable results showing that honeybees can learn to master abstract concepts, providing a road map for future work to allow direct comparisons between bio-inspired computing architectures and information processing in miniaturized “real” brains. Considering that the brain of a bee has less than 0.01% as many neurons as a human brain, the task of inferring a minimal architecture and mechanism of concept learning from studies of bees appears well motivated. The relatively low complexity of insect sensory-motor systems makes them an interesting model for the further development of bio-inspired computing architectures, in particular for resource-constrained applications such as miniature robots, wireless sensors and handheld or wearable devices. Work in that direction is a natural step towards understanding and making use of prototype circuits for concept learning, which eventually may also help us to understand the more complex learning circuits of the human brain. By adapting concept learning mechanisms to a polymorphic computing framework we could possibly create large-scale decentralized computer vision systems, for example in the form of wireless sensor networks.
Graph processing is a vital component of many AI and big data applications. However, due to its poor locality and complex data access patterns, graph processing is also a known performance killer of AI and big data applications. In this work, we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations. We first identify that memory accesses to the offset, edge, and state arrays have distinct locality and impact on performance. We then introduce the Skyway architecture, which consists of two primary components: 1) a dedicated direct data path between the core and memory to transfer state array elements efficiently, and 2) data-type-aware, fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path. The proposed Skyway architecture improves overall performance by reducing memory access interference and improving data access efficiency with minimal overhead. We evaluate Skyway on a set of diverse algorithms using large real-world graphs. On a simulated four-core system, Skyway improves performance by 23% on average over the best-performing graph-specialized hardware optimizations.
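The distinction between the offset, edge, and state arrays can be seen in a minimal software sketch of a compressed sparse row (CSR) traversal. The Python below is an illustration only and has nothing to do with the Skyway hardware itself; the function name and the tiny example graph are hypothetical.

```python
def gather_neighbor_states(offsets, edges, state):
    """For each vertex, sum the state values of its out-neighbors.

    offsets: read sequentially, one pair of entries per vertex.
    edges:   streamed sequentially within each vertex's range.
    state:   indexed by neighbor id, so accesses jump around memory;
             this is the irregular access pattern of the three arrays.
    """
    result = []
    for v in range(len(offsets) - 1):
        acc = 0.0
        for e in range(offsets[v], offsets[v + 1]):
            acc += state[edges[e]]  # data-dependent, poor-locality access
        result.append(acc)
    return result


# Hypothetical 4-vertex graph: 0 -> {1, 2}, 1 -> {3}, 2 -> {0, 3}, 3 -> {}.
OFFSETS = [0, 2, 3, 5, 5]
EDGES = [1, 2, 3, 0, 3]
STATE = [1.0, 2.0, 3.0, 4.0]
```

Calling `gather_neighbor_states(OFFSETS, EDGES, STATE)` walks `OFFSETS` and `EDGES` in order, while the order of reads from `STATE` is dictated by the graph structure.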
The instruction fetch unit (IFU) usually dissipates a considerable portion of total chip power. In traditional IFU architectures, as soon as the fetch address is generated, it needs to be sent to the instruction cache and TLB arrays for instruction fetch. Since the power-saving logic can do only limited work between fetch address generation and instruction fetch, previous power-saving approaches usually suffer from unnecessary restrictions imposed by traditional IFU architectures. In this paper, we present CASA, a new power-aware IFU architecture, which effectively reduces the unnecessary restrictions on power-saving approaches and provides sufficient time and information for the power-saving logic of both the instruction cache and the TLB. By analyzing, recording, and utilizing the key information of the dynamic instruction flow early in the front-end pipeline, CASA brings the opportunity to maximize power efficiency and minimize performance overhead. Compared to the baseline configuration, the leakage and dynamic power of the instruction cache are reduced by 89.7% and 64.1% respectively, and the dynamic power of the instruction TLB is reduced by 90.2%. Meanwhile, the performance degradation in the worst case is only 0.63%. Compared to previous state-of-the-art power-saving approaches, the CASA-based approach saves IFU power more effectively, incurs less performance overhead, and achieves better scalability. We expect that CASA can stimulate further work on architectural solutions to power-efficient IFU designs.
CPU and System-on-Chip (SoC) are two key technologies of the IT industry. Over ten years of research, we have defined the UniCore instruction set architecture and designed the UniCore CPU and the PKUnity SoC family. This cross-disciplinary practice has also fostered many innovations in microprocessor architecture, optimizing compilers, low power design, functional verification, physical design, and so on. In the meantime, we have put technology transfer on the list of our top priorities. This effort has led to several marketable products, such as ultra mobile personal computers, secure micro-workstations, and 3C-converged consumer electronics. The development of the next generation products, the 64-bit multi-core CPU and SoC, is also underway. They will find applications in secure and adaptable computers for mobile and desktop use, as well as in personal digital multimedia devices. Consistent with our philosophy and long-term plan, and by leveraging cutting-edge process technology, we will continue to innovate in CPUs and SoCs and strengthen our commitment to technology transfer.
Graphics processing is an increasingly important application domain, driven by the demand for real-time rendering, video streaming, virtual reality, and so on. Illumination is a critical module in graphics rendering and is typically compute-bound, memory-bound, or power-bound in different application cases. It is crucial to decide how to schedule illumination algorithms with different features according to practical requirements in reconfigurable graphics hardware. This paper analyzes the performance characteristics of four mainstream lighting algorithms, the Lambert, Phong, Blinn-Phong, and Cook-Torrance illumination algorithms, using hardware performance counters on the x86 processor platform Kaby Lake (KBL). The data movement, computation, power consumption, and memory accesses are evaluated over a range of application scenarios. Further, the pros and cons of these algorithms are obtained by analyzing their system-level behavior. The relationship between performance/energy and the evaluated metrics is analyzed through Pearson correlation coefficient (PCC) analysis. Based on these performance characterization data, this paper presents some reconfiguration suggestions for reconfigurable graphics processors.
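To show why these lighting algorithms differ in computational cost, the following Python sketch gives the standard textbook forms of the Lambert diffuse term and the Phong and Blinn-Phong specular terms. It is not the paper's benchmark code; the function names and the tuple-based vector representation are assumptions, and Cook-Torrance is omitted because it needs additional material parameters.

```python
import math


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v) if length > 0.0 else tuple(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lambert(normal, to_light):
    """Diffuse term: cosine of the angle between normal and light direction."""
    return max(0.0, dot(normalize(normal), normalize(to_light)))


def phong_specular(normal, to_light, to_viewer, shininess):
    """Phong: compare the mirrored light direction with the view direction."""
    n, l, v = normalize(normal), normalize(to_light), normalize(to_viewer)
    n_dot_l = dot(n, l)
    if n_dot_l <= 0.0:
        return 0.0  # light is behind the surface
    reflected = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    return max(0.0, dot(reflected, v)) ** shininess


def blinn_phong_specular(normal, to_light, to_viewer, shininess):
    """Blinn-Phong: use the half vector instead of the reflection vector."""
    n, l, v = normalize(normal), normalize(to_light), normalize(to_viewer)
    if dot(n, l) <= 0.0:
        return 0.0
    half = normalize(tuple(lc + vc for lc, vc in zip(l, v)))
    return max(0.0, dot(n, half)) ** shininess
```

Lambert needs a single dot product per sample, while the two specular models add a reflection or half-vector computation and an exponentiation, which is one reason the algorithms stress hardware differently.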
Nondeterminism of PROLOG execution requires that a block of control information, or choice point, be stored for each procedure call when there are other candidate clauses to be used. When the currently selected clause fails, the bindings made by the clause must be undone, the stored choice point is reactivated, and another of the candidate clauses is chosen to run. Storing and reactivating choice points and undoing bindings account for the great overhead required to control PROLOG execution, which is quite different from conventional programs. This paper focuses on the techniques used in the Sequential PROLOG Engine (SPE) to reduce the overhead of control operations. The control instructions of SPE store no more choice points than necessary. Its architecture analyses the potential parallelism in the control operations and exploits a fraction of it for cost-effectiveness reasons. The results of executing two sample programs on SPE, in the form of hand timings, are presented and favor the approach.
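The control operations described above (storing a choice point, undoing bindings on failure, and reactivating the choice point) can be sketched in Python as follows. This is a simplified backtracking search over finite candidate lists, not a PROLOG engine and not SPE's instruction set; the function name and the `consistent` callback are hypothetical, and unlike SPE the sketch stores a choice point after every successful binding.

```python
def first_solution(variables, domains, consistent):
    """Depth-first search with an explicit trail and choice-point stack."""
    bindings = {}
    trail = []          # variables in the order they were bound
    choice_points = []  # (trail mark, variable index, remaining candidates)
    index = 0
    alternatives = iter(domains[variables[0]]) if variables else iter(())

    while index < len(variables):
        var = variables[index]
        advanced = False
        for value in alternatives:
            bindings[var] = value
            trail.append(var)
            if consistent(bindings):
                # Remember where to resume if a later goal fails.
                choice_points.append((len(trail) - 1, index, alternatives))
                index += 1
                if index < len(variables):
                    alternatives = iter(domains[variables[index]])
                advanced = True
                break
            del bindings[trail.pop()]  # undo the failed binding
        if not advanced:
            if not choice_points:
                return None  # every candidate has been tried
            # Reactivate the newest choice point and unwind the trail to it.
            mark, index, alternatives = choice_points.pop()
            while len(trail) > mark:
                del bindings[trail.pop()]
    return dict(bindings)
```

The trail mark stored in each choice point is what lets the loop undo exactly the bindings made since that point, which is the overhead the abstract refers to.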
Multithreading is the developing trend in high performance processors. The memory consistency model is essential to the correctness, performance, and complexity of a multithreaded processor. This paper proposes a chip multithreaded consistency model suited to multithreaded processors. The restrictions that chip multithreaded consistency imposes on memory event ordering are presented and formalized. Using the idea of the critical cycle developed by Wei-Wu Hu, we prove that the proposed chip multithreaded consistency model satisfies the criterion of correct execution of the sequential consistency model. The chip multithreaded consistency model provides a way to achieve higher performance than the sequential consistency model and ensures software compatibility, in that the execution result on a multithreaded processor is the same as that on a uniprocessor. The implementation strategy of the chip multithreaded consistency model in the Godson-2 SMT processor is also proposed. The Godson-2 SMT processor supports the chip multithreaded consistency model correctly through an exception scheme based on the sequential memory access queue of each thread.
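As background for the sequential consistency criterion mentioned above, the following Python sketch enumerates every interleaving of two short threads that respects each thread's program order and collects the possible results. It illustrates sequential consistency only, using the well-known store-buffering litmus test; it does not model the proposed chip multithreaded consistency model or the Godson-2 SMT implementation, and all names are hypothetical.

```python
from itertools import combinations


def run_interleaving(thread_a, thread_b, a_slots):
    """Execute one interleaving; a_slots holds the steps taken by thread A."""
    memory = {"x": 0, "y": 0}
    registers = {}
    next_a = next_b = 0
    for step in range(len(thread_a) + len(thread_b)):
        if step in a_slots:
            kind, location, operand = thread_a[next_a]
            next_a += 1
        else:
            kind, location, operand = thread_b[next_b]
            next_b += 1
        if kind == "store":
            memory[location] = operand
        else:  # "load": operand names the destination register
            registers[operand] = memory[location]
    return registers


def sc_outcomes(thread_a, thread_b):
    """All (r1, r2) results allowed when both program orders are kept."""
    total = len(thread_a) + len(thread_b)
    outcomes = set()
    for a_slots in combinations(range(total), len(thread_a)):
        regs = run_interleaving(thread_a, thread_b, set(a_slots))
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes


# Store-buffering litmus test: each thread writes one flag, then reads the other.
THREAD_A = [("store", "x", 1), ("load", "y", "r1")]
THREAD_B = [("store", "y", 1), ("load", "x", "r2")]
```

Under sequential consistency the result `r1 == 0 and r2 == 0` never appears for this test, so observing it on real hardware indicates a weaker ordering; a consistency model that claims the same results as sequential consistency has to rule it out.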
This paper introduces a heterogeneous distributed database system called the LSZ system, where LSZ is an abbreviation of Li Shizhen, an ancient Chinese medical scientist. The LSZ system adopts a cluster as each distributed database node (or site). Each cluster consists of one or several microcomputers and one server. The paper describes the basic architecture and the prototype implementation, which includes query processing and optimization, the transaction manager, and data language translation. The system provides a uniform retrieval and update user interface through the global relational data language GRDL.
Funding (TMS320C64x-based multi-DSP architecture paper): This project was supported by the National Natural Science Foundation of China (60135020).
Funding (RNS-based RSA architecture paper): Supported by the Natural Science Foundation of Tianjin (No. 11JCZDJC15800) and the National Natural Science Foundation of China (No. 61003306).
Funding (fault-tolerant single-chip multiprocessor paper): Supported by the National High Technology Development 863 Program of China (2002AA1Z030) and the China Postdoctoral Science Foundation (2003034151).
Funding (ArchSim paper): Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2007AA01Z117 and the National Basic Research 973 Program of China under Grant No. 2007CB310900.
Funding (CORBA-based simulation paper): This project was supported by the "Ninth Five" National Defense Sustentation Foundation.
文摘The CORBA technique is an integration of the object-oriented conception and distributed computing technique. It can make the application within distributed heterogeneous environments reusable, portable and interoperable. The architecture of CORBA-based discrete event simulation systems is presented and the interface of distributed simulation objects (DSO) is defined in this paper after the DSO is identified and the synchronization mechanism among DSO is discussed.
文摘There has been an increasing interest in integrating decision support systems (DSS) and expert systems (ES) to provide decision makers a more accessible, productive and domain-independent information and computing environment. This paper is aimed at designing a multiple expert systems integrated decision support system (MESIDSS) to enhance decision makers' ability in more complex cases. The basic framework, management system of multiple ESs, and functions of MESIDSS are presented. The applications of MESIDSS in large-scale decision making processes are discussed from the following aspects of problem decomposing, dynamic combination of multiple ESs, link of multiple bases and decision coordinating. Finally, a summary and some ideas for the future are presented.
Abstract: Component-based software reuse (CBSR) has been widely used in software development practice and has an even brighter future with the rapid expansion of the Internet, because the World Wide Web (WWW) makes large-scale component resources from different vendors available to software developers. In this paper, an abstract component model suitable for representing components on the WWW is proposed, which plays an important role in achieving interoperability both among components and among reusable component libraries (RCLs). Some necessary changes that the WWW brings to component management are also discussed, such as the classification of components, the corresponding search methods, and the certification of components.
Funding: Partially supported by the Swedish Foundation for International Cooperation in Research and Higher Education (STINT), grant number IG2011-2025, and by ARC DP0878968/DP0987989.
Abstract: Vision systems that enable collision avoidance, localization and navigation in complex and uncertain environments are common in biology, but are extremely challenging to mimic in artificial electronic systems, in particular when size and power limitations apply. The development of neuromorphic electronic systems implementing models of biological sensory-motor systems in silicon is one promising approach to addressing these challenges. Concept learning is a central part of animal cognition that enables appropriate motor response in novel situations by generalization of former experience, possibly from a few examples. These aspects make concept learning a challenging and important problem. Learning methods in computer vision are typically inspired by mammals, but recent studies of insects motivate an interesting complementary research direction. There are several remarkable results showing that honeybees can learn to master abstract concepts, providing a road map for future work to allow direct comparisons between bio-inspired computing architectures and information processing in miniaturized "real" brains. Considering that the brain of a bee has less than 0.01% as many neurons as a human brain, the task of inferring a minimal architecture and mechanism of concept learning from studies of bees appears well motivated. The relatively low complexity of insect sensory-motor systems makes them an interesting model for the further development of bio-inspired computing architectures, in particular for resource-constrained applications such as miniature robots, wireless sensors and handheld or wearable devices. Work in that direction is a natural step towards understanding and making use of prototype circuits for concept learning, which eventually may also help us to understand the more complex learning circuits of the human brain.
By adapting concept learning mechanisms to a polymorphic computing framework we could possibly create large-scale decentralized computer vision systems, for example in the form of wireless sensor networks.
Funding: Supported in part by the U.S. National Science Foundation under Grant Nos. CCF-2008907 and CCF-2029014, the Chinese Academy of Sciences Project for Young Scientists in Basic Research under Grant No. YSBR-029, and the Chinese Academy of Sciences Project for Youth Innovation Promotion Association.
Abstract: Graph processing is a vital component of many AI and big data applications. However, due to its poor locality and complex data access patterns, graph processing is also a known performance killer of AI and big data applications. In this work, we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations. We first identify that memory accesses to the offset, edge, and state arrays have distinct locality and impact on performance. We then introduce the Skyway architecture, which consists of two primary components: 1) a dedicated direct data path between the core and memory to transfer state array elements efficiently, and 2) data-type-aware, fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path. The proposed Skyway architecture improves overall performance by reducing memory access interference and improving data access efficiency with minimal overhead. We evaluate Skyway on a set of diverse algorithms using large real-world graphs. On a simulated four-core system, Skyway improves performance by 23% on average over the best-performing graph-specialized hardware optimizations.
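To make the three arrays named in the abstract concrete, the following is a minimal sketch of a compressed sparse row (CSR) graph traversal in Python. The array names `offset`, `edge`, and `state` follow the abstract; the tiny graph, the `push_step` function, and the comments on access patterns are illustrative assumptions, not the Skyway design or its benchmark code.

```python
# Hypothetical 4-vertex graph in CSR form (values invented for illustration).
offset = [0, 2, 3, 5, 5]       # offset[v]..offset[v+1] delimits v's out-edges
edge = [1, 2, 2, 0, 3]         # destination vertex ids, stored contiguously
state = [1.0, 0.0, 0.0, 0.0]   # one value per vertex


def push_step(offset, edge, state):
    """One push-style iteration: each vertex splits its state among its out-neighbours."""
    new_state = [0.0] * len(state)
    for v in range(len(state)):
        begin, end = offset[v], offset[v + 1]    # offset array: read in vertex order
        degree = end - begin
        for i in range(begin, end):              # edge array: read as a contiguous run
            dst = edge[i]
            new_state[dst] += state[v] / degree  # state array: indexed by dst, so
                                                 # accesses jump around in memory
    return new_state


result = push_step(offset, edge, state)
```

In this sketch the offset and edge arrays are walked in order, while updates to the state array depend on the edge contents, which is one way the three arrays can end up with different locality.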
Funding: Supported by the National High Technology Development 863 Program of China under Grant No. 2004AAIZ1010.
Abstract: The instruction fetch unit (IFU) usually dissipates a considerable portion of total chip power. In traditional IFU architectures, as soon as the fetch address is generated, it needs to be sent to the instruction cache and TLB arrays for instruction fetch. Since only limited work can be done by the power-saving logic between fetch address generation and instruction fetch, previous power-saving approaches usually suffer from unnecessary restrictions imposed by traditional IFU architectures. In this paper, we present CASA, a new power-aware IFU architecture, which effectively reduces these restrictions and provides sufficient time and information for the power-saving logic of both the instruction cache and the TLB. By analyzing, recording, and utilizing the key information of the dynamic instruction flow early in the front-end pipeline, CASA brings the opportunity to maximize power efficiency and minimize performance overhead. Compared to the baseline configuration, the leakage and dynamic power of the instruction cache are reduced by 89.7% and 64.1% respectively, and the dynamic power of the instruction TLB is reduced by 90.2%. Meanwhile, the performance degradation in the worst case is only 0.63%. Compared to previous state-of-the-art power-saving approaches, the CASA-based approach saves IFU power more effectively, incurs less performance overhead and achieves better scalability. CASA may stimulate further work on architectural solutions for power-efficient IFU designs.
Funding: Supported by the National High Technology Research and Development 863 Program of China under Grant Nos. 2002AA1Z1010, 2003AA1Z1010, 2004AA1Z1010 and 2006AA010202.
Abstract: The CPU and the System-on-Chip (SoC) are two key technologies of the IT industry. Over ten years of research, we have defined the UniCore instruction set architecture and designed the UniCore CPU and the PKUnity SoC family. This cross-disciplinary practice has also fostered many innovations in microprocessor architecture, optimizing compilers, low-power design, functional verification, physical design, and so on. In the meantime, we have put technology transfer on the list of our top priorities. This effort has led to several marketable products, such as ultra-mobile personal computers, secure micro-workstations and 3C-converged consumer electronics. The development of the next-generation products, the 64-bit multi-core CPU and SoC, is also underway. They will find applications in secure and adaptable computers for mobile and desktop use, as well as in personal digital multimedia devices. Consistent with our philosophy and long-term plan, and by leveraging cutting-edge process technology, we will continue to innovate in CPUs and SoCs and strengthen our commitment to technology transfer.
Funding: Supported by the National Natural Science Foundation of China (61602377, 61834005, 61772417, 61802304, 61874087, 61634004) and the International Science and Technology Cooperation Program of Shaanxi (2018KW006).
Abstract: Graphics processing is an increasingly important application domain, driven by the demand for real-time rendering, video streaming, virtual reality, and so on. Illumination is a critical module in graphics rendering and is typically compute-bound, memory-bound, or power-bound in different application cases. It is crucial to decide how to schedule illumination algorithms with different features according to the practical requirements of reconfigurable graphics hardware. This paper analyzes the performance characteristics of four mainstream lighting algorithms, namely the Lambert, Phong, Blinn-Phong, and Cook-Torrance illumination algorithms, using hardware performance counters on the x86 processor platform KabyLake (KBL). Data movement, computation, power consumption, and memory accesses are evaluated over a range of application scenarios. Further, by analyzing the system-level behavior of these illumination algorithms, the pros and cons of each algorithm are obtained. The relationship between performance/energy and the evaluated metrics is analyzed through Pearson correlation coefficient (PCC) analysis. Based on these performance characterization data, this paper presents some reconfiguration suggestions for reconfigurable graphics processors.
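For readers unfamiliar with the lighting models named above, here is a minimal Python sketch of the textbook Lambert diffuse term and the Phong and Blinn-Phong specular terms. It uses the standard formulations and is not the paper's benchmark code; the function names and the `shininess` parameter are illustrative assumptions, all vectors are assumed to be unit length, and Cook-Torrance is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def lambert(n, l):
    """Diffuse term: cosine of the angle between normal n and light direction l."""
    return max(0.0, dot(n, l))


def phong_specular(n, l, v, shininess):
    """Phong specular term: reflect l about n, then compare with view direction v."""
    r = tuple(2.0 * dot(n, l) * nc - lc for nc, lc in zip(n, l))
    return max(0.0, dot(r, v)) ** shininess


def blinn_phong_specular(n, l, v, shininess):
    """Blinn-Phong specular term: uses the half vector between l and v instead of r."""
    h = normalize(tuple(lc + vc for lc, vc in zip(l, v)))
    return max(0.0, dot(n, h)) ** shininess


# Example inputs (hypothetical): surface facing +z, light at 45 degrees, viewer on +z.
normal = (0.0, 0.0, 1.0)
light = normalize((1.0, 0.0, 1.0))
view = (0.0, 0.0, 1.0)
diffuse = lambert(normal, light)
```

The sketch shows why the models differ in cost: Phong computes a reflection vector per evaluation, whereas Blinn-Phong needs a normalization (a square root and a division) for the half vector.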
Funding: SPE is partly supported by the National Natural Science Foundation of China.
Abstract: The nondeterminism of PROLOG execution requires that a block of control information, or choice point, be stored for each procedure call when there are other candidate clauses to be used. When the currently selected clause fails, the bindings made by the clause must be undone, the stored choice point is reactivated, and another of the candidate clauses is chosen to run. Storing and reactivating choice points and undoing bindings account for the large overhead required to control PROLOG execution, which is quite different from that of conventional programs. This paper focuses on the techniques used in the Sequential PROLOG Engine (SPE) to reduce the overhead of control operations. The control instructions of SPE store no more choice points than necessary. Its architecture analyses the potential parallelism in the control operations and exploits a fraction of it on cost-effectiveness grounds. The results of executing two sample programs on SPE, in the form of hand timings, are presented and favor the approach.
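The control operations described in the abstract (storing a choice point, undoing bindings, reactivating the choice point) can be sketched in Python as follows. This is a simplified illustration under assumed names (`Engine`, `call`, `bind`, `backtrack`); it is not SPE's instruction set or hardware mechanism, and it models clauses as opaque labels.

```python
class Engine:
    """Toy model of PROLOG control state: bindings, a trail, and a choice-point stack."""

    def __init__(self):
        self.bindings = {}        # variable name -> value
        self.trail = []           # variables bound so far, in binding order
        self.choice_points = []   # stack of (trail_mark, remaining_clauses)

    def bind(self, var, value):
        self.bindings[var] = value
        self.trail.append(var)    # remember the binding so it can be undone

    def call(self, clauses):
        """Select the first clause; store a choice point only if alternatives remain."""
        first, rest = clauses[0], clauses[1:]
        if rest:
            self.choice_points.append((len(self.trail), rest))
        return first

    def backtrack(self):
        """Undo bindings made since the newest choice point and pick the next clause."""
        if not self.choice_points:
            return None           # no alternatives left: the query fails
        mark, alternatives = self.choice_points.pop()
        while len(self.trail) > mark:
            del self.bindings[self.trail.pop()]
        return self.call(alternatives)


# Example run with three hypothetical candidate clauses.
engine = Engine()
first = engine.call(["c1", "c2", "c3"])
engine.bind("X", 1)
second = engine.backtrack()       # undoes X = 1, reactivates the choice point
```

Because `call` pushes nothing when only one clause is left, the last alternative runs without a choice point, which mirrors the idea of storing no more choice points than necessary.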
Funding: Supported by the National High Technology Development 863 Program of China (Grant Nos. 2007AA01Z114, 2006AA010201), the National Natural Science Foundation of China (Grant Nos. 60703017, 60736012, 60325205, 60673146, 60603049), the National Grand Fundamental Research 973 Program of China (Grant Nos. 2005CB321601, 2005CB321603), and the Beijing Natural Science Foundation (Grant No. 4072024).
Abstract: Multithreading is the developing trend in high-performance processors. The memory consistency model is essential to the correctness, performance and complexity of a multithreaded processor. This paper proposes a chip multithreaded consistency model adapted to multithreaded processors. The restrictions that chip multithreaded consistency imposes on memory event ordering are presented and formalized. Using the idea of the critical cycle introduced by Wei-Wu Hu, we prove that the proposed chip multithreaded consistency model satisfies the criterion of correct execution of the sequential consistency model. The chip multithreaded consistency model offers a way of achieving higher performance than the sequential consistency model and ensures software compatibility, in that the execution result on a multithreaded processor is the same as the execution result on a uniprocessor. An implementation strategy for the chip multithreaded consistency model in the Godson-2 SMT processor is also proposed. The Godson-2 SMT processor supports the chip multithreaded consistency model correctly through an exception scheme based on the sequential memory access queue of each thread.
Funding: This research is supported by the National Natural Science Foundation under Grant 6863051.
Abstract: This paper introduces a heterogeneous distributed database system called the LSZ system, where LSZ is an abbreviation of Li Shizhen, an ancient Chinese medical scientist. The LSZ system adopts a cluster as the distributed database node (or site). Each cluster consists of one or several microcomputers and one server. The paper describes the basic architecture and the prototype implementation, which includes query processing and optimization, a transaction manager, and data language translation. The system provides a uniform retrieval and update user interface through the global relational data language GRDL.