Abstract: A 1.8-V 64-kb four-way set-associative CMOS cache memory, implemented in a 0.18-μm/1.8-V 1P6M logic CMOS technology for a high-performance 32-b RISC microprocessor, is presented. For comparison, a conventional parallel-access cache with the same storage capacity and organization is also designed and simulated in the same technology. Simulation results indicate that sequential access reduces power by 26% on a cache hit and 35% on a cache miss. High-speed techniques, including a modified current-mode sense amplifier and split dynamic tag comparators, are adopted to achieve fast data access. Simulation results indicate that a typical clock-to-data access time of 2.7 ns is achieved...
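The difference between parallel and sequential access can be illustrated with a small counting model. This is only a sketch under stated assumptions: it assumes "sequential access" means the tags are checked first and at most one data way is read afterwards, whereas parallel access reads all four data ways alongside the tags. The class name `FourWayCache`, the round-robin replacement policy, and the read counters are hypothetical illustration devices. The counts are not energy figures and do not reproduce the 26% and 35% results, which come from the authors' circuit simulation.

```python
WAYS = 4  # four-way set-associative, as in the abstract


class FourWayCache:
    """Toy model that counts array reads; names and policy are hypothetical."""

    def __init__(self, num_sets, sequential):
        self.num_sets = num_sets
        self.sequential = sequential
        # tags[set][way] holds the stored tag, or None if the way is empty
        self.tags = [[None] * WAYS for _ in range(num_sets)]
        self.next_victim = [0] * num_sets  # round-robin replacement (assumed)
        self.tag_reads = 0
        self.data_reads = 0

    def access(self, block_addr):
        """Look up one block address; return True on a hit."""
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        ways = self.tags[index]

        # Both schemes compare all four tags of the selected set.
        self.tag_reads += WAYS
        hit = tag in ways

        if self.sequential:
            # Assumed sequential scheme: read only the matching data way,
            # and read nothing from the data array on a miss.
            self.data_reads += 1 if hit else 0
        else:
            # Parallel scheme: all data ways are read before the hit is known.
            self.data_reads += WAYS

        if not hit:
            victim = self.next_victim[index]
            ways[victim] = tag
            self.next_victim[index] = (victim + 1) % WAYS
        return hit


def run_trace(trace, sequential, num_sets=16):
    """Replay a list of block addresses and return (hits, data_reads)."""
    cache = FourWayCache(num_sets, sequential)
    hits = sum(cache.access(addr) for addr in trace)
    return hits, cache.data_reads
```

For the trace `[0, 0, 16, 0]` with 16 sets, both schemes see the same two hits, but the parallel model performs 16 data-way reads while the sequential model performs 2. Fewer data-array activations is the mechanism this sketch assumes is behind the reported power reduction.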
Funding: Supported by research funding KYQD(ZR)1974 from Hainan University.
Abstract: Currently available compilation techniques target general-purpose computing and are not optimized for physical-layer computing in 5G micro base stations. In this setting, the foreseeable data sizes and small code size are application-specific opportunities for optimizing baseband algorithms. These opportunities deserve special attention; for example, a register allocation algorithm specific to this setting has not been studied so far. Our focus is the compilation of kernel subroutines of the baseband in 5G micro base stations. For applications with known and fixed data sizes, we propose a compilation scheme with parallel data access, in which operands are mainly allocated to and stored in registers. Based on a small register group (48×32 b), the scheme targets the optimization of baseband algorithms on 4×4 or smaller matrices, maximizing the utilization of the register file and eliminating extra data exchange between registers. Meanwhile, once data is allocated to the register file, we use a VLIW (Very Long Instruction Word) machine to hide data access time and minimize data access cost, so that total execution time is minimized. Experiments indicate that, for algorithms with small data sizes, the cost of data access and extra addressing can be minimized.
Abstract: Parallel sorting techniques on different architectures have been studied for many years. Given today's need for faster systems, parallelism can be used to accelerate applications. Parallel operations are now used to solve computing problems such as sorting and searching at a reasonable speed. Sorting is one of the most important operations in computing. Researchers seek the best results on several criteria, the foremost of which is speedup. In this paper, the authors present a sorting algorithm with O(log n) time complexity on a PRAM EREW (Parallel Random Access Machine, Exclusive Read Exclusive Write). The algorithm is designed to balance the number of processing elements in the architecture against execution time. Simulation of the algorithm confirms its theoretical analysis. The results of this research can be used in developing faster embedded systems. The Sorting on Centralized Diamond (SOCD) algorithm is presented on the novel Centralized Diamond architecture, which takes advantage of the Single Instruction Multiple Data (SIMD) architecture. This architecture and the sort on it are intuitive and optimal.
Funding: Research supported by the Science Foundation of Shandong Province.
Abstract: In this paper, a sequential algorithm for computing the all-vertex-pair distance matrix D and the path matrix P is given. On a PRAM EREW model with p processors, 1 ≤ p ≤ n^2, a parallel version of the sequential algorithm is shown. This method can also be used to obtain a parallel algorithm for computing the transitive closure array of an undirected graph. The time complexity of the parallel algorithm is O(n^3/p). If D, P, and the transitive closure array are known, it is shown that the problems of finding all connected components, computing the diameter of an undirected graph, determining the center of a directed graph, and searching for a directed cycle of minimum (maximum) length in a directed graph can all be solved in O(n^2/p + log p) time.
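As an illustration of what a distance matrix D and a path matrix P look like, here is a minimal sequential sketch. The abstract does not say which algorithm the authors use, so the classic Floyd-Warshall method is swapped in as an assumption. The function names, the next-hop encoding of P, and the use of `math.inf` for missing edges are likewise hypothetical choices, and non-negative edge weights are assumed.

```python
import math


def all_pairs_shortest_paths(weights):
    """Floyd-Warshall sketch (assumed stand-in for the paper's algorithm).

    weights[i][j] is the edge weight from i to j, or math.inf if no edge.
    Returns (D, P): D[i][j] is the shortest distance from i to j, and
    P[i][j] is the vertex that follows i on a shortest i -> j path
    (None if j is unreachable from i or i == j).
    """
    n = len(weights)
    D = [row[:] for row in weights]
    P = [
        [j if i != j and weights[i][j] != math.inf else None for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        D[i][i] = 0

    # Allow vertices 0..k as intermediate stops, one k at a time: O(n^3) work.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = D[i][k] + D[k][j]
                if through_k < D[i][j]:
                    D[i][j] = through_k
                    P[i][j] = P[i][k]
    return D, P


def reconstruct_path(P, i, j):
    """Follow next-hop entries in P to list the vertices from i to j."""
    if i == j:
        return [i]
    if P[i][j] is None:
        return []  # j is unreachable from i
    path = [i]
    while i != j:
        i = P[i][j]
        path.append(i)
    return path


def diameter(D):
    """Largest finite shortest-path distance in D."""
    return max(d for row in D for d in row if d != math.inf)
```

In this sketch, the n^2 updates for a fixed `k` do not depend on one another, so they could be divided among p ≤ n^2 processors with about n^2/p updates each, which is consistent with the O(n^3/p) bound stated in the abstract. How the authors arrange reads and writes to respect the EREW restriction is not described here. Once D is available, a quantity such as the diameter reduces to a maximum over n^2 entries.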