期刊文献+
共找到264篇文章
< 1 2 14 >
每页显示 20 50 100
Efficient cache replacement framework based on access hotness for spacecraft processors
1
作者 GAO Xin NIAN Jiawei +1 位作者 LIU Hongjin YANG Mengfei 《中国空间科学技术(中英文)》 CSCD 北大核心 2024年第2期74-88,共15页
A notable portion of cachelines in real-world workloads exhibits inner non-uniform access behaviors.However,modern cache management rarely considers this fine-grained feature,which impacts the effective cache capacity... A notable portion of cachelines in real-world workloads exhibits inner non-uniform access behaviors.However,modern cache management rarely considers this fine-grained feature,which impacts the effective cache capacity of contemporary high-performance spacecraft processors.To harness these non-uniform access behaviors,an efficient cache replacement framework featuring an auxiliary cache specifically designed to retain evicted hot data was proposed.This framework reconstructs the cache replacement policy,facilitating data migration between the main cache and the auxiliary cache.Unlike traditional cacheline-granularity policies,the approach excels at identifying and evicting infrequently used data,thereby optimizing cache utilization.The evaluation shows impressive performance improvement,especially on workloads with irregular access patterns.Benefiting from fine granularity,the proposal achieves superior storage efficiency compared with commonly used cache management schemes,providing a potential optimization opportunity for modern resource-constrained processors,such as spacecraft processors.Furthermore,the framework complements existing modern cache replacement policies and can be seamlessly integrated with minimal modifications,enhancing their overall efficacy. 展开更多
关键词 spacecraft processors cache management replacement policy storage efficiency memory hierarchy MICROARCHITECTURE
下载PDF
Proposal for sequential Stern-Gerlach experiment with programmable quantum processors
2
作者 胡孟军 缪海兴 张永生 《Chinese Physics B》 SCIE EI CAS CSCD 2024年第2期131-136,共6页
The historical significance of the Stern–Gerlach(SG)experiment lies in its provision of the initial evidence for space quantization.Over time,its sequential form has evolved into an elegant paradigm that effectively ... The historical significance of the Stern–Gerlach(SG)experiment lies in its provision of the initial evidence for space quantization.Over time,its sequential form has evolved into an elegant paradigm that effectively illustrates the fundamental principles of quantum theory.To date,the practical implementation of the sequential SG experiment has not been fully achieved.In this study,we demonstrate the capability of programmable quantum processors to simulate the sequential SG experiment.The specific parametric shallow quantum circuits,which are suitable for the limitations of current noisy quantum hardware,are given to replicate the functionality of SG devices with the ability to perform measurements in different directions.Surprisingly,it has been demonstrated that Wigner’s SG interferometer can be readily implemented in our sequential quantum circuit.With the utilization of the identical circuits,it is also feasible to implement Wheeler’s delayed-choice experiment.We propose the utilization of cross-shaped programmable quantum processors to showcase sequential experiments,and the simulation results demonstrate a strong alignment with theoretical predictions.With the rapid advancement of cloud-based quantum computing,such as BAQIS Quafu,it is our belief that the proposed solution is well-suited for deployment on the cloud,allowing for public accessibility.Our findings not only expand the potential applications of quantum computers,but also contribute to a deeper comprehension of the fundamental principles underlying quantum theory. 展开更多
关键词 sequential Stern-Gerlach quantum circuit quantum processor
下载PDF
Speeding up the MATLAB complex networks package using graphic processors 被引量:1
3
作者 张百达 唐玉华 +1 位作者 吴俊杰 李鑫 《Chinese Physics B》 SCIE EI CAS CSCD 2011年第9期460-467,共8页
The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks ... The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to rapidly realize an algorithm prototype of complex networks. The performance of the MATLAB codes can be further improved by using graphic processor units (GPU). This paper presents the strategies and performance of the GPU implementation of a complex networks package, and the Jacket toolbox of MATLAB is used. Compared with some commercially available CPU implementations, GPU can achieve a speedup of, on average, 11.3x. The experimental result proves that the GPU platform combined with the MATLAB language is a good combination for complex network research. 展开更多
关键词 complex networks graphic processors unit MATLAB Jacket Toolbox
下载PDF
SDN-Based Switch Implementation on Network Processors 被引量:1
4
作者 Yunchun Li Guodong Wang 《Communications and Network》 2013年第3期434-437,共4页
Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly ... Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly based on the idea of Software Define Network (SDN). Open vSwitch is a sort of software virtual switch, which conforms to the OpenFlow protocol standard. It is basically deployed in the Linux kernel hypervisor. This leads to its performance relatively poor because of the limited system resource. In turn, the packet process throughput is very low.In this paper, we present a Cavium-based Open vSwitch implementation. The Cavium platform features with multi cores and couples of hard ac-celerators. It supports zero-copy of packets and handles packet more quickly. We also carry some experiments on the platform. It indicates that we can use it in the enterprise network or campus network as convergence layer and core layer device. 展开更多
关键词 SDN OPEN vSwitch Network processors OpenFlow
下载PDF
High-Level Portable Programming Language for Optimized Memory Use of Network Processors
5
作者 Yasusi Kanada 《Communications and Network》 2015年第1期55-69,共15页
Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. ... Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. To solve these problems, this paper proposes an open, high-level, and portable programming language called “Phonepl”, which is independent from vendor-specific proprietary hardware and software but can be translated into an NP program with high performance especially in the memory use. A common NP hardware feature is that a whole packet is stored in DRAM, but the header is cached in SRAM. Phonepl has a hardware-independent abstraction of this feature so that it allows programmers mostly unconscious of this hardware feature. To implement the abstraction, four representations of packet data type that cover all the packet operations (including substring, concatenation, input, and output) are introduced. Phonepl have been implemented on Octeon NPs used in plug-ins for a network-virtualization environment called the VNode Infrastructure, and several packet-handling programs were evaluated. As for the evaluation result, the conversion throughput is close to the wire rate, i.e., 10 Gbps, and no packet loss (by cache miss) occurs when the packet size is 256 bytes or larger. 展开更多
关键词 NETWORK processors PORTABILITY HIGH-LEVEL Language Hardware Independence MEMORY Usage DRAM SRAM NETWORK Virtualization
下载PDF
Export Processors to be Pooled
6
《China's Foreign Trade》 2000年第5期45-45,共1页
关键词 Export processors to be Pooled
下载PDF
处理器Processors
7
《个人电脑》 2004年第1期110-110,共1页
关键词 微处理器 芯片组 主板 AMD ATHLON 64 processors
下载PDF
Processors处理器
8
《个人电脑》 2004年第1期89-89,共1页
2003年里,我们在处理器市场上看到了一幕幕重头戏的上演,基于Hyper Threading技术的Pentium4处理器、Pentium 4 EE、AMD Athlon 64、AMD Athlon 64FX系列,这些新产品的快速推出让人有目不暇接的感觉。
关键词 processors 微处理器 体系结构 CPU 纳米制造工艺
下载PDF
A Quantitative Evaluation of Vector Transcendental Functions on ARMv8-Based Processors
9
作者 沈洁 龙标 黄春 《Journal of Computer Science & Technology》 SCIE EI CSCD 2023年第3期686-701,共16页
Transcendental functions are important functions in various high performance computing applications.Because these functions are time-consuming and the vector units on modern processors become wider and more scalable,t... Transcendental functions are important functions in various high performance computing applications.Because these functions are time-consuming and the vector units on modern processors become wider and more scalable,there is an increasing demand for developing and using vector transcendental functions in such performance-hungry applications.However,the performance of vector transcendental functions as well as their accuracy remain largely unexplored.To address this issue,we perform a comprehensive evaluation of two Single Instruction Multiple Data(SIMD)intrinsics based vector math libraries on two ARMv8 compatible processors.We first design dedicated microbenchmarks that help us understand the performance behavior of vector transcendental functions.Then,we propose a piecewise,quantitative evaluation method with a set of meaningful metrics to quantify their performance and accuracy.By analyzing the experimental results,we find that vector transcendental functions achieve good performance speedups thanks to the vectorization and algorithm optimization.Moreover,vector math libraries can replace scalar math libraries in many cases because of improved performance and satisfactory accuracy.Despite this,the implementations of vector math libraries are still immature,which means further optimization is needed,and our evaluation reveals feasible optimization solutions for future vector math libraries. 展开更多
关键词 transcendental function vector math library piecewise quantitative evaluation microbenchmarking ARMv8-based processor
原文传递
High Performance General-Purpose Microprocessors: Past and Future 被引量:5
10
作者 胡伟武 侯锐 +1 位作者 肖俊华 章隆宾 《Journal of Computer Science & Technology》 SCIE EI CSCD 2006年第5期631-640,共10页
It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to sim... It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors (CMPs). Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced. 展开更多
关键词 high performance general-purpose microprocessor instruction level parallelism data level parallelism thread level parallelism chip multiprocessors Godson processor
原文传递
PARAMETRIC BOUNDS FOR LPT SCHEDULING ON UNIFORM PROCESSORS 被引量:2
11
作者 陈礴 《Acta Mathematicae Applicatae Sinica》 SCIE CSCD 1991年第1期67-73,共7页
The nonpreemptive assignment of independent tasks to a system of m uniform processors isexamined with the objective of minimizing the makespan. Using r_m, the ratio of the fasest speed tothe slowest speed of the syste... The nonpreemptive assignment of independent tasks to a system of m uniform processors isexamined with the objective of minimizing the makespan. Using r_m, the ratio of the fasest speed tothe slowest speed of the system, as a parameter, we assess the performance of LPT (largestprocessing time) schedule with respect to optimal schedules. It is shown thet the worst-case boundfor the ratio of the two schedule lengths is between 展开更多
关键词 LPT NG Pr PARAMETRIC BOUNDS FOR LPT SCHEDULING ON UNIFORM processors
原文传递
Fault Tolerance Mechanism in Chip Many-Core Processors 被引量:1
12
作者 张磊 韩银和 +1 位作者 李华伟 李晓维 《Tsinghua Science and Technology》 SCIE EI CAS 2007年第S1期169-174,共6页
As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performan... As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors’ yield by providing M spare cores. In such architecture, topology is an important factor because it greatly affects the processors’ performance. The concept of logical topology and a topology reconfiguration problem are introduced, which is able to transparently provide target topology with lowest performance degradation as the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that PRCS can give solutions with average 13.8% degradation with negligible computing time. 展开更多
关键词 chip many-core processors YIELD fault tolerance RECONFIGURATION NETWORK-ON-CHIP
原文传递
Taxonomy of Data Prefetching for Multicore Processors 被引量:1
13
作者 Surendra Byna 陈勇 孙贤和 《Journal of Computer Science & Technology》 SCIE EI CSCD 2009年第3期405-417,共13页
Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support,... Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well. 展开更多
关键词 taxonomy of prefetching strategies multicore processors data prefetching memory hierarchy
原文传递
BAR:a branch-alternation-resorting algorithm for locality exploration in graph processing
14
作者 邓军勇 WANG Junjie +2 位作者 JIANG Lin XIE Xiaoyan ZHOU Kai 《High Technology Letters》 EI CAS 2024年第1期31-42,共12页
Unstructured and irregular graph data causes strong randomness and poor locality of data accesses in graph processing.This paper optimizes the depth-branch-resorting algorithm(DBR),and proposes a branch-alternation-re... Unstructured and irregular graph data causes strong randomness and poor locality of data accesses in graph processing.This paper optimizes the depth-branch-resorting algorithm(DBR),and proposes a branch-alternation-resorting algorithm(BAR).In order to make the algorithm run in parallel and improve the efficiency of algorithm operation,the BAR algorithm is mapped onto the reconfigurable array processor(APR-16)to achieve vertex reordering,effectively improving the locality of graph data.This paper validates the BAR algorithm on the GraphBIG framework,by utilizing the reordered dataset with BAR on breadth-first search(BFS),single source shortest paht(SSSP)and betweenness centrality(BC)algorithms for traversal.The results show that compared with DBR and Corder algorithms,BAR can reduce execution time by up to 33.00%,and 51.00%seperatively.In terms of data movement,the BAR algorithm has a maximum reduction of 39.00%compared with the DBR algorithm and 29.66%compared with Corder algorithm.In terms of computational complexity,the BAR algorithm has a maximum reduction of 32.56%compared with DBR algorithm and53.05%compared with Corder algorithm. 展开更多
关键词 graph processing vertex reordering branch-alternation-resorting algorithm(BAR) reconfigurable array processor
下载PDF
Research and design of matrix operation accelerator based on reconfigurable array
15
作者 邓军勇 ZHANG Pan +2 位作者 JIANG Lin XIE Xiaoyan DENG Jingwen 《High Technology Letters》 EI CAS 2024年第2期128-137,共10页
In the case of massive data,matrix operations are very computationally intensive,and the memory limitation in standalone mode leads to the system inefficiencies.At the same time,it is difficult for matrix operations t... In the case of massive data,matrix operations are very computationally intensive,and the memory limitation in standalone mode leads to the system inefficiencies.At the same time,it is difficult for matrix operations to achieve flexible switching between different requirements when implemented in hardware.To address this problem,this paper proposes a matrix operation accelerator based on reconfigurable arrays in the context of the application of recommender systems(RS).Based on the reconfigurable array processor(APR-16)with reconfiguration,a parallelized design of matrix operations on processing element(PE)array is realized with flexibility.The experimental results show that,compared with the proposed central processing unit(CPU)and graphics processing unit(GPU)hybrid implementation matrix multiplication framework,the energy efficiency ratio of the accelerator proposed in this paper is improved by about 35×.Compared with blocked alternating least squares(BALS),its the energy efficiency ratio has been accelerated by about 1×,and the switching of matrix factorization(MF)schemes suitable for different sparsity can be realized. 展开更多
关键词 matrix factorization(MF) recommender system(RS) array processor RECONFIGURABLE matrix multiplication
下载PDF
Energy Efficient Block-Partitioned Multicore Processors for Parallel Applications
16
作者 祁轩 朱大开 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011年第3期418-433,共16页
Due to the increasing power consumption in modern computing systems, energy management has become an important research area in the last decade. Recently, multicore has emerged to be an energy efficient architecture t... Due to the increasing power consumption in modern computing systems, energy management has become an important research area in the last decade. Recently, multicore has emerged to be an energy efficient architecture that exploits parallelisms in modern applications. However, as the number of cores on a single chip continues to increase, it has been a grand challenge on how to effectively manage the energy efficiency of multicore-based systems. In this paper, based on the voltage island and dynamic voltage and frequency scaling (DVFS) techniques, we investigate the energy efficiency of block-partitioned multieore processors, where cores are grouped into blocks with the cores on one block sharing a DVFS- enabled power supply. Depending on the number of cores on each block, we study both symmetric and asymmetric block configurations. We develop a system-level power model (which can support various power management techniques) and derive both block- and system-wide energy-efficient frequencies for systems with block-partitioned multieore processors. Based on the power model, we prove that, for embarrassingly parallel applications, having all cores on a single block can achieve the same energy savings as that of the individual block configuration (where each core forms a single block and has its own power supply). However, for applications with limited degrees of parallelism, we show the superiority of the buddy-asymmetric block configuration, where the number of required blocks (and power supplies) is logarithmically related to the number of cores on the chip, in that it can achieve the same amount of energy savings as that of the individual block configuration. The energy efficiency of different block configurations is further evaluated through extensive simulations with both synthetic as well as a real life application. 展开更多
关键词 multicore processors dynamic voltage and frequency scaling (DVFS) voltage islands parallel applications
原文传递
Parallelization and sustainability of distributed genetic algorithms on many-core processors
17
作者 Yuji Sato Mikiko Sato 《International Journal of Intelligent Computing and Cybernetics》 EI 2014年第1期2-23,共22页
Purpose–The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core pr... Purpose–The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units(GPUs)and multi-core processors(MCPs).Design/methodology/approach–For distributed genetic algorithm(GA)models,the paper proposes a method where an island’s ID number is added to the header of data transferred by this island for use in fault detection.Findings–The paper has shown that the processing time of the proposed idea is practically negligible in applications and also shown that an optimal solution can be obtained even with a single stuck-at fault or a transient fault,and that increasing the number of parallel threads makes the system less susceptible to faults.Originality/value–The study described in this paper is a new approach to increase the sustainability of application program using distributed GA on GPUs and MCPs. 展开更多
关键词 Evolutionary computation Genetic algorithms Fault identification Many-core processors PARALLELIZATION
原文传递
Grounding Technique Quells EMI from Fast Processors
18
作者 David G.Morrison 《电源世界》 2003年第12期61-62,共2页
With microprocessor clock speeds rising above 1 GHz, theCPU’s heatsink comes under increasing scrutiny as a source ofelectromagnetic interference (EMI). Often, this problem isn’ taddressed until after the microproce... With microprocessor clock speeds rising above 1 GHz, theCPU’s heatsink comes under increasing scrutiny as a source ofelectromagnetic interference (EMI). Often, this problem isn’ taddressed until after the microprocessor has been designed inand the product is undergoing EMI tests. If EMI problems arediscovered at that point, they’re addressed with Band—Aid solutions,such as adding shielding hardware. Naturally, such fixesslow product development and add cost. Moreover, theymay be ineffective in future product spins with newer processors. 展开更多
关键词 processors hardware addressed SHIELDING adding CLOCK UNTIL COMES APERTURE titled
原文传递
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
19
作者 Jinying Kong Kai Nie +2 位作者 Qinglei Zhou Jinlong Xu Lin Han 《国际计算机前沿大会会议论文集》 2021年第1期180-189,共10页
The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow acce... The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow accessto thread private variables in the compilation of Sunway OpenMP programs, thispaper proposes a thread private variable access technique based on privilegedinstructions. The privileged instruction-based thread-private variable access techniquecentralizes the implementation of thread-private variables at the compilerlevel, eliminating the model switching overhead of invoking OS core processingand improving the speed of accessing thread-private variables. On the Sunway1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved 6.2% and6.8% running efficiency gains, respectively. The results show that the techniquesproposed in this paper can provide technical support for giving full play to theadvantages of Sunway’s high-performance multi-core processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Privileged instruction-based thread-private variable access technique Sunway 1621 processor
原文传递
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
20
作者 Kai Nie Qinglei Zhou +3 位作者 Hong Qian Jianmin Pang Jinlong Xu Yapeng Li 《国际计算机前沿大会会议论文集》 2021年第1期163-179,共17页
The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by hight... The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by highthread group control overhead in the compilation of Sunway OpenMP programs,this paper proposes the parallel region reconstruction technique. The parallelregion reconstruction technique expands the parallel scope of parallel regionsin OpenMP programs by parallel region merging and parallel region extending.Moreover, it reduces the number of parallel regions in OpenMP programs,decreases the overhead of frequent creation and convergence of thread groups,and converts standard fork-join model OpenMP programs to higher performanceSPMD modelOpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved 8.9% and 7.9% running efficiency improvementrespectively through parallel region reconstruction technique. As a result,the parallel region reconstruction technique is feasible and effective. It providestechnical support to fully exploit the multi-core parallelism advantage of Sunway’shigh-performance processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Parallel domain reconstruction technique
原文传递
上一页 1 2 14 下一页 到第
使用帮助 返回顶部