This paper presents a parameterized instruction scheduling algorithm, driven by a machine description table, for the TH-RISC system, which has a 3- to 5-stage pipeline. The algorithm provides considerable flexibility in instruction scheduling and improves execution efficiency for rapidly evolving RISC machines. Using the scheduler as a tool, the paper also analyzes the effect of several methods for solving the instruction interlock problem. Finally, it gives a high-performance approach to the interlock problem that combines hardware feasibility with software effectiveness, together with the resulting improvement in instruction-level parallelism (ILP) and the speed-up results. The algorithm's complexity is O(n^2).
This paper uses timed Petri nets to model and analyze instruction-level loop scheduling with resource constraints, a problem that has been proven NP-complete. First, we present a new timed Petri net model that integrates functional unit allocation, register allocation, and spilling into a unified theoretical framework. We then develop a state subgraph, called the Register Allocation Solution Graph, which effectively describes the major behavior of the new model. The main property of this subgraph is that its number of nodes is polynomial. Finally, we prove that optimum loop schedules can be found with polynomial computational complexity for almost all practical loop programs. Our work suggests a new approach to finding optimum loop schedules.
With advances in integrated circuit technology, many-core processor architecture has become a research focus for computer architecture designers. Many-core architectures raise overall processor performance through task-level parallelism. Nevertheless, instruction-level parallelism remains an issue that many-core designers must consider carefully. This paper gives a formal description of floating-point efficiency and speedup and verifies the necessity of instruction-level scheduling. It analyzes the intra-core pipeline in detail and identifies the general problems of instruction-level scheduling. It then proposes applying instruction-level scheduling and software pipelining on many-core architectures. For the LU decomposition algorithm in the Splash2 benchmark suite, it presents an instruction scheduling scheme on Scratched Pad Memory (SPM) that uses the hardware support of the many-core architecture. The scheduled algorithm was simulated on the many-core simulator Godson-T; with 64 threads processing a 512×512 matrix, the program reaches 4 times its pre-scheduling performance.
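To illustrate what a table-driven scheduler of this kind might look like, here is a minimal Python sketch of generic list scheduling for a single-issue pipeline. It is not the TH-RISC algorithm itself: the names `MACHINE_TABLE`, `Instr`, `build_deps`, and `schedule`, the opcodes, and the latency values are all hypothetical, and only register dependences are modeled. The pairwise dependence check in `build_deps` is quadratic in the block length.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical machine description table: opcode -> cycles until the result is usable.
MACHINE_TABLE = {"load": 3, "mul": 2, "add": 1}


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


def build_deps(instrs, table):
    """Return preds[j] = [(i, latency), ...] by comparing every pair i < j."""
    preds = [[] for _ in instrs]
    for j, later in enumerate(instrs):
        for i in range(j):
            earlier = instrs[i]
            if earlier.dest is not None and earlier.dest in later.srcs:
                preds[j].append((i, table[earlier.op]))  # read after write
            elif later.dest is not None and (
                later.dest == earlier.dest or later.dest in earlier.srcs
            ):
                preds[j].append((i, 1))  # write after write / write after read
    return preds


def schedule(instrs, table):
    """Single-issue list scheduling; returns (issue order, issue cycle per index)."""
    n = len(instrs)
    preds = build_deps(instrs, table)
    succs = [[] for _ in range(n)]
    for j, plist in enumerate(preds):
        for i, lat in plist:
            succs[i].append((j, lat))
    # Priority: longest latency-weighted path from an instruction to the block end.
    height = [0] * n
    for i in reversed(range(n)):
        height[i] = max((lat + height[j] for j, lat in succs[i]), default=0)
    issue, order, cycle = {}, [], 0
    while len(issue) < n:
        ready = [
            j for j in range(n)
            if j not in issue
            and all(i in issue and issue[i] + lat <= cycle for i, lat in preds[j])
        ]
        if ready:  # otherwise the pipeline stalls for this cycle
            pick = max(ready, key=lambda j: (height[j], -j))
            issue[pick] = cycle
            order.append(pick)
        cycle += 1
    return order, issue


block = [
    Instr("load", "r1", ("r0",)),
    Instr("add", "r2", ("r1", "r1")),
    Instr("load", "r3", ("r0",)),
    Instr("add", "r4", ("r3", "r3")),
]
order, issue = schedule(block, MACHINE_TABLE)
print(order, issue)
```

With these assumed latencies the two loads are hoisted ahead of the adds, so the last instruction issues at cycle 4 instead of cycle 7 in source order. Retargeting to a different pipeline depth only requires editing the table, which is the flexibility a parameterized scheduler aims for.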