1. Combined Timetabling Procedure and Complete Local Search for No-Wait Job Shop Scheduling with Total Tardiness (cited by: 1)
Authors: 杨玉珍, 顾幸生. Journal of Donghua University (English Edition), EI, CAS, 2014, No. 2, pp. 83-91 (9 pages)
The strongly non-deterministic polynomial-hard (NP-hard) character of the job shop scheduling problem (JSSP) is widely acknowledged, and the problem becomes harder still when the no-wait constraint is added. This constraint is common in many production processes, such as chemical and metallurgical processes. However, compared with the extensive research on the traditional job shop problem, little attention has been paid to the no-wait constraint. In this paper, we therefore address the problem by decomposing it into two sub-problems, timetabling and sequencing, within the traditional framework. A new, efficient combined non-order timetabling method, coordinated with the objective of total tardiness, is proposed for the timetabling problem. For the sequencing problem, we present a modified complete local search with memory, combined with a crossover operator and distance counting. The entire algorithm was tested on well-known benchmark problems and compared with several existing algorithms. Computational experiments showed that the proposed algorithm performed both effectively and efficiently.
Keywords: job shop scheduling; no-wait; timetabling; tardiness; complete local search with memory
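For readers unfamiliar with the objective named in this entry, total tardiness is conventionally the sum over all jobs of max(0, C_j - d_j), where C_j is the completion time and d_j the due date of job j. The sketch below only illustrates that standard definition; it is not the authors' timetabling method, and the function name `total_tardiness` and the sample numbers are hypothetical.

```python
from typing import Sequence


def total_tardiness(completion_times: Sequence[float],
                    due_dates: Sequence[float]) -> float:
    """Return the sum of max(0, C_j - d_j) over all jobs.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(completion_times) != len(due_dates):
        raise ValueError("one due date per job is required")
    # A job that finishes early contributes zero, never a negative value.
    return sum(max(0.0, c - d) for c, d in zip(completion_times, due_dates))


if __name__ == "__main__":
    # Made-up example: three jobs finishing at 10, 14, 21 with due dates 12, 12, 15.
    # Per-job tardiness is 0, 2 and 6.
    print(total_tardiness([10, 14, 21], [12, 12, 15]))
```

Because early completion is not rewarded, a schedule can only lower this objective by finishing late jobs sooner, which is why the timetabling step described in the abstract is coordinated with it.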
2. CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications (cited by: 3)
Authors: 杨毅, 李超, 周辉阳. Journal of Computer Science & Technology, SCIE, EI, CSCD, 2015, No. 1, pp. 3-19 (17 pages)
Parallel programs consist of a series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times and 2.01 times on average.
Keywords: GPGPU; nested parallelism; compiler; local memory
3. Parallelization and Performance Optimization on Face Detection Algorithm with OpenCL: A Case Study (cited by: 1)
Authors: Weiyan Wang, Yunquan Zhang, Shengen Yan, Ying Zhang, Haipeng Jia. Tsinghua Science and Technology, EI, CAS, 2012, No. 3, pp. 287-295 (9 pages)
Face detection applications inherently need real-time performance. Although the Viola-Jones algorithm handles this elegantly, today's ever larger high-quality images and videos pose a new challenge for real-time processing. It is a good idea to parallelize the Viola-Jones algorithm with OpenCL to achieve high performance across both AMD and NVIDIA GPU platforms without introducing new algorithms. This paper presents the bottleneck of this application and discusses how to optimize face detection step by step, starting from a very naive implementation. Techniques such as hiding CPU execution time, subtle use of local memory as a high-speed scratchpad and manual cache, and variable granularity were used to improve performance. These techniques yield a speedup of 4 to 13 times, varying with image size. Furthermore, these ideas may shed some light on how to parallelize applications efficiently with OpenCL. Taking face detection as an example, this paper also summarizes some general advice on how to optimize OpenCL programs, to help other applications perform better on GPUs.
Keywords: Viola-Jones; OpenCL; time cost hidden; local memory usage; parallel granularity
4. Next High Performance and Low Power Flash Memory Package Structure
Author: Jung-Hoon Lee. Journal of Computer Science & Technology, SCIE, EI, CSCD, 2007, No. 4, pp. 515-520 (6 pages)
In general, NAND flash memory has advantages in low power consumption, storage capacity, and fast erase/write performance in contrast to NOR flash. However, the main drawback of NAND flash memory is its slow access time for random read operations. We therefore propose a new NAND flash memory package to overcome this major drawback. We present a high performance and low power NAND flash memory system with a dual cache memory. The proposed NAND flash package consists of two parts: a NAND flash memory module and a dual cache module. The new NAND flash memory system can achieve dramatically higher performance and lower power consumption compared with any conventional NAND-type flash memory module. Our results show that the proposed system can eliminate about 78% of write operations to the flash memory cell and about 70% of read operations from the flash memory cell by using only 3KB of additional cache space. This result indicates high potential for low power consumption and high performance gain.
Keywords: flash memory; NAND-type; NOR-type; memory localities; buffer or cache memory