Funding: National Natural Science Foundation of China (Nos. 61174040, 61104178); Shanghai Commission of Science and Technology, China (No. 12JC1403400); the Fundamental Research Funds for the Central Universities, China.
Abstract: The strongly non-deterministic polynomial-hard (NP-hard) character of the job shop scheduling problem (JSSP) has been widely acknowledged, and the problem becomes even harder when the no-wait constraint is attached, a constraint that arises in many production processes such as chemical and metallurgical processing. However, compared with the massive body of research on the traditional job shop problem, little attention has been paid to the no-wait constraint. In this paper, we therefore deal with this problem by decomposing it, within the traditional framework, into two sub-problems: timetabling and sequencing. A new and efficient combined non-order timetabling method, coordinated with the objective of total tardiness, is proposed for the timetabling sub-problem. For the sequencing sub-problem, we present a modified complete local search with memory, combined with a crossover operator and distance counting. The entire algorithm was tested on well-known benchmark problems and compared with several existing algorithms. Computational experiments show that the proposed algorithm performs both effectively and efficiently.
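The timetabling/sequencing split described above can be made concrete with a small sketch. The C++ code below is illustrative only: the names, the fixed job sequence, and the earliest-feasible-start rule are assumptions, not the paper's combined non-order timetabling method. It exploits the defining property of no-wait scheduling, namely that fixing a job's start time fixes all of its operations, to build a feasible timetable for a given job sequence and report its total tardiness.

```cpp
// Sketch of a timetabling step for the no-wait JSSP (illustrative only; the
// paper's "combined non-order timetabling" method is not reproduced here).
// In a no-wait schedule, fixing a job's start time fixes all of its
// operations, so timetabling reduces to choosing one start time per job.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Operation { int machine; int duration; };

struct Job {
    std::vector<Operation> ops;  // processed back-to-back (no-wait)
    int dueDate;
};

// Busy intervals [start, end) already committed on each machine.
using MachineLoad = std::vector<std::vector<std::pair<int, int>>>;

// True if starting 'job' at time t causes no overlap on any machine.
static bool feasible(const Job& job, int t, const MachineLoad& load) {
    int offset = 0;
    for (const auto& op : job.ops) {
        int s = t + offset, e = s + op.duration;
        for (const auto& iv : load[op.machine])
            if (s < iv.second && iv.first < e) return false;
        offset += op.duration;
    }
    return true;
}

// Greedy timetabling for a fixed job sequence: each job takes the earliest
// feasible start time. Returns the total tardiness of the resulting schedule.
int totalTardiness(const std::vector<Job>& seq, int numMachines) {
    MachineLoad load(numMachines);
    int total = 0;
    for (const auto& job : seq) {
        int t = 0;
        while (!feasible(job, t, load)) ++t;   // naive search; fine for a sketch
        int offset = 0;
        for (const auto& op : job.ops) {
            load[op.machine].push_back({t + offset, t + offset + op.duration});
            offset += op.duration;
        }
        total += std::max(0, t + offset - job.dueDate);
    }
    return total;
}

int main() {
    // Two machines, two jobs, tiny example.
    std::vector<Job> seq = {
        {{{0, 3}, {1, 2}}, /*dueDate=*/5},
        {{{1, 2}, {0, 3}}, /*dueDate=*/6},
    };
    std::printf("total tardiness = %d\n", totalTardiness(seq, 2));
    return 0;
}
```

In the paper's algorithm, a timetabling step of this kind serves as the inner evaluation used by the sequencing search; the sketch simply replaces it with the simplest greedy rule so the decomposition is visible.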
Funding: This work was supported by the National Science Foundation of USA under Grant No. CCF-1216569 and a CAREER award of the National Science Foundation of USA under Grant No. CCF-0968667.
Abstract: Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Our CUDA-NP compiler then automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations across threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, the proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
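The control-flow idea at the heart of CUDA-NP, launching extra "slave" threads up front and switching them on only for parallel loop sections, can be sketched directly in CUDA. The code below is an illustration of that idea only: the kernel name, the N_SLAVES group size, and the loop body are assumptions, and the actual CUDA-NP compiler generates such code automatically from OpenMP-like pragmas.

```cuda
// Illustrative sketch of nested parallelism inside a single CUDA kernel:
// extra ("slave") threads are launched up front, only the master of each
// logical task works in the sequential section, and the slaves are activated
// to split a parallel loop. This is NOT code generated by the CUDA-NP
// compiler; names and the loop body are made up for the example.
#include <cstdio>
#include <cuda_runtime.h>

#define N_SLAVES 32   // threads per logical task; chosen equal to the warp size

__global__ void taskKernel(const float* in, float* out, int numTasks, int loopCount) {
    // Requires blockDim.x to be a multiple of N_SLAVES (one warp per task here).
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int taskId  = tid / N_SLAVES;           // which logical task this thread serves
    int slaveId = threadIdx.x % N_SLAVES;   // lane within the task's thread group
    bool active = taskId < numTasks;        // keep whole warps converged for the shuffles

    // --- sequential section: only the master (slave 0) of each task works ---
    float base = 0.0f;
    if (active && slaveId == 0)
        base = in[taskId] * 2.0f;           // stand-in for per-task sequential code
    // broadcast the master's result to the other threads of the same group
    base = __shfl_sync(0xffffffff, base, 0);

    // --- parallel loop section: the group's threads split the iterations ---
    float partial = 0.0f;
    int n = active ? loopCount : 0;
    for (int i = slaveId; i < n; i += N_SLAVES)
        partial += base + i;                // stand-in for the loop body

    // warp-level reduction of the partial sums (valid because N_SLAVES == warp size)
    for (int offset = N_SLAVES / 2; offset > 0; offset /= 2)
        partial += __shfl_down_sync(0xffffffff, partial, offset);

    if (active && slaveId == 0) out[taskId] = partial;
}

int main() {
    const int numTasks = 4, loopCount = 100;
    float hIn[numTasks] = {1.f, 2.f, 3.f, 4.f}, hOut[numTasks];
    float *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    int totalThreads = numTasks * N_SLAVES;          // one group of slaves per task
    taskKernel<<<(totalThreads + 127) / 128, 128>>>(dIn, dOut, numTasks, loopCount);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    for (int t = 0; t < numTasks; ++t) std::printf("task %d -> %f\n", t, hOut[t]);
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```

Compared with dynamic parallelism, the slave threads here already share the parent's registers and shared memory scope, so no global-memory handoff or child-kernel launch is needed.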
Funding: Supported by the National Natural Science Foundation of China (No. 61133005) and the National High-Tech Research and Development (863) Program of China (No. 2012AA010902).
Abstract: Face detection is a real-time application by nature. Although the Viola-Jones algorithm handles it elegantly, today's ever larger high-quality images and videos still pose a challenge to real-time performance. Parallelizing the Viola-Jones algorithm with OpenCL is an attractive way to achieve high performance across both AMD and NVIDIA GPU platforms without devising new algorithms. This paper identifies the bottleneck of the application and discusses how to optimize face detection step by step, starting from a very naive implementation. Several techniques, such as hiding CPU execution time, subtle use of local memory as a high-speed scratchpad and manual cache, and variable granularity, are used to improve performance. These techniques yield a 4-13 times speedup, depending on the image size. Furthermore, these ideas may shed some light on how to parallelize applications efficiently with OpenCL. Taking face detection as an example, this paper also summarizes general advice on optimizing OpenCL programs, aiming to help other applications perform better on GPUs.
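One of the techniques mentioned above, using local memory as a high-speed scratchpad and manual cache, is sketched below. For consistency with the other examples the sketch is written in CUDA rather than OpenCL (CUDA shared memory is the counterpart of OpenCL local memory); the kernel, tile size, and window sum are illustrative assumptions rather than the paper's Viola-Jones implementation, and host-side setup is omitted.

```cuda
// Sketch of the "local memory as a manual cache" idea in CUDA terms. Each
// block stages an image tile plus a halo in shared memory once, then every
// thread evaluates its detection window against the cached pixels instead of
// re-reading global memory. The window sum stands in for the real cascade.
// Launch with dim3(TILE, TILE) thread blocks.
#include <cuda_runtime.h>

#define TILE 16          // threads per block dimension (assumption)
#define WIN  8           // detection-window side, so the halo is WIN - 1 pixels

__global__ void windowSums(const float* img, float* out, int width, int height) {
    __shared__ float tile[TILE + WIN - 1][TILE + WIN - 1];

    int gx = blockIdx.x * TILE + threadIdx.x;   // this thread's window origin
    int gy = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of the tile plus its halo into shared memory.
    for (int y = threadIdx.y; y < TILE + WIN - 1; y += TILE)
        for (int x = threadIdx.x; x < TILE + WIN - 1; x += TILE) {
            int ix = min(blockIdx.x * TILE + x, width  - 1);  // clamp at the border
            int iy = min(blockIdx.y * TILE + y, height - 1);
            tile[y][x] = img[iy * width + ix];
        }
    __syncthreads();

    if (gx + WIN > width || gy + WIN > height) return;

    // All WIN*WIN reads now hit shared memory rather than global memory.
    float sum = 0.0f;
    for (int dy = 0; dy < WIN; ++dy)
        for (int dx = 0; dx < WIN; ++dx)
            sum += tile[threadIdx.y + dy][threadIdx.x + dx];

    out[gy * width + gx] = sum;
}
```

The same tiling pattern carries over directly to an OpenCL kernel by replacing `__shared__` with `__local`, `__syncthreads()` with `barrier(CLK_LOCAL_MEM_FENCE)`, and the thread/block indices with `get_local_id`/`get_group_id`.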
Funding: This work was supported by a Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (Grant No. KRF-2005-003-D00270).
Abstract: In general, NAND flash memory has advantages over NOR flash in power consumption, storage capacity, and erase/write performance, but its main drawback is the slow access time for random read operations. We therefore propose a new NAND flash memory package that overcomes this drawback: a high-performance, low-power NAND flash memory system with a dual cache. The proposed package consists of two parts, a NAND flash memory module and a dual cache module. The new system achieves dramatically higher performance and lower power consumption than a conventional NAND-type flash memory module. Our results show that, using only 3 KB of additional cache space, the proposed system can eliminate about 78% of the write operations to the flash memory cells and about 70% of the read operations from them, which indicates a high potential for low power consumption and high performance.
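A minimal, host-side C++ sketch of the dual-cache idea follows: a small read cache and a separate write buffer sit in front of the flash cells and absorb repeated accesses, while counters record how many operations still reach the cells. The class name, capacities, and FIFO policies are assumptions for illustration, not the paper's package design.

```cpp
// Minimal sketch of a dual-cache front end for NAND flash: a small read cache
// and a separate write buffer absorb accesses before they reach the flash
// cells. Sizes and the FIFO policies are assumptions for illustration only.
#include <cstdint>
#include <cstdio>
#include <deque>
#include <unordered_map>

class DualCacheFlash {
public:
    DualCacheFlash(size_t readLines, size_t writeLines)
        : readCap_(readLines), writeCap_(writeLines) {}

    void read(uint64_t page) {
        if (readCache_.count(page) || writeBuf_.count(page)) return;  // hit
        ++cellReads_;                          // miss: access the flash cell
        insert(readCache_, readOrder_, page, readCap_);
    }

    void write(uint64_t page) {
        if (writeBuf_.count(page)) return;     // coalesced into a pending write
        if (writeBuf_.size() >= writeCap_) {   // evict oldest dirty page to flash
            uint64_t victim = writeOrder_.front();
            writeOrder_.pop_front();
            writeBuf_.erase(victim);
            ++cellWrites_;
        }
        writeBuf_[page] = true;
        writeOrder_.push_back(page);
    }

    void report() const {
        std::printf("flash cell reads: %llu, writes: %llu\n",
                    (unsigned long long)cellReads_, (unsigned long long)cellWrites_);
    }

private:
    // FIFO insertion with eviction; pending dirty pages are never flushed at
    // the end because this is only a counting sketch.
    void insert(std::unordered_map<uint64_t, bool>& m, std::deque<uint64_t>& order,
                uint64_t page, size_t cap) {
        if (m.size() >= cap) { m.erase(order.front()); order.pop_front(); }
        m[page] = true;
        order.push_back(page);
    }

    size_t readCap_, writeCap_;
    std::unordered_map<uint64_t, bool> readCache_, writeBuf_;
    std::deque<uint64_t> readOrder_, writeOrder_;
    uint64_t cellReads_ = 0, cellWrites_ = 0;
};

int main() {
    DualCacheFlash flash(/*readLines=*/4, /*writeLines=*/4);
    for (int i = 0; i < 100; ++i) {
        flash.read(i % 4);    // a hot read set that fits in the read cache
        flash.write(i % 4);   // repeated writes are coalesced in the write buffer
    }
    flash.report();
    return 0;
}
```

Running a workload through such a front end and comparing the cell-access counters with and without the caches is the kind of measurement behind the reported 78% write and 70% read reductions, although the paper's actual cache organization and replacement policies may differ from this sketch.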