Pneumonia is an acute lung infection that has caused many fatalitiesglobally. Radiologists often employ chest X-rays to identify pneumoniasince they are presently the most effective imaging method for this purpose.Com...Pneumonia is an acute lung infection that has caused many fatalitiesglobally. Radiologists often employ chest X-rays to identify pneumoniasince they are presently the most effective imaging method for this purpose.Computer-aided diagnosis of pneumonia using deep learning techniques iswidely used due to its effectiveness and performance. In the proposed method,the Synthetic Minority Oversampling Technique (SMOTE) approach is usedto eliminate the class imbalance in the X-ray dataset. To compensate forthe paucity of accessible data, pre-trained transfer learning is used, and anensemble Convolutional Neural Network (CNN) model is developed. Theensemble model consists of all possible combinations of the MobileNetv2,Visual Geometry Group (VGG16), and DenseNet169 models. MobileNetV2and DenseNet169 performed well in the Single classifier model, with anaccuracy of 94%, while the ensemble model (MobileNetV2+DenseNet169)achieved an accuracy of 96.9%. Using the data synchronous parallel modelin Distributed Tensorflow, the training process accelerated performance by98.6% and outperformed other conventional approaches.展开更多
Deep neural networks are gaining importance and popularity in applications and services.Due to the enormous number of learnable parameters and datasets,the training of neural networks is computationally costly.Paralle...Deep neural networks are gaining importance and popularity in applications and services.Due to the enormous number of learnable parameters and datasets,the training of neural networks is computationally costly.Parallel and distributed computation-based strategies are used to accelerate this training process.Generative Adversarial Networks(GAN)are a recent technological achievement in deep learning.These generative models are computationally expensive because a GAN consists of two neural networks and trains on enormous datasets.Typically,a GAN is trained on a single server.Conventional deep learning accelerator designs are challenged by the unique properties of GAN,like the enormous computation stages with non-traditional convolution layers.This work addresses the issue of distributing GANs so that they can train on datasets distributed over many TPUs(Tensor Processing Unit).Distributed learning training accelerates the learning process and decreases computation time.In this paper,the Generative Adversarial Network is accelerated using the distributed multi-core TPU in distributed data-parallel synchronous model.For adequate acceleration of the GAN network,the data parallel SGD(Stochastic Gradient Descent)model is implemented in multi-core TPU using distributed TensorFlow with mixed precision,bfloat16,and XLA(Accelerated Linear Algebra).The study was conducted on the MNIST dataset for varying batch sizes from 64 to 512 for 30 epochs in distributed SGD in TPU v3 with 128×128 systolic array.An extensive batch technique is implemented in bfloat16 to decrease the storage cost and speed up floating-point computations.The accelerated learning curve for the generator and discriminator network is obtained.The training time was reduced by 79%by varying the batch size from 64 to 512 in multi-core TPU.展开更多
The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI,MapReduce,and Spark.An important step for any parallel clus...The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI,MapReduce,and Spark.An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes.This step governs the methodology and performance of the entire algorithm.Researchers typically use random,or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning,as per the requirements of the algorithm.However,these strategies are generic and are not tailor-made for any specific parallel clustering algorithm.In this paper,we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ.We also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS,Cell-Based Dimensional Split for dGridSLINK,which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution,and Projection-Based Split,which is a generic distribution strategy.All of these preserve spatial locality,achieve disjoint partitioning,and ensure good data load balancing.The experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for,based on which we give appropriate recommendations for their usage.展开更多
In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a paral...In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.展开更多
The currently available compilation techniques are for general computing and are not optimized for physical layer computing in 5G micro base stations.In such cases,the foreseeable data sizes and small code size are ap...The currently available compilation techniques are for general computing and are not optimized for physical layer computing in 5G micro base stations.In such cases,the foreseeable data sizes and small code size are application specific opportunities for baseband algorithm optimizations.Therefore,the special attention can be paid,for example,the specific register allocation algorithm has not been studied so far.The compilation for kernel sub-routines of baseband in 5G micro base stations is our focusing point.For applications of known and fixed data size,we proposed a compilation scheme of parallel data accessing,while operands can be mainly allocated and stored in registers.Based on a small register group(48×32b),the target of our compilation scheme is the optimization of baseband algorithms based on 4×4 or smaller matrices,maximizing the utilization of register files,and eliminating the extra register data exchanging.Meanwhile,when data is allocated into register files,we used VLIW(Very Long Instruction Word)machine to hide the time of data accessing and minimize the cost of data accessing,thus the total execution time is minimum.Experiments indicate that for algorithms with small data size,the cost of data accessing and extra addressing can be minimized.展开更多
Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions.The evolution of computer architectures(multi-core and many...Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions.The evolution of computer architectures(multi-core and many-core)towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm.In the last decade,the graphics processing unit,or GPU,has gained an important place in the field of high performance computing(HPC)because of its low cost and massive parallel processing power.Super-computing has become,for the first time,available to anyone at the price of a desktop computer.In this paper,we survey the concept of parallel computing and especially GPU computing.Achieving efficient parallel algorithms for the GPU is not a trivial task,there are several technical restrictions that must be satisfied in order to achieve the expected performance.Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it.Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model.In particular,we show how this new technology can help the field of computational physics,especially when the problem is data-parallel.We present four examples of computational physics problems;n-body,collision detection,Potts model and cellular automata simulations.These examples well represent the kind of problems that are suitable for GPU computing.By understanding the GPU architecture and its massive parallelism programming model,one can overcome many of the technical limitations found along the way,design better GPU-based algorithms for computational physics problems and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.展开更多
Abstract Big data has received great attention in research and application. However, most of the current efforts focus on system and application to handle the challenges of "volume" and "velocity", and not much ha...Abstract Big data has received great attention in research and application. However, most of the current efforts focus on system and application to handle the challenges of "volume" and "velocity", and not much has been done on the theoreti- cal foundation and to handle the challenge of "variety". Based on metric-space indexing and computationalcomplexity the- ory, we propose a parallel computing framework for big data. This framework consists of three components, i.e., universal representation of big data by abstracting various data types into metric space, partitioning of big data based on pair-wise distances in metric space, and parallel computing of big data with the NC-class computing theory.展开更多
Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with gene...Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is a very tedious and error-prone approach. We believe that it is desirable to possess a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP, including 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on PreeScale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads.展开更多
It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to sim...It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors (CMPs). Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced.展开更多
文摘Pneumonia is an acute lung infection that has caused many fatalitiesglobally. Radiologists often employ chest X-rays to identify pneumoniasince they are presently the most effective imaging method for this purpose.Computer-aided diagnosis of pneumonia using deep learning techniques iswidely used due to its effectiveness and performance. In the proposed method,the Synthetic Minority Oversampling Technique (SMOTE) approach is usedto eliminate the class imbalance in the X-ray dataset. To compensate forthe paucity of accessible data, pre-trained transfer learning is used, and anensemble Convolutional Neural Network (CNN) model is developed. Theensemble model consists of all possible combinations of the MobileNetv2,Visual Geometry Group (VGG16), and DenseNet169 models. MobileNetV2and DenseNet169 performed well in the Single classifier model, with anaccuracy of 94%, while the ensemble model (MobileNetV2+DenseNet169)achieved an accuracy of 96.9%. Using the data synchronous parallel modelin Distributed Tensorflow, the training process accelerated performance by98.6% and outperformed other conventional approaches.
文摘Deep neural networks are gaining importance and popularity in applications and services.Due to the enormous number of learnable parameters and datasets,the training of neural networks is computationally costly.Parallel and distributed computation-based strategies are used to accelerate this training process.Generative Adversarial Networks(GAN)are a recent technological achievement in deep learning.These generative models are computationally expensive because a GAN consists of two neural networks and trains on enormous datasets.Typically,a GAN is trained on a single server.Conventional deep learning accelerator designs are challenged by the unique properties of GAN,like the enormous computation stages with non-traditional convolution layers.This work addresses the issue of distributing GANs so that they can train on datasets distributed over many TPUs(Tensor Processing Unit).Distributed learning training accelerates the learning process and decreases computation time.In this paper,the Generative Adversarial Network is accelerated using the distributed multi-core TPU in distributed data-parallel synchronous model.For adequate acceleration of the GAN network,the data parallel SGD(Stochastic Gradient Descent)model is implemented in multi-core TPU using distributed TensorFlow with mixed precision,bfloat16,and XLA(Accelerated Linear Algebra).The study was conducted on the MNIST dataset for varying batch sizes from 64 to 512 for 30 epochs in distributed SGD in TPU v3 with 128×128 systolic array.An extensive batch technique is implemented in bfloat16 to decrease the storage cost and speed up floating-point computations.The accelerated learning curve for the generator and discriminator network is obtained.The training time was reduced by 79%by varying the batch size from 64 to 512 in multi-core TPU.
文摘The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI,MapReduce,and Spark.An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes.This step governs the methodology and performance of the entire algorithm.Researchers typically use random,or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning,as per the requirements of the algorithm.However,these strategies are generic and are not tailor-made for any specific parallel clustering algorithm.In this paper,we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ.We also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS,Cell-Based Dimensional Split for dGridSLINK,which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution,and Projection-Based Split,which is a generic distribution strategy.All of these preserve spatial locality,achieve disjoint partitioning,and ensure good data load balancing.The experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for,based on which we give appropriate recommendations for their usage.
文摘In Additive Manufacturing field, the current researches of data processing mainly focus on a slicing process of large STL files or complicated CAD models. To improve the efficiency and reduce the slicing time, a parallel algorithm has great advantages. However, traditional algorithms can't make full use of multi-core CPU hardware resources. In the paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm. And the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, effects of threads number and layers number are investigated by a serial of experiments. The experimental results show that the threads number and layers number are two remarkable factors to the speedup ratio. The tendency of speedup versus threads number reveals a positive relationship which greatly agrees with the Amdahl's law, and the tendency of speedup versus layers number also keeps a positive relationship agreeing with Gustafson's law. The new algorithm uses topological information to compute contours with a parallel method of speedup. Another parallel algorithm based on data parallel is used in experiments to show that pipeline parallel mode is more efficient. A case study at last shows a suspending performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm can make full use of the multi-core CPU hardware, accelerate the slicing process, and compared with the data parallel slicing algorithm, the new slicing algorithm in this paper adopts a pipeline parallel model, and a much higher speedup ratio and efficiency is achieved.
基金supported by the research funding KYQD(ZR)1974 from Hainan University.
文摘The currently available compilation techniques are for general computing and are not optimized for physical layer computing in 5G micro base stations.In such cases,the foreseeable data sizes and small code size are application specific opportunities for baseband algorithm optimizations.Therefore,the special attention can be paid,for example,the specific register allocation algorithm has not been studied so far.The compilation for kernel sub-routines of baseband in 5G micro base stations is our focusing point.For applications of known and fixed data size,we proposed a compilation scheme of parallel data accessing,while operands can be mainly allocated and stored in registers.Based on a small register group(48×32b),the target of our compilation scheme is the optimization of baseband algorithms based on 4×4 or smaller matrices,maximizing the utilization of register files,and eliminating the extra register data exchanging.Meanwhile,when data is allocated into register files,we used VLIW(Very Long Instruction Word)machine to hide the time of data accessing and minimize the cost of data accessing,thus the total execution time is minimum.Experiments indicate that for algorithms with small data size,the cost of data accessing and extra addressing can be minimized.
基金supported by Fondecyt Project No.1120495.Finally,thanks to Renato Cerro for improving the English of this manuscript.
文摘Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions.The evolution of computer architectures(multi-core and many-core)towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm.In the last decade,the graphics processing unit,or GPU,has gained an important place in the field of high performance computing(HPC)because of its low cost and massive parallel processing power.Super-computing has become,for the first time,available to anyone at the price of a desktop computer.In this paper,we survey the concept of parallel computing and especially GPU computing.Achieving efficient parallel algorithms for the GPU is not a trivial task,there are several technical restrictions that must be satisfied in order to achieve the expected performance.Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it.Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model.In particular,we show how this new technology can help the field of computational physics,especially when the problem is data-parallel.We present four examples of computational physics problems;n-body,collision detection,Potts model and cellular automata simulations.These examples well represent the kind of problems that are suitable for GPU computing.By understanding the GPU architecture and its massive parallelism programming model,one can overcome many of the technical limitations found along the way,design better GPU-based algorithms for computational physics problems and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.
文摘Abstract Big data has received great attention in research and application. However, most of the current efforts focus on system and application to handle the challenges of "volume" and "velocity", and not much has been done on the theoreti- cal foundation and to handle the challenge of "variety". Based on metric-space indexing and computationalcomplexity the- ory, we propose a parallel computing framework for big data. This framework consists of three components, i.e., universal representation of big data by abstracting various data types into metric space, partitioning of big data based on pair-wise distances in metric space, and parallel computing of big data with the NC-class computing theory.
基金supported by the National High Technology Research and Development 863 Program of China under Grant No.2012AA010901the National Natural Science Foundation of China under Grant No.61103021
文摘Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is a very tedious and error-prone approach. We believe that it is desirable to possess a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP, including 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on PreeScale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads.
基金Supported by the National Natural Science Foundation of China for Distinguished Young Scholar under Grant No. 60325205 the National High Technology Development 863 Program of China under Grants No. 2002AA110010, No. 2005AA110010 No. 2005AA119020, and the National Grand Fundamental Research 973 Program of China under Grant No. 2005CB321600.
文摘It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors (CMPs). Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced.