Abstract: The network switches in the data plane of Software Defined Networking (SDN) are empowered by an elementary process in which an enormous number of packets, representing big volumes of data, are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates data processing: instead of repeatedly processing individual packets, the corresponding actions are performed on whole flows of packets. In this paper, we first address the limitations of a typical packet classification algorithm, Tuple Space Search (TSS). We then present a set of scenarios for parallelizing it on different parallel processing platforms, including Graphics Processing Units (GPUs), clusters of Central Processing Units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster is the best platform for parallelizing packet classification algorithms, delivering an average throughput of 4.2 million packets per second (Mpps). That is, the hybrid cluster built by integrating Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and the OpenMP programming model could classify 0.24 million packets per second more than the GPU cluster scheme. Such a packet classifier satisfies the processing-speed requirements of the programmable network systems that would be used to communicate big medical data.
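To make the TSS step concrete, the following is a minimal two-field sketch of tuple space search with packet-level OpenMP parallelism, one plausible reading of the CPU side of the hybrid scheme. It is not the authors' implementation: the names (Rule, Tuple, classify_batch), the restriction to source/destination IP prefixes, and the use of a hash table per tuple are illustrative assumptions; the paper's hybrid classifier additionally distributes batches over MPI ranks and CUDA devices.

```cuda
// Hypothetical sketch of Tuple Space Search: rules are grouped by prefix-length
// combination ("tuple"); each tuple keeps an exact-match hash table over the
// masked header fields. Names and the 2-field simplification are illustrative.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Rule   { int priority; int action; };
struct Packet { uint32_t src_ip, dst_ip; };

struct Tuple {
    int src_len, dst_len;                       // prefix lengths of this tuple
    std::unordered_map<uint64_t, Rule> table;   // masked fields -> rule

    uint64_t key(uint32_t s, uint32_t d) const {
        uint32_t sm = src_len ? ~0u << (32 - src_len) : 0;
        uint32_t dm = dst_len ? ~0u << (32 - dst_len) : 0;
        return (uint64_t(s & sm) << 32) | (d & dm);
    }
};

// Classify a batch of packets: probe every tuple, keep the best-priority hit.
// The batch loop is parallelized across CPU cores with OpenMP; in a hybrid
// cluster, batches would further be spread over MPI ranks and/or GPUs.
void classify_batch(const std::vector<Tuple>& tuples,
                    const std::vector<Packet>& pkts,
                    std::vector<int>& actions) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)pkts.size(); ++i) {
        int best_prio = -1, best_action = -1;
        for (const Tuple& t : tuples) {
            auto it = t.table.find(t.key(pkts[i].src_ip, pkts[i].dst_ip));
            if (it != t.table.end() && it->second.priority > best_prio) {
                best_prio   = it->second.priority;
                best_action = it->second.action;
            }
        }
        actions[i] = best_action;   // caller pre-sizes actions to pkts.size()
    }
}
```

Because every tuple must be probed for every packet, throughput scales poorly with the number of tuples, which is exactly the kind of limitation that motivates offloading the per-packet lookups to many parallel workers.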
Funding: the Engineering Research Center of Geospatial Information and Digital Technology (NASG), Wuhan University [grant number SIDT20170601]; the Hubei Provincial Key Laboratory of Intelligent Geoinformation Processing, China University of Geosciences (Wuhan) [grant number KLIGIP2016A03]; the Fundamental Research Funds for the Central Universities [grant number ZYGX2015J111]; the Key Laboratory of Spatial Data Mining & Information Sharing of the Ministry of Education, Fuzhou University [grant numbers 2016LSDMIS06 and 2017LSDMIS03]; and the National Science Foundation of the United States [Award Nos. 1251095 and 1723292].
Abstract: The mean shift image segmentation algorithm is very computation-intensive. To address the need to perform a large number of remote-sensing (RS) image segmentations in real-world applications, this study has investigated the parallelization of the mean shift algorithm on a single graphics processing unit (GPU) and a task-scheduling method based on the message passing interface (MPI)+OpenCL programming model on a GPU cluster platform. This paper presents the test results of the parallel mean shift image segmentation algorithm on Shelob, a GPU cluster platform at Louisiana State University, with different datasets and parameters. The experimental results show that the proposed parallel algorithm achieves good speedups with different configurations and RS data and provides an effective solution for RS image processing on a GPU cluster.
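The per-pixel kernel below sketches the core mean shift filtering step that such a GPU implementation parallelizes, with one thread per pixel iterating toward a mode in the joint spatial-range domain. It is a stand-in, not the paper's code: the paper uses OpenCL kernels and an MPI task scheduler over cluster nodes, whereas this sketch uses CUDA syntax (for consistency with the other sketches here), assumes a single-channel image, and uses a flat kernel with spatial bandwidth hs and range bandwidth hr; all names are illustrative.

```cuda
// Illustrative CUDA kernel for per-pixel mean shift filtering (one thread per
// pixel). Grayscale image, flat kernel; the MPI layer that distributes image
// tiles across cluster nodes is not shown.
__global__ void mean_shift_filter(const float* img, float* out,
                                  int width, int height,
                                  float hs, float hr, int max_iter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    float x = px, y = py, v = img[py * width + px];

    for (int iter = 0; iter < max_iter; ++iter) {
        float sx = 0.f, sy = 0.f, sv = 0.f, w = 0.f;
        int r = (int)hs;
        // Average over the spatial window, keeping only neighbors whose
        // intensity lies within the range bandwidth.
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                int nx = (int)x + dx, ny = (int)y + dy;
                if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                float nv = img[ny * width + nx];
                if (fabsf(nv - v) <= hr) {
                    sx += nx; sy += ny; sv += nv; w += 1.f;
                }
            }
        if (w == 0.f) break;
        float mx = sx / w, my = sy / w, mv = sv / w;
        float shift = fabsf(mx - x) + fabsf(my - y) + fabsf(mv - v);
        x = mx; y = my; v = mv;
        if (shift < 0.01f) break;   // converged to a mode
    }
    out[py * width + px] = v;       // filtered value = intensity of the mode
}
```

Because each pixel's mode search is independent, the computation maps naturally onto GPU threads, and whole images or tiles can be scheduled as coarse-grained tasks across the cluster's nodes.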
Abstract: In the field of high-performance computing, coprocessors with powerful floating-point computing capability are developing rapidly. In recent years, using coprocessors such as GPUs to accelerate the finite-difference time-domain (FDTD) algorithm has become a hot topic in computational electromagnetics. In this work, the three-dimensional UPML-FDTD algorithm was implemented and optimized on a GPU cluster. The simulation results, driven by an electric dipole source, were validated against the analytical solution, showing that the algorithm achieves high accuracy. The performance of the FDTD algorithm was then tested on NVIDIA Tesla M2070 and K20m GPU clusters: the results before and after optimization were compared, the computing performance of the GPU was compared with that of the CPU, and a scalability test was carried out with 80 NVIDIA Tesla K20m GPUs. The results show that the optimized FDTD algorithm achieves a substantial performance improvement and attains satisfactory parallel efficiency on the GPU cluster.
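As a minimal sketch of the kind of stencil the GPU accelerates, the kernel below updates only the Ex component of a Yee grid in vacuum. It is illustrative, not the paper's code: the UPML absorbing-boundary terms, the other field-component updates, material coefficients, and the multi-GPU domain decomposition of the 3D UPML-FDTD implementation are all omitted, and the names and flattened array layout are assumptions.

```cuda
// One Yee-grid E-field update (Ex only), one thread per grid cell.
// Fields are flattened as idx = i + nx*(j + ny*k); coefficients
// cb_y = dt/(eps0*dy) and cb_z = dt/(eps0*dz) are precomputed on the host.
__global__ void update_ex(float* ex, const float* hy, const float* hz,
                          int nx, int ny, int nz,
                          float cb_y, float cb_z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    // Skip out-of-range threads and the j=0 / k=0 planes needed by the
    // backward differences.
    if (i >= nx || j < 1 || j >= ny || k < 1 || k >= nz) return;

    int idx = i + nx * (j + ny * k);
    // Discrete curl of H around the Ex edge: dHz/dy - dHy/dz.
    float curl = cb_y * (hz[idx] - hz[idx - nx])
               - cb_z * (hy[idx] - hy[idx - nx * ny]);
    ex[idx] += curl;
}
```

Each time step launches such kernels for all six field components; in a cluster version, the grid would additionally be decomposed across GPUs, with halo planes exchanged between neighboring subdomains (e.g., via MPI) after every update.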