Under a very general condition (the TNC condition), we show that the spectral radius of the kernel of a general branching process is a threshold parameter and hence plays the role of the basic reproduction number in the usual CMJ processes. We also discuss some properties of the extinction probability and the generating operator of general branching processes. As an application to epidemics, in the final section we suggest a generalization of the SIR model that can describe the transmission of infectious diseases in an inhomogeneous population.
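To make the threshold idea concrete, here is a minimal sketch under a strong simplifying assumption: the kernel is replaced by a finite mean-offspring matrix over a few types, which is not the general setting of the paper. The matrices `M_sub` and `M_super` and the function names are hypothetical, chosen only for illustration. The spectral radius is estimated by power iteration, and a value above 1 is read as the supercritical regime.

```python
# Sketch only: a finite-type stand-in for the kernel of a general branching process.
# M[i][j] is a hypothetical mean number of type-j offspring of a type-i individual.

def spectral_radius(M, iterations=200):
    """Estimate the spectral radius of a positive matrix by power iteration."""
    n = len(M)
    v = [1.0] * n
    rho = 0.0
    for _ in range(iterations):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w)              # growth factor, since v is scaled to max 1
        v = [x / rho for x in w]  # renormalise so the largest entry is 1
    return rho

def is_supercritical(M):
    """Threshold test: spectral radius above 1."""
    return spectral_radius(M) > 1.0

# Hypothetical two-type examples (not taken from the paper).
M_sub = [[0.3, 0.2], [0.1, 0.4]]    # eigenvalues 0.5 and 0.2
M_super = [[1.2, 0.5], [0.3, 0.9]]  # largest eigenvalue is above 1
```

Power iteration is used here only because it needs no external libraries; for larger matrices an eigenvalue routine from a numerical library would be the more usual choice.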
We consider a structural stochastic volatility model for the loss from a large portfolio of credit risky assets. Both the asset value and the volatility processes are correlated through systemic Brownian motions, with default determined by the asset value reaching a lower boundary. We prove that if our volatility models are picked from a class of mean-reverting diffusions, the system converges as the portfolio becomes large and, when the vol-of-vol function satisfies certain regularity and boundedness conditions, the limit of the empirical measure process has a density given in terms of a solution to a stochastic initial-boundary value problem on a half-space. The problem is defined in a special weighted Sobolev space. Regularity results are established for solutions to this problem, and then we show that there exists a unique solution. In contrast to the CIR volatility setting covered by the existing literature, our results hold even when the systemic Brownian motions are taken to be correlated.
Numerical treatment of engineering application problems often eventually reduces to solving systems of linear or nonlinear equations. On digital computing devices, the solution process usually takes a very long time because of the extremely large problem sizes encountered in most real-world engineering applications. Practical solvers for systems of linear and nonlinear equations based on multiple graphics processing units (GPUs) are therefore proposed to accelerate the solution process. In the linear and nonlinear solvers, the preconditioned bi-conjugate gradient stabilized (PBi-CGstab) method and the inexact Newton method are used to achieve fast and stable convergence. Multiple GPUs are used to provide the additional data storage that large problems need.
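As a rough illustration of the linear solver named above, the following is a minimal single-threaded sketch of preconditioned BiCGSTAB. It assumes a Jacobi (diagonal) preconditioner and dense list-of-lists storage; the abstract does not say which preconditioner or storage format the authors use, and nothing here reflects their multi-GPU implementation. The matrix `A` and right-hand side `b` are hypothetical test data.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pbicgstab(A, b, tol=1e-10, max_iter=100):
    """Preconditioned BiCGSTAB with a Jacobi preconditioner (an assumption)."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # K^{-1} for K = diag(A)
    x = [0.0] * n
    r = list(b)        # residual b - A x for the zero initial guess
    r_hat = list(r)    # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        y = [d * pi for d, pi in zip(inv_diag, p)]   # apply preconditioner
        v = matvec(A, y)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if dot(s, s) ** 0.5 < tol:                   # converged at the half step
            x = [xi + alpha * yi for xi, yi in zip(x, y)]
            break
        z = [d * si for d, si in zip(inv_diag, s)]   # apply preconditioner
        t = matvec(A, z)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * yi + omega * zi for xi, yi, zi in zip(x, y, z)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if dot(r, r) ** 0.5 < tol:
            break
        rho = rho_new
    return x

# Hypothetical test system whose exact solution is [1, 2, 3].
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
solution = pbicgstab(A, b)
```

In an inexact Newton scheme, a solver of this kind would typically be called once per Newton step with a loose tolerance on the linearized system; that outer loop is omitted here.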
A new class of branching models, the general collision branching processes with two parameters, is considered in this paper. For such models, it is necessary to evaluate the absorbing probabilities and mean extinction times for both absorbing states. Regularity and uniqueness criteria are first established. Explicit expressions are then obtained for the extinction probability vector, the mean extinction times and the conditional mean extinction times. The explosion behavior of these models is investigated and an explicit expression for the mean explosion time is established. The mean global holding time is also obtained. It is revealed that these properties differ substantially between the super-explosive and sub-explosive cases.
In this paper, a class of reaction diffusion processes with general reaction rates is studied. A necessary and sufficient condition for the reversibility of this class of reaction diffusion processes is given, and then the ergodicity of these processes is proved.
Simulation is an important means of evaluating the performance of computer architectures. At present, the serial simulation of general purpose graphics processing unit (GPGPU) architectures is the main bottleneck for simulation speed. To address this issue, we propose intra-kernel parallelization on a multicore processor and inter-kernel parallelization on a multiple-machine platform. We apply these two methods to the GPGPU-sim simulator. The intra-kernel parallelization method first parallelizes the serial simulation of multiple compute units in one cycle. It then parallelizes the timing and functional simulation to reduce the performance loss caused by synchronization between different compute units. The inter-kernel parallelization method divides the kernels of a CUDA program into several groups and distributes these groups across multiple simulation hosts to perform the simulation. Experimental results show that the intra-kernel parallelization method achieves a speed-up of up to 12 with a maximum error rate of 0.0094% on a 32-core machine, and the inter-kernel parallelization method accelerates the simulation by a factor of up to 3.9 with a maximum error rate of 0.11% on four simulation hosts. Because the two methods are orthogonal, they can be combined on multiple multi-core hosts for further performance improvements.
The wide adoption of medical image processing, and the data deluge that comes with it, calls for faster and more efficient systems. Owing to recent advances in heterogeneous architectures, there has been a resurgence of research aimed at FPGA-based as well as GPGPU-based accelerator design. This paper quantitatively analyzes the workload, computational intensity and memory performance of a single-particle 3D reconstruction application called EMAN. It parallelizes the application on CUDA GPGPU architectures, decouples the memory operations from the computing flow, and orchestrates the thread-data mapping to reduce the overhead of off-chip memory operations. It then follows the trend towards FPGA-based accelerator design by offloading computing-intensive kernels to dedicated hardware modules. Furthermore, a customized memory subsystem is designed to facilitate the decoupling and optimization of computing-dominated data access patterns. The proposed accelerator design strategies are evaluated by comparing them with a parallelized program on a 4-core CPU. The CUDA version on a GTX480 shows a speedup of about 6 times. The stream architecture implemented on a Xilinx Virtex LX330 FPGA achieves a reported speedup of 2.54 times. Measured in terms of power efficiency, the FPGA-based accelerator outperforms the 4-core CPU and the GTX480 by 7.3 times and 3.4 times, respectively.
An integrated processing system for visualizing three-dimensional laser scanning information of goafs was developed. It provides multiple functions, such as laser scanning information management for goafs, de-noising and optimization of cloud data, construction, display and manipulation of three-dimensional models, model editing, profile generation, calculation of goaf volume and roof area, Boolean operations among models, and interaction with third-party software. With a concise interface and plentiful data input/output interfaces, the system features high integration and simple, convenient operation. In practice, the system has proven adaptable, reliable and stable.
Spectra are fundamental observational data for astronomical research, but understanding them depends strongly on theoretical models with many fundamental parameters from theoretical calculations. Different models give different insights into a specific object. Hence, laboratory benchmarks for these theoretical models become necessary. An electron beam ion trap is an ideal facility for spectroscopic benchmarks because its electron density and temperature are similar to those of astrophysical plasmas in stellar coronae, supernova remnants and so on. In this paper, we describe the performance of a small electron beam ion trap/source facility installed at the National Astronomical Observatories, Chinese Academy of Sciences. We present some preliminary experimental results on X-ray emission, ion production, the ionization process of trapped ions, and the effects of charge exchange on the ionization.
Having developed by way of concept extension, Marxism appears nowadays as the concrete-universal theory, in which the originally imperfect program of transition from abstract-universal to concrete-universal concepts of logic and sense is realized on a materialistic foundation. This program, which was brought about in Karl Marx's "Capital", has not been sufficiently expressed in classical or contemporary philosophy. The base of this new Marxist philosophical form is constructed not from the terms of matter, movement, and development overall, but from the conception of a general, naturally determined universal process of infinite movement from lower to higher forms of matter. We are aware of four such forms: physical, chemical, biological, and social matter. Representing the eternal world as a progressive whole, modern materialism makes the nature and proper place of each fundamental science understandable and helps to clarify the place and future development trends of Man in the world.
The quality of full-disk solar Hα images is significantly degraded by stripe interference. In this paper, to improve the analysis of morphological evolution, a robust solution for removing stripe interference from a partial full-disk solar Hα image is proposed. The full-disk solar image is decomposed into a set of support value images on different scales by convolving the image with a sequence of multiscale support value filters, which are calculated from the mapped least-squares support vector machines (LS-SVMs). To match the resolution of the support value images, a scale-adaptive LS-SVM regression model is used to remove stripe interference from the support value images. We demonstrate the advantages of our method on solar Hα images taken in 2001-2002 at the Huairou Solar Observing Station. Our experimental results show that the method removes the stripe interference well in solar Hα images and that the restored images can be used in morphological research.
Blazars are characterized by large intensity and spectral variations across the electromagnetic spectrum. It is believed that jets emerging from them are almost aligned with the line of sight. The majority of identified extragalactic sources in the γ-ray catalogs of EGRET and Fermi are blazars. Observationally, blazars can be divided into two classes: flat spectrum radio quasars (FSRQs) and BL Lacs. BL Lacs usually exhibit lower γ-ray luminosity and harder power law spectra at γ-ray energies than FSRQs. We attempt to explain the high energy properties of FSRQs and BL Lacs from Fermi γ-ray space telescope observations. It was argued previously that the difference in accretion rates is mainly responsible for the large mismatch in observed γ-ray luminosity. However, when intrinsic luminosities are derived by correcting for beaming effects, this difference in γ-ray luminosity between the two classes is significantly reduced. To explain this difference in intrinsic luminosities, we propose that spin plays an important role in the luminosity distribution dichotomy of BL Lacs and FSRQs. As the outflow power of a blazar increases with increasing spin of the central black hole, we suggest that spin plays a crucial role in making BL Lac sources less luminous and slower rotators than FSRQ sources.
An efficient computing framework, namely PFlows, for fully resolved direct numerical simulations of particle-laden flows was accelerated on NVIDIA graphics processing units (GPUs) and GPU-like accelerator (DCU) cards. The framework couples the lattice Boltzmann method for fluid flow with the immersed boundary method for fluid-particle interaction and the discrete element method for particle collision, using two fixed Eulerian meshes and one moving Lagrangian point mesh, respectively. All parts are accelerated by a fine-grained parallelism technique using CUDA on GPUs, and further using HIP on DCU cards; that is, the calculation on each fluid grid point, each immersed boundary point, each particle motion, and each particle-pair collision is handled by one thread. Coalesced memory accesses to the LBM distribution functions with a Structure of Arrays data layout are used to maximize utilization of hardware bandwidth. Parallel reduction with shared memory for the data of immersed boundary points is adopted to reduce access to global memory when integrating the particle hydrodynamic force. MPI is further used for computing on heterogeneous architectures with multiple CPUs and GPUs/DCUs. The communications between adjacent processors are hidden by overlapping them with calculations. Two benchmark cases were conducted for code validation, including a pure fluid flow and a particle-laden flow. The performance on a single accelerator shows that a V100 GPU can achieve a speedup of 7.1–11.1 times, while a single DCU can achieve a speedup of 5.6–8.8 times, compared to a single Xeon CPU chip (32 cores). The performance on multiple accelerators shows that the parallel efficiency is 0.5–0.8 for weak scaling and 0.68–0.9 for strong scaling on up to 64 DCU cards, even for the dense flow (φ=20%). The peak performance reaches 179 giga lattice updates per second (GLUPS) on 256 DCU cards using 1 billion grids and 1 million particles. Finally, a large-scale simulation of a gas-solid flow with 1.6 billion grids and 1.6 million particles was conducted using only 32 DCU cards. This simulation shows that the present framework is promising for simulations of large-scale particle-laden flows in the upcoming exascale computing era.
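The remark on coalesced access with a Structure of Arrays layout can be illustrated with a small index calculation. This is a sketch in plain Python rather than CUDA or HIP, and it makes several assumptions not stated in the abstract: a lattice with `Q = 9` discrete velocities, a warp of 32 threads, and thread k handling cell k. The function names are hypothetical.

```python
Q = 9  # number of discrete velocities, e.g. a D2Q9 lattice (assumption)

def aos_index(cell, q, n_cells):
    """Array of Structures: the Q distributions of one cell are adjacent."""
    return cell * Q + q

def soa_index(cell, q, n_cells):
    """Structure of Arrays: one contiguous array per velocity direction."""
    return q * n_cells + cell

def warp_strides(index_fn, q, n_cells, warp_size=32):
    """Set of address strides between consecutive threads of one warp,
    when thread k reads distribution q of cell k."""
    addrs = [index_fn(cell, q, n_cells) for cell in range(warp_size)]
    return {b - a for a, b in zip(addrs, addrs[1:])}

soa = warp_strides(soa_index, q=3, n_cells=1024)  # unit stride: coalesced
aos = warp_strides(aos_index, q=3, n_cells=1024)  # stride Q: scattered
```

With the SoA layout, neighbouring threads touch neighbouring addresses, so one warp's reads can be served by few wide memory transactions; with AoS, each thread's read is `Q` elements away from its neighbour's.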
A multi-scale hardware and software architecture implementing the EMMS (energy-minimization multi-scale) paradigm is shown to be effective in the simulation of a two-dimensional gas-solid suspension. General purpose CPUs are employed for macro-scale control and optimization, and many integrated cores (MICs) operating in multiple-instruction multiple-data mode are used for a molecular dynamics simulation of the solid particles at the meso-scale. Many cores operating in single-instruction multiple-data mode, such as general purpose graphics processing units (GPGPUs), are employed for direct numerical simulation of the fluid flow at the micro-scale using the lattice Boltzmann method. This architecture is also expected to be efficient for the multi-scale simulation of other complex systems.
General purpose graphics processing units (GPGPUs) can improve computing performance considerably for regular applications. However, irregular memory access exists in many applications, and the benefits of graphics processing units (GPUs) are less substantial for irregular applications. In recent years, several studies have presented solutions for removing static irregular memory access. However, eliminating dynamic irregular memory access in software remains a serious challenge. A pure software solution without hardware extensions or offline profiling is proposed to eliminate dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection are suggested to reduce the number of memory transactions, thereby improving the performance of GPU kernels. To improve the efficiency of data reordering, the reordering operation is offloaded to the GPU to reduce overhead and data transfer. By concurrently executing the compute unified device architecture (CUDA) streams of data reordering and the data processing kernel, the overhead of data reordering can be reduced. After these optimizations, the volume of memory transactions is reduced by 16.7%-50% compared with CUSPARSE-based benchmarks, and the performance of irregular kernels is improved by 9.64%-34.9% on an NVIDIA Tesla P4 GPU.
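The idea of data reordering with index redirection can be sketched on the host side as follows. This is an illustrative model only, not the authors' GPU implementation: it assumes a warp of 32 threads and memory segments of 32 elements, and it counts a "transaction" as one distinct segment touched by one warp. The function names and the access pattern are hypothetical.

```python
def count_transactions(indices, warp_size=32, segment=32):
    """Count memory segments touched, warp by warp, for reads data[indices[i]]."""
    total = 0
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        total += len({i // segment for i in warp})  # distinct segments per warp
    return total

def reorder(data, indices):
    """Copy data into first-access order and redirect the indices to match."""
    new_pos = {}
    new_data = []
    for i in indices:
        if i not in new_pos:
            new_pos[i] = len(new_data)
            new_data.append(data[i])
    redirected = [new_pos[i] for i in indices]
    return new_data, redirected

# Hypothetical scattered access pattern over 1024 elements.
data = [x * 10 for x in range(1024)]
indices = [(i * 37) % 1024 for i in range(64)]

new_data, redirected = reorder(data, indices)
before = count_transactions(indices)
after = count_transactions(redirected)
```

The gathered values are unchanged (`new_data[redirected[i]]` equals `data[indices[i]]`), but the redirected accesses are contiguous, so each warp touches a single segment in this model. Whether the one-off cost of reordering pays off depends on how often the reordered data is reused, which is why the abstract overlaps the reordering with the processing kernel.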
This paper describes the design of a parallel fast convolution back-projection algorithm for radar image reconstruction. State-of-the-art general purpose graphics processing units (GPGPUs) were used to accelerate the processing. The implementation achieves much better performance than conventional processing systems, with a speedup of more than 890 times on NVIDIA Tesla C1060 supercomputing cards compared to an Intel P4 2.4 GHz CPU. Images of 256×256 pixels could be reconstructed within 6.3 s, which makes real-time imaging possible. Six platforms were tested and compared. The results show that the GPGPU supercomputing system has great potential for radar image processing.
Funding: Supported financially by the United Kingdom Engineering and Physical Sciences Research Council (Grant No. EP/L015811/1) and by the Foundation for Education and European Culture (founded by Nicos & Lydia Tricha).
Funding: Supported by the National Natural Science Foundation of China (Grant No. 10771216), the Research Grants Council of Hong Kong (Grant No. HKU 7010/06P), and the Scientific Research Foundation for Returned Overseas Chinese Scholars, State Education Ministry of China (Grant No. [2007]1108).
Funding: Ying-Tung Fok Education Foundation, NSFC, and Anhui Education Committee.
Funding: The National Natural Science Foundation of China (Nos. 61572508, 61272144, 61303065 and 61202121), the National High Technology Research and Development Program (863) of China (No. 2012AA010905), the Research Project of National University of Defense Technology (No. JC13-06-02), the Doctoral Fund of Ministry of Education of China (No. 20134307120028), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20114307120013).
Funding: Supported by the National Basic Research Program of China (No. 2012CB316502), the National High Technology Research and Development Program of China (No. 2009AA01A129), and the National Natural Science Foundation of China (No. 60921002).
Funding: Project (51274250) supported by the National Natural Science Foundation of China; Project (2012BAK09B02-05) supported by the National Key Technology R&D Program during the 12th Five-year Plan of China.
Funding: Supported by the National Key R&D Program of China (No. 2017YFA0402401), the National Natural Science Foundation of China (Grant No. 11522326), the National Basic Research Program of China (973 Program, Grant 2013CBA01503), and the Science Challenge Project (No. TZ2016005).
Funding: Supported in part by the National Natural Science Fund Committee and the Chinese Academy of Sciences astronomical union funds (Grant U1331113) and by the Special Program for Basic Research of the Ministry of Science and Technology, China (Grant 2014FY120300).
Funding: Partially supported by projects SB/S2HEP-001/2013, funded by DST (DB), and ISRO/RES/2/367/10-11, funded by ISRO, India.
Funding: supported by the National Natural Science Foundation of China (Grant No. 51876075) and by Wuhan Supercomputer Center in China.
Abstract: An efficient computing framework, PFlows, for fully resolved direct numerical simulations of particle-laden flows was accelerated on NVIDIA graphics processing units (GPUs) and GPU-like accelerator (DCU) cards. The framework couples the lattice Boltzmann method (LBM) for fluid flow with the immersed boundary method for fluid-particle interaction and the discrete element method for particle collision, using two fixed Eulerian meshes and one moving Lagrangian point mesh, respectively. All parts are accelerated by fine-grained parallelism using CUDA on GPUs and HIP on DCU cards; i.e., the calculation for each fluid grid point, each immersed boundary point, each particle motion, and each particle-pair collision is handled by one thread. Coalesced memory accesses to the LBM distribution functions, stored in a Structure of Arrays layout, are used to maximize utilization of hardware bandwidth. Parallel reduction in shared memory over the data of immersed boundary points is adopted to reduce global memory accesses when integrating the particle hydrodynamic force. MPI is further used for computing on heterogeneous architectures with multiple CPUs and GPUs/DCUs. Communication between adjacent processors is hidden by overlapping it with computation. Two benchmark cases were run for code validation: a pure fluid flow and a particle-laden flow. On a single accelerator, a V100 GPU achieves a speedup of 7.1–11.1 times and a single DCU achieves a speedup of 5.6–8.8 times compared to a single Xeon CPU chip (32 cores). On multiple accelerators, the parallel efficiency is 0.5–0.8 for weak scaling and 0.68–0.9 for strong scaling on up to 64 DCU cards, even for the dense flow (φ=20%). The peak performance reaches 179 giga lattice updates per second (GLUPS) on 256 DCU cards using 1 billion grids and 1 million particles. Finally, a large-scale simulation of a gas-solid flow with 1.6 billion grids and 1.6 million particles was conducted using only 32 DCU cards. This simulation shows that the present framework is promising for simulations of large-scale particle-laden flows in the upcoming exascale computing era.
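The Structure of Arrays layout mentioned in the abstract above can be illustrated with a small index calculation. This is only a sketch under stated assumptions, not code from PFlows: the D3Q19 velocity set, the warp size of 32, and the helper names `aos_index`, `soa_index`, and `warp_strides` are all hypothetical choices made for illustration. It shows why neighbouring threads that read the same lattice direction touch consecutive addresses in a Structure of Arrays layout but strided addresses in an Array of Structures layout.

```python
# Assumed lattice: D3Q19 (19 directions per cell); the abstract does not name one.
N_DIRS = 19


def aos_index(cell, q, n_cells, n_dirs=N_DIRS):
    """Array of Structures: the n_dirs values of one cell are stored together."""
    return cell * n_dirs + q


def soa_index(cell, q, n_cells, n_dirs=N_DIRS):
    """Structure of Arrays: the values of all cells for one direction are stored together."""
    return q * n_cells + cell


def warp_strides(index_fn, q, n_cells, warp_size=32):
    """Return the set of address strides between neighbouring threads of one warp.

    Thread t is assumed to process cell t and to read direction q.
    """
    addrs = [index_fn(cell, q, n_cells) for cell in range(warp_size)]
    return {b - a for a, b in zip(addrs, addrs[1:])}


n_cells = 1024
soa_strides = warp_strides(soa_index, q=5, n_cells=n_cells)
aos_strides = warp_strides(aos_index, q=5, n_cells=n_cells)
print("SoA strides:", soa_strides)
print("AoS strides:", aos_strides)
```

A stride of one element lets the hardware serve a whole warp from a few contiguous memory transactions, whereas a stride equal to the number of directions spreads the same reads over many more transactions.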
Funding: supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. 21225628, the Science Fund for Creative Research Groups of the National Natural Science Foundation of China under Grant No. 20821092, the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA07080100, and the National Natural Science Foundation of China under Grant No. 21206167
Abstract: A multi-scale hardware and software architecture implementing the EMMS (energy-minimization multi-scale) paradigm is proven to be effective in the simulation of a two-dimensional gas-solid suspension. General purpose CPUs are employed for macro-scale control and optimization, and many integrated cores (MICs) operating in multiple-instruction multiple-data mode are used for a molecular dynamics simulation of the solid particles at the meso-scale. Many cores operating in single-instruction multiple-data mode, such as general purpose graphics processing units (GPGPUs), are employed for direct numerical simulation of the fluid flow at the micro-scale using the lattice Boltzmann method. This architecture is also expected to be efficient for the multi-scale simulation of other complex systems.
Funding: Project supported by the National Key Research and Development Program of China (No. 2018YFB1003500).
Abstract: General purpose graphics processing units (GPGPUs) can improve computing performance considerably for regular applications. However, irregular memory access exists in many applications, and the benefits of graphics processing units (GPUs) are less substantial for irregular applications. In recent years, several studies have presented solutions to remove static irregular memory access. However, eliminating dynamic irregular memory access with software remains a serious challenge. A pure software solution without hardware extensions or offline profiling is proposed to eliminate dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection are suggested to reduce the number of memory transactions, thereby improving the performance of GPU kernels. To improve the efficiency of data reordering, the data-reordering operation is offloaded to the GPU to reduce overhead and data transfer. By concurrently executing the compute unified device architecture (CUDA) streams of the data-reordering and data-processing kernels, the overhead of data reordering can be reduced. After these optimizations, the volume of memory transactions is reduced by 16.7%-50% compared with CUSPARSE-based benchmarks, and the performance of irregular kernels is improved by 9.64%-34.9% on an NVIDIA Tesla P4 GPU.
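The idea of data reordering with index redirection described in the abstract above can be sketched on the CPU as follows. This is an illustrative model only, not the authors' implementation: the function names, the warp size of 32, and the segment size of 32 elements (for example, 4-byte elements in a 128-byte transaction) are assumptions. Data are copied into first-access order, the index array is redirected to the new positions, and the number of memory segments touched per warp is counted before and after.

```python
def count_transactions(indices, warp_size=32, segment=32):
    """Sum, over all warps, the number of distinct memory segments a warp touches.

    Assumes thread t of a warp reads element indices[t] and that one memory
    transaction serves `segment` consecutive elements (an assumed size).
    """
    total = 0
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        total += len({i // segment for i in warp})
    return total


def reorder_and_redirect(data, indices):
    """Copy the accessed elements into first-access order and redirect the indices.

    Returns (reordered, redirected) such that reordered[redirected[k]] equals
    data[indices[k]] for every k.
    """
    new_pos = {}      # original index -> position in the reordered array
    reordered = []
    redirected = []
    for i in indices:
        if i not in new_pos:
            new_pos[i] = len(reordered)
            reordered.append(data[i])
        redirected.append(new_pos[i])
    return reordered, redirected


# Hypothetical irregular access pattern: a stride of 97 over 4096 elements.
data = [10 * x for x in range(4096)]
indices = [(k * 97) % 4096 for k in range(1024)]

reordered, redirected = reorder_and_redirect(data, indices)
before = count_transactions(indices)
after = count_transactions(redirected)
print("transactions before:", before, "after:", after)
```

The gathered values are unchanged by the transformation; only the access pattern becomes contiguous. In practice the cost of building the reordered array has to be paid back by the kernel that consumes it, which is why the abstract discusses offloading the reordering to the GPU and overlapping it with the processing kernel.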
Abstract: This paper describes a parallel fast convolution back-projection algorithm design for radar image reconstruction. State-of-the-art general purpose graphics processing units (GPGPUs) were utilized to accelerate the processing. The implementation achieves much better performance than conventional processing systems, with a speedup of more than 890 times on NVIDIA Tesla C1060 supercomputing cards compared to an Intel P4 2.4 GHz CPU. Images of 256×256 pixels could be reconstructed within 6.3 s, which makes real-time imaging possible. Six platforms were tested and compared. The results show that the GPGPU supercomputing system has great potential for radar image processing.