Traditionally, complex engineering applications (CEAs), which consist of numerous components (software) and require a large amount of computing resources, usually run in dedicated clusters or high performance computing (HPC) centers. Nowadays, Cloud computing systems, with their ability to provide massive computing resources and customizable execution environments, are becoming an attractive option for CEAs. As a new type of Cloud application, CEAs also bring challenges in dealing with Cloud resources. In this paper, we provide a comprehensive survey of Cloud resource management research for CEAs. The survey puts forward two important questions: 1) what are the main challenges for CEAs to run in Clouds? and 2) what are the prior research topics addressing these challenges? We summarize and highlight the main challenges and prior research topics. Our work may be helpful to scientists and engineers who are interested in running CEAs in Cloud environments.
The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect the AI computing power and input/output performance without a representative workload. Current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios, but also is auto-adaptively scalable to various scales of machines. We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), which is measured in an analytical and systematic approach, as a major metric to quantify the AI performance. We perform evaluations on various systems to ensure the benchmark's stability and scalability, from 4 nodes with 32 NVIDIA Tesla T4 (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and single metric, AIPerf can easily scale on and rank AI-HPC, providing a powerful benchmark suite for the coming supercomputing era.
Funding: We thank the anonymous reviewers for their insightful comments and suggestions. This work was supported by the National Science Foundation of China (Grant Nos. 61232008 and 61472151), the National 863 Hi-Tech Research and Development Program (2015AA01A203 and 2014AA01A302), the Fundamental Research Funds for the Central Universities (2015TS067), and the Anhui Provincial Natural Science Foundation (1408085MF126).