Graph databases have gained widespread adoption in various industries and have been utilized in a range of applications,including financial risk assessment,commodity recommendation,and data lineage tracking.While the ...Graph databases have gained widespread adoption in various industries and have been utilized in a range of applications,including financial risk assessment,commodity recommendation,and data lineage tracking.While the principles and design of these databases have been the subject of some investigation,there remains a lack of comprehensive examination of aspects such as storage layout,query language,and deployment.The present study focuses on the design and implementation of graph storage layout,with a particular emphasis on tree-structured key-value stores.We also examine different design choices in the graph storage layer and present our findings through the development of TuGraph,a highly efficient single-machine graph database that significantly outperforms well-known Graph DataBase Management System(GDBMS).Additionally,TuGraph demonstrates superior performance in the Linked Data Benchmark Council(LDBC)Social Network Benchmark(SNB)interactive benchmark.展开更多
Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construct...Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construction algorithms. It consists of several algorithms, of which the best one is a hybrid between a traditional recursive implementation and an iterative implementation which uses more memory but involves less write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated switching criterion between the two algorithms, we are able to get a performance gain even when all statistical information fits in memory. Evaluations show that our method can achieve a performance boost of 2.8 times in average than the traditional recursive implementation.展开更多
A Weighted Essentially Non-Oscillatory scheme(WENO) is a solution to hyperbolic conservation laws,suitable for solving high-density fluid interface instability with strong intermittency. These problems have a large an...A Weighted Essentially Non-Oscillatory scheme(WENO) is a solution to hyperbolic conservation laws,suitable for solving high-density fluid interface instability with strong intermittency. These problems have a large and complex flow structure. To fully utilize the computing power of High Performance Computing(HPC) systems, it is necessary to develop specific methodologies to optimize the performance of applications based on the particular system’s architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO on Sunway TaihuLight. We analyzed characteristics of kernel functions, and proposed an appropriate heterogeneous parallel model. We also figured out the best division strategy for computing tasks,and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm can achieve up to 172× speedup on one single node, and additional 58× speedup on 64 nodes, with nearly linear scalability.展开更多
Despite efficient parallelism in the solution of physical parameterization in the Global/Regional Assimilation and Prediction System(GRAPES),the Helmholtz equation in the dynamic core,with the increase of resolution,c...Despite efficient parallelism in the solution of physical parameterization in the Global/Regional Assimilation and Prediction System(GRAPES),the Helmholtz equation in the dynamic core,with the increase of resolution,can hardly achieve sufficient parallelism in the solving process due to a large amount of communication and irregular access.In this paper,optimizing the Helmholtz equation solution for better performance and higher efficiency has been an urgent task.An optimization scheme for the parallel solution of the Helmholtz equation is proposed in this paper.Specifically,the geometrical multigrid optimization strategy is designed by taking advantage of the data anisotropy of grid points near the pole and the isotropy of those near memory equator in the Helmholtz equation,and the Incomplete LU(ILU)decomposition preconditioner is adopted to speed up the convergence of the improved Generalized Conjugate Residual(GCR),which effectively reduces the number of iterations and the computation time.The overall solving performance of the Helmholtz equation is improved by thread-level parallelism,vectorization,and reuse of data in the cache.The experimental results show that the proposed optimization scheme can effectively eliminate the bottleneck of the Helmholtz equation as regards the solving speed.Considering the test results on a 10-node two-way server,the solution of the Helmholtz equation,compared with the original serial version,is accelerated by 100,with one-third of iterations reduced.展开更多
Online Judge(OJ) systems are a basic and important component of computer education. Here, we present MetaOJ, an OJ system that can be used for holding massive programming tests online. MetaOJ is designed to create a d...Online Judge(OJ) systems are a basic and important component of computer education. Here, we present MetaOJ, an OJ system that can be used for holding massive programming tests online. MetaOJ is designed to create a distributed, fault-tolerant, and easy-to-scale OJ system from an existing ordinary OJ system by adding several interfaces into it and creating multiple instances of it. Our case on modifying the TUOJ system shows that the modification adds no more than 3% lines of code and the performance loss on a single OJ instance is no more than 12%. We also introduce mechanisms to integrate the system with cloud infrastructure to automate the deployment process. MetaOJ provides a solution for those OJ systems that are designed for a specific programming contest and are now facing performance bottlenecks.展开更多
Weibo is the Twitter counterpart in China that has attracted hundreds of millions of users. We crawled an almost complete Weibo user network that contains 222 million users and 27 billion links in 2013. This paper ana...Weibo is the Twitter counterpart in China that has attracted hundreds of millions of users. We crawled an almost complete Weibo user network that contains 222 million users and 27 billion links in 2013. This paper analyzes the structural properties of this network, and compares it with a Twitter user network. The topological properties we studied include the degree distributions, connected components, distance distributions, reciprocity,clustering coefficient, Page Rank centrality, and degree assortativity. We find that Weibo users have a higher diversity index, higher Gini index, but a lower reciprocity and clustering coefficient for most of the nodes. A surprising observation is that the reciprocity of Weibo is only about a quarter of the reciprocity of the Twitter user network. We also show that Weibo adoption rate correlates with economic development positively, and Weibo network can be used to quantify the connections between provinces and regions in China. In particular, point-wise mutual information is shown to be accurate in quantifying the strength of connections. We developed an interactive analyzing software framework for this study, and released the data and code online.展开更多
Structure Data Layout Optimization (SDLO) is a prevailing compiler optimization technique to improve cache efficiency. Structure transformation is a critical step for SDLO. Diversity of transformation methods and ex...Structure Data Layout Optimization (SDLO) is a prevailing compiler optimization technique to improve cache efficiency. Structure transformation is a critical step for SDLO. Diversity of transformation methods and existence of complex data types are major challenges for structure transformation. We have designed and implemented STrans, a well-defined system which provides controllable and comprehensive functionality on structure transformation. Compared to known systems, it has less limitation on data types for transformation. In this paper we give formal definition of the approach STrans transforms data types. We have also designed Transformation Specification Language, a mini language to configure how to transform structures, which can be either manually tuned or generated by compiler. STrans supports three kinds of transformation methods, i.e., splitting, peeling, and pool-splitting, and works well on different combinations of compound data types. STrans is the transformation system used in ASLOP and is well tested for all benchmarks for ASLOR展开更多
Traditionally, complex engineering applications (CEAs), which consist of numerous components (software) and require a large amount of computing resources, usu- ally run in dedicated clusters or high performance co...Traditionally, complex engineering applications (CEAs), which consist of numerous components (software) and require a large amount of computing resources, usu- ally run in dedicated clusters or high performance computing (HPC) centers. Nowadays, Cloud computing system with the ability of providing massive computing resources and cus- tomizable execution environment is becoming an attractive option for CEAs. As a new type on Cloud applications, CEA also brings the challenges of dealing with Cloud resources. In this paper, we provide a comprehensive survey of Cloud resource management research for CEAs. The survey puts forward two important questions: 1) what are the main chal- lenges for CEAs to run in Clouds? and 2) what are the prior research topics addressing these challenges? We summarize and highlight the main challenges and prior research topics. Our work can be probably helpful to those scientists and en- gineers who are interested in running CEAs in Cloud envi- ronment.展开更多
In recent years,Apache Spark has become the de facto standard for big data processing.SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language(SQL).SparkSQL provides conven...In recent years,Apache Spark has become the de facto standard for big data processing.SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language(SQL).SparkSQL provides convenient data processing interfaces.Despite its efficient optimizer,SparkSQL still suffers from the inefficiency of Spark resulting from Java virtual machine and the unnecessary data serialization and deserialization.Adopting native languages such as C++could help to avoid such bottlenecks.Benefiting from a bare-metal runtime environment and template usage,systems with C++interfaces usually achieve superior performance.However,the complexity of native languages also increases the required programming and debugging efforts.In this work,we present LotusSQL,an engine to provide SQL support for dataset abstraction on a native backend Lotus.We employ a convenient SQL processing framework to deal with frontend jobs.Advanced query optimization technologies are added to improve the quality of execution plans.Above the storage design and user interface of the compute engine,LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend.Evaluation results show that LotusSQL achieves a speedup of up to 9 in certain queries and outperforms Spark SQL in a standard query benchmark by more than 2 on average.展开更多
The plethora of complex Artificial Intelligence(AI)algorithms and available High-Performance Computing(HPC)power stimulates the expeditious development of AI components with heterogeneous designs.Consequently,the need...The plethora of complex Artificial Intelligence(AI)algorithms and available High-Performance Computing(HPC)power stimulates the expeditious development of AI components with heterogeneous designs.Consequently,the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged.In particular,the de facto HPC benchmark,LINPACK,cannot reflect the AI computing power and input/output performance without a representative workload.Current popular AI benchmarks,such as MLPerf,have a fixed problem size and therefore limited scalability.To address these issues,we propose an end-to-end benchmark suite utilizing automated machine learning,which not only represents real AI scenarios,but also is auto-adaptively scalable to various scales of machines.We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations.We utilize Operations Per Second(OPS),which is measured in an analytical and systematic approach,as a major metric to quantify the AI performance.We perform evaluations on various systems to ensure the benchmark’s stability and scalability,from 4 nodes with 32 NVIDIA Tesla T4(56.1 Tera-OPS measured)up to 512 nodes with 4096 Huawei Ascend 910(194.53 Peta-OPS measured),and the results show near-linear weak scalability.With a flexible workload and single metric,AIPerf can easily scale on and rank AI-HPC,providing a powerful benchmark suite for the coming supercomputing era.展开更多
As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis....As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in a graph.We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the abovementioned goals. First, by carefully choosing the time split policy, Auxo achieves linear complexity in both space usage and query time. Second, graph splitting further improves the worst-case query time, and reduces the performance variance introduced by splitting operations. Third, Auxo optimizes the data layout inside chunks, thereby significantly improving the performance of traverse-based graph queries. Experimental evaluation showed that Auxo achieved 2:9 to 12:1 improvement for global queries, and 1:7 to 2:7 improvement for local queries, as compared with state-of-the-art open-source solutions.展开更多
文摘Graph databases have gained widespread adoption in various industries and have been utilized in a range of applications,including financial risk assessment,commodity recommendation,and data lineage tracking.While the principles and design of these databases have been the subject of some investigation,there remains a lack of comprehensive examination of aspects such as storage layout,query language,and deployment.The present study focuses on the design and implementation of graph storage layout,with a particular emphasis on tree-structured key-value stores.We also examine different design choices in the graph storage layer and present our findings through the development of TuGraph,a highly efficient single-machine graph database that significantly outperforms well-known Graph DataBase Management System(GDBMS).Additionally,TuGraph demonstrates superior performance in the Linked Data Benchmark Council(LDBC)Social Network Benchmark(SNB)interactive benchmark.
文摘Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construction algorithms. It consists of several algorithms, of which the best one is a hybrid between a traditional recursive implementation and an iterative implementation which uses more memory but involves less write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated switching criterion between the two algorithms, we are able to get a performance gain even when all statistical information fits in memory. Evaluations show that our method can achieve a performance boost of 2.8 times in average than the traditional recursive implementation.
基金supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA015306)the Science and Technology Plan of Beijing Municipality (No. Z161100000216147)+2 种基金the National Natural Science Foundation of China (No. 61762074)Youth Foundation Program of Qinghai University (No. 2016-QGY-5)the National Natural Science Foundation of Qinghai Province (No. 2019-ZJ7034)
文摘A Weighted Essentially Non-Oscillatory scheme(WENO) is a solution to hyperbolic conservation laws,suitable for solving high-density fluid interface instability with strong intermittency. These problems have a large and complex flow structure. To fully utilize the computing power of High Performance Computing(HPC) systems, it is necessary to develop specific methodologies to optimize the performance of applications based on the particular system’s architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO on Sunway TaihuLight. We analyzed characteristics of kernel functions, and proposed an appropriate heterogeneous parallel model. We also figured out the best division strategy for computing tasks,and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm can achieve up to 172× speedup on one single node, and additional 58× speedup on 64 nodes, with nearly linear scalability.
基金partially supported by the Open Project of State Key Laboratory of Plateau Ecology and Agricuture,Qinghai University(No.2020-ZZ-03)the Qinghai Province High-End Innovative Thousand Talents Program Leading Talents+1 种基金the National Natural Science Foundation of China(Nos.61762074 and 61962051)the National Natural Science Foundation of Qinghai Province(No.2019-ZJ-7034)。
文摘Despite efficient parallelism in the solution of physical parameterization in the Global/Regional Assimilation and Prediction System(GRAPES),the Helmholtz equation in the dynamic core,with the increase of resolution,can hardly achieve sufficient parallelism in the solving process due to a large amount of communication and irregular access.In this paper,optimizing the Helmholtz equation solution for better performance and higher efficiency has been an urgent task.An optimization scheme for the parallel solution of the Helmholtz equation is proposed in this paper.Specifically,the geometrical multigrid optimization strategy is designed by taking advantage of the data anisotropy of grid points near the pole and the isotropy of those near memory equator in the Helmholtz equation,and the Incomplete LU(ILU)decomposition preconditioner is adopted to speed up the convergence of the improved Generalized Conjugate Residual(GCR),which effectively reduces the number of iterations and the computation time.The overall solving performance of the Helmholtz equation is improved by thread-level parallelism,vectorization,and reuse of data in the cache.The experimental results show that the proposed optimization scheme can effectively eliminate the bottleneck of the Helmholtz equation as regards the solving speed.Considering the test results on a 10-node two-way server,the solution of the Helmholtz equation,compared with the original serial version,is accelerated by 100,with one-third of iterations reduced.
文摘Online Judge(OJ) systems are a basic and important component of computer education. Here, we present MetaOJ, an OJ system that can be used for holding massive programming tests online. MetaOJ is designed to create a distributed, fault-tolerant, and easy-to-scale OJ system from an existing ordinary OJ system by adding several interfaces into it and creating multiple instances of it. Our case on modifying the TUOJ system shows that the modification adds no more than 3% lines of code and the performance loss on a single OJ instance is no more than 12%. We also introduce mechanisms to integrate the system with cloud infrastructure to automate the deployment process. MetaOJ provides a solution for those OJ systems that are designed for a specific programming contest and are now facing performance bottlenecks.
基金supported by NSERC(Natural Sciences and Engineering Research Council of Canada)Discovery grant(No.RGPIN-2014-04463)the National High-Tech Research and Development(863)Program of China(No.2012AA010903)the National Natural Science Foundation of China(Nos.61433008 and U1435216)
文摘Weibo is the Twitter counterpart in China that has attracted hundreds of millions of users. We crawled an almost complete Weibo user network that contains 222 million users and 27 billion links in 2013. This paper analyzes the structural properties of this network, and compares it with a Twitter user network. The topological properties we studied include the degree distributions, connected components, distance distributions, reciprocity,clustering coefficient, Page Rank centrality, and degree assortativity. We find that Weibo users have a higher diversity index, higher Gini index, but a lower reciprocity and clustering coefficient for most of the nodes. A surprising observation is that the reciprocity of Weibo is only about a quarter of the reciprocity of the Twitter user network. We also show that Weibo adoption rate correlates with economic development positively, and Weibo network can be used to quantify the connections between provinces and regions in China. In particular, point-wise mutual information is shown to be accurate in quantifying the strength of connections. We developed an interactive analyzing software framework for this study, and released the data and code online.
基金supported by the National Natural Science Foundation of China(No.61133006)the National High-Tech Research and Development(863)Program of China(No.2012AA010901)
文摘Structure Data Layout Optimization (SDLO) is a prevailing compiler optimization technique to improve cache efficiency. Structure transformation is a critical step for SDLO. Diversity of transformation methods and existence of complex data types are major challenges for structure transformation. We have designed and implemented STrans, a well-defined system which provides controllable and comprehensive functionality on structure transformation. Compared to known systems, it has less limitation on data types for transformation. In this paper we give formal definition of the approach STrans transforms data types. We have also designed Transformation Specification Language, a mini language to configure how to transform structures, which can be either manually tuned or generated by compiler. STrans supports three kinds of transformation methods, i.e., splitting, peeling, and pool-splitting, and works well on different combinations of compound data types. STrans is the transformation system used in ASLOP and is well tested for all benchmarks for ASLOR
基金We thank the anonymous reviewers for their insight- ful comments and suggestions. This work was supported by the National Science Foundation of China (Grant Nos. 61232008 and 61472151), Na- tional 863 Hi-Tech Research and Development Program (2015AA01A203 and 2014AA01A302), the Fundamental Research Funds for the Central Universities (2015TS067), Anhui Provincial Natural Science Foundation (1408085MF126).
文摘Traditionally, complex engineering applications (CEAs), which consist of numerous components (software) and require a large amount of computing resources, usu- ally run in dedicated clusters or high performance computing (HPC) centers. Nowadays, Cloud computing system with the ability of providing massive computing resources and cus- tomizable execution environment is becoming an attractive option for CEAs. As a new type on Cloud applications, CEA also brings the challenges of dealing with Cloud resources. In this paper, we provide a comprehensive survey of Cloud resource management research for CEAs. The survey puts forward two important questions: 1) what are the main chal- lenges for CEAs to run in Clouds? and 2) what are the prior research topics addressing these challenges? We summarize and highlight the main challenges and prior research topics. Our work can be probably helpful to those scientists and en- gineers who are interested in running CEAs in Cloud envi- ronment.
文摘In recent years,Apache Spark has become the de facto standard for big data processing.SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language(SQL).SparkSQL provides convenient data processing interfaces.Despite its efficient optimizer,SparkSQL still suffers from the inefficiency of Spark resulting from Java virtual machine and the unnecessary data serialization and deserialization.Adopting native languages such as C++could help to avoid such bottlenecks.Benefiting from a bare-metal runtime environment and template usage,systems with C++interfaces usually achieve superior performance.However,the complexity of native languages also increases the required programming and debugging efforts.In this work,we present LotusSQL,an engine to provide SQL support for dataset abstraction on a native backend Lotus.We employ a convenient SQL processing framework to deal with frontend jobs.Advanced query optimization technologies are added to improve the quality of execution plans.Above the storage design and user interface of the compute engine,LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend.Evaluation results show that LotusSQL achieves a speedup of up to 9 in certain queries and outperforms Spark SQL in a standard query benchmark by more than 2 on average.
文摘The plethora of complex Artificial Intelligence(AI)algorithms and available High-Performance Computing(HPC)power stimulates the expeditious development of AI components with heterogeneous designs.Consequently,the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged.In particular,the de facto HPC benchmark,LINPACK,cannot reflect the AI computing power and input/output performance without a representative workload.Current popular AI benchmarks,such as MLPerf,have a fixed problem size and therefore limited scalability.To address these issues,we propose an end-to-end benchmark suite utilizing automated machine learning,which not only represents real AI scenarios,but also is auto-adaptively scalable to various scales of machines.We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations.We utilize Operations Per Second(OPS),which is measured in an analytical and systematic approach,as a major metric to quantify the AI performance.We perform evaluations on various systems to ensure the benchmark’s stability and scalability,from 4 nodes with 32 NVIDIA Tesla T4(56.1 Tera-OPS measured)up to 512 nodes with 4096 Huawei Ascend 910(194.53 Peta-OPS measured),and the results show near-linear weak scalability.With a flexible workload and single metric,AIPerf can easily scale on and rank AI-HPC,providing a powerful benchmark suite for the coming supercomputing era.
基金supported by the National High-Tech Development Plan of China (No. 2015AA015306)the National Natural Science Foundation of China (No. 61772302)
文摘As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in a graph.We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the abovementioned goals. First, by carefully choosing the time split policy, Auxo achieves linear complexity in both space usage and query time. Second, graph splitting further improves the worst-case query time, and reduces the performance variance introduced by splitting operations. Third, Auxo optimizes the data layout inside chunks, thereby significantly improving the performance of traverse-based graph queries. Experimental evaluation showed that Auxo achieved 2:9 to 12:1 improvement for global queries, and 1:7 to 2:7 improvement for local queries, as compared with state-of-the-art open-source solutions.