Funding: Supported by the National Science and Technology Support Project (No. 2012BAH01F02) from the Ministry of Science and Technology of China and the Director Fund (No. IS201116002) from the Institute of Seismology, CEA.
Abstract: This paper designs and develops a framework on a distributed computing platform for massive multi-source spatial data using a column-oriented database (HBase). The platform consists of four layers: an ETL (extraction, transformation, loading) tier, a data processing tier, a data storage tier, and a data display tier. Together they provide long-term storage, real-time analysis, and querying of massive data. Finally, a real-dataset cluster is simulated, consisting of 39 nodes (2 master nodes and 37 data nodes). It is used for function tests of the data import module and the real-time query module, and for performance tests of HDFS I/O, the MapReduce cluster, batch loading, and real-time querying of massive data. The test results indicate that the platform achieves high performance in terms of response time and linear scalability.
Abstract: There is a strong push in industry toward more feasible and viable tools for storing data of fast-growing volume, velocity, and diversity, termed "big data". The structural shift of the storage mechanism from traditional data management systems to NoSQL technology is driven by the need to meet big data storage requirements. However, the available big data storage technologies do not efficiently provide consistent, scalable, and available solutions for continuously growing heterogeneous data. Storage is the preliminary step of big data analytics for real-world applications such as scientific experiments, healthcare, social networks, and e-business. So far, Amazon, Google, and Apache are among the industry standards in providing big data storage solutions, yet the literature does not report an in-depth survey of the storage technologies available for big data that investigates their performance and magnitude gains. The primary objective of this paper is to conduct a comprehensive investigation of state-of-the-art storage technologies for big data. A well-defined taxonomy of big data storage technologies is presented to help data analysts and researchers understand and select a storage mechanism that better fits their needs. To evaluate the performance of different storage architectures, we compare and analyze the existing approaches using Brewer's CAP theorem. The significance and applications of storage technologies, and their support for other categories, are discussed. Several future research challenges are highlighted with the intention of expediting the deployment of a reliable and scalable storage system.
Funding: Supported by the Science and Technology Planning Projects of Guangdong Province (Grant No. 2018B010109008) and the National Key R&D Program of China (Grant No. 2018YFC0116500).
Abstract: Medical artificial intelligence (AI) and big data technology have advanced rapidly in recent years, and they are now routinely used for image-based diagnosis. China has a massive amount of medical data. However, uniform criteria for medical data quality have yet to be established. Therefore, this review aimed to develop a standardized and detailed set of quality criteria for medical data collection, storage, annotation, and management related to medical AI. This would greatly improve the process of medical data resource sharing and the use of AI in clinical medicine.
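Abstract: In today's information age, the complexity of data keeps increasing, and traditional relational databases face challenges in large-scale data storage and processing. Non-relational databases (Not Only SQL, NoSQL), as a new approach to storing and processing data, have attracted wide attention and achieved notable success in the field of distributed storage. This article focuses on distributed storage methods for non-relational databases based on big data technology and evaluates them experimentally. It finds that they have advantages in scalability and security, which can serve as a reference for related research.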