Funding: Supported by the Student Scheme provided by Universiti Kebangsaan Malaysia with the Code TAP-20558.
Abstract: A data lake (DL) is a vast repository of data. It accumulates substantial volumes of data and employs advanced analytics to correlate data from diverse origins containing semi-structured, structured, and unstructured information. These systems use a flat architecture and run different types of data analytics. NoSQL databases are nontabular and store data differently from relational tables. They come in various forms, including key-value, document, wide-column, and graph stores, each based on its own data model. They offer simpler scalability and generally outperform traditional relational databases. While NoSQL databases can store diverse data types, they lack full support for the atomicity, consistency, isolation, and durability (ACID) features found in relational databases. Consequently, machine learning approaches become necessary to categorize complex structured query language (SQL) queries. Results indicate that the most frequently used automatic classification technique in processing SQL queries on NoSQL databases is machine learning-based classification. Overall, this study provides an overview of the automatic classification techniques used in processing SQL queries on NoSQL databases. Understanding these techniques can aid in the development of effective and efficient NoSQL database applications.
Abstract: An outsourced database is a database service provided by cloud computing companies. Using an outsourced database can reduce hardware and software costs and provide more efficient and reliable data processing capacity. However, outsourced databases still face challenges. If the service provider is not sufficiently trustworthy, data leakage is possible. The data may contain users' private information, so data leakage may lead to privacy leaks. Protecting the privacy of data in the outsourced database therefore becomes very important. In the past, scholars have proposed k-anonymity to protect data privacy in databases. It makes data anonymous to avoid privacy leaks. However, k-anonymity has some problems: it is irreversible and vulnerable to homogeneity attacks and background knowledge attacks. Later studies addressed homogeneity and background knowledge attacks, but they still cannot recover the original data. In this paper, we propose a data anonymity method that is reversible and also prevents those two attacks. Our study is based on the proposed r-transform, which can be applied to numeric attributes in the outsourced database. In the experiments, we discuss the time required to anonymize and recover data. Furthermore, we investigate the defense against homogeneity attacks and background knowledge attacks. Finally, we summarize the proposed method and future research.
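The k-anonymity property and the homogeneity weakness mentioned in this abstract can be illustrated with a minimal Python sketch. It does not implement the paper's r-transform, which the abstract does not describe; the function names, column names, and sample rows are hypothetical and chosen only for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def homogeneous_groups(rows, quasi_identifiers, sensitive):
    """Return the equivalence classes whose rows all share one sensitive value.

    Such classes are exposed to a homogeneity attack: knowing the
    quasi-identifiers is enough to learn the sensitive value.
    """
    seen = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        seen.setdefault(key, set()).add(row[sensitive])
    return [key for key, values in seen.items() if len(values) == 1]


# Hypothetical, already-generalized records (illustrative data only).
rows = [
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "40-49", "zip": "479**", "disease": "cancer"},
    {"age": "40-49", "zip": "479**", "disease": "flu"},
]

print(is_k_anonymous(rows, ["age", "zip"], 2))
print(homogeneous_groups(rows, ["age", "zip"], "disease"))
```

In this toy table every group has two rows, so it is 2-anonymous, yet the first group still reveals the sensitive value because both of its rows share it.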
Funding: Supported by the National Magnetic Confinement Fusion Science Program of China (No. 2014GB103000).
Abstract: A disruption database and a disruption warning database for the EAST tokamak have been established by a disruption research group. The disruption database, based on Structured Query Language (SQL), comprises 41 disruption parameters, which include current quench characteristics, EFIT equilibrium characteristics, kinetic parameters, halo currents, and vertical motion. Presently, most disruption databases are based on plasma experiments in non-superconducting tokamak devices. The purposes of the EAST database are to find disruption characteristics and disruption statistics for the fully superconducting tokamak EAST, to elucidate the physics underlying tokamak disruptions, to explore the influence of disruptions on superconducting magnets, and to extrapolate toward future burning plasma devices. In order to quantitatively assess the usefulness of various plasma parameters for predicting disruptions, an SQL database similar to that of Alcator C-Mod has been created for EAST by compiling values for a number of proposed disruption-relevant parameters sampled from all plasma discharges in the 2015 campaign. Detailed statistical results and analysis of the two databases on the EAST tokamak are presented.
Abstract: Rockburst is an important phenomenon that has affected many deep underground mines around the world. An understanding of this phenomenon is relevant to the management of such events, which can lead to saving both costs and lives. Laboratory experiments are one way to obtain a deeper and better understanding of the mechanisms of rockburst. In a previous study by these authors, a database of rockburst laboratory tests was created; in addition, with the use of data mining (DM) techniques, models to predict rockburst maximum stress and rockburst risk indexes were developed. In this paper, we focus on the analysis of a database of in situ cases of rockburst in order to build influence diagrams, list the factors that interact in the occurrence of rockburst, and understand the relationships between these factors. The in situ rockburst database was further analyzed using different DM techniques ranging from artificial neural networks (ANNs) to naive Bayesian classifiers. The aim was to predict the type of rockburst (that is, the rockburst level) based on geologic and construction characteristics of the mine or tunnel. Conclusions are drawn at the end of the paper.
Funding: Supported by the "100-Talent Program" of the Chinese Academy of Sciences; the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB13040500); and the National High-tech R&D Program (863 Program; Grant No. 2012AA020409) of the Ministry of Science and Technology of China, awarded to ZZ.
Abstract: The completion of the Human Genome Project lays a foundation for systematically studying the human genome, from evolutionary history to precision medicine against diseases. With the explosive growth of biological data, an increasing number of biological databases have been developed in aid of human-related research. Here we present a collection of human-related biological databases and provide a mini-review by classifying them into different categories according to their data types. As human-related databases continue to grow not only in count but also in volume, challenges lie ahead in big data storage, processing, exchange and curation.
Funding: Supported by the National Natural Science Foundation of China (U1431227) and the Guangzhou Science and Technology Planning Project (201604010037).
Abstract: Using a subset of observed stars in a CCD image to find their corresponding matched stars in a stellar catalog is an important issue in astronomical research. Subgraph-isomorphism-based algorithms are the most widely used methods in star catalog matching. When more subgraph features are provided, the CCD images are recognized better. However, when the navigation feature database is large, the method requires more time to match the observing model. To solve this problem, this study investigates and improves subgraph isomorphic matching algorithms. We present an algorithm based on a locality-sensitive hashing technique, which allocates quadrilateral models in the navigation feature database into different hash buckets and reduces the search range to the bucket in which the observed quadrilateral model is located. Experimental results indicate the effectiveness of our method.
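The bucketing idea in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' algorithm: it assumes each quadrilateral model is summarized by a short numeric feature vector, and it uses a generic random-projection hash of the form floor((a·v + b) / w). The class name `QuadLSH`, the feature values, and the parameters `num_hashes` and `width` are all hypothetical.

```python
import math
import random
from collections import defaultdict


class QuadLSH:
    """Toy locality-sensitive hash index for quadrilateral feature vectors."""

    def __init__(self, dim, num_hashes=4, width=0.05, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random Gaussian projection and one random offset per hash function.
        self.projections = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)
        ]
        self.offsets = [rng.uniform(0.0, width) for _ in range(num_hashes)]
        self.buckets = defaultdict(list)

    def key(self, features):
        """Map a feature vector to a bucket key; nearby vectors tend to collide."""
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, features)) + b) / self.width)
            for proj, b in zip(self.projections, self.offsets)
        )

    def add(self, quad_id, features):
        """Store a catalog quadrilateral in its bucket."""
        self.buckets[self.key(features)].append(quad_id)

    def candidates(self, features):
        """Return only the catalog quadrilaterals sharing the observed one's bucket."""
        return self.buckets.get(self.key(features), [])


# Hypothetical catalog of quadrilateral features (illustrative values only).
catalog = {
    "q1": [0.10, 0.22, 0.35, 0.41],
    "q2": [0.80, 0.15, 0.62, 0.05],
    "q3": [0.33, 0.47, 0.12, 0.90],
}
index = QuadLSH(dim=4)
for quad_id, feats in catalog.items():
    index.add(quad_id, feats)

print(index.candidates(catalog["q1"]))
```

A lookup then compares the observed quadrilateral only against the members of one bucket instead of the whole feature database. A noisy observation can land in a neighbouring bucket, which practical schemes usually offset by probing several hash tables.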
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 51327803 and 51406041) and the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2014090).
Abstract: Considering features of stellar spectral radiation and sky surveys, we established a computational model for stellar effective temperatures, detected angular parameters and gray rates. Using known stellar flux data in some bands, we estimated stellar effective temperatures and detected angular parameters using stochastic particle swarm optimization (SPSO). We first verified the reliability of SPSO, and then determined reasonable parameters that produced highly accurate estimates under certain gray deviation levels. Finally, we calculated 177,860 stellar effective temperatures and detected angular parameters using data from the Midcourse Space Experiment (MSX) catalog. These derived stellar effective temperatures were accurate when compared with known values from the literature. This research makes full use of catalog data and presents an original technique for studying stellar characteristics. It proposes a novel method for calculating stellar effective temperatures and detected angular parameters, and provides theoretical and practical data for finding information about radiation in any band.
Funding: Supported by the National Natural Science Foundation of China.
Abstract: We compare the performance of Bayesian Belief Networks (BBN), Multilayer Perceptron (MLP) networks and Alternating Decision Trees (ADtree) on separating quasars from stars with the database from the 2MASS and FIRST survey catalogs. Given a training sample of sources of known object types, the classifiers are trained to separate quasars from stars. The features important for classification are selected according to the statistical properties of the sample. We compare the classification results with and without feature selection. Experiments show that the results with feature selection are better than those without feature selection. From the high accuracy found, it is concluded that these automated methods are robust and effective for classifying point sources. They may all be applied to large survey projects (e.g. selecting input catalogs) and to other astronomical issues, such as the parameter measurement of stars and the redshift estimation of galaxies and quasars.
Funding: These studies were supported by Cooperative Agreements from the National Institute of Allergy and Infectious Diseases (NIAID) for the West African International Center of Excellence for Malaria Research (ICEMR): NIAID U19 AI 089696 and U19 AI 129387 (from 2010 to 2017 and 2017 to 2024, respectively). Development of Case Report Forms, Standard Operating Procedures and other bilingual documentation in English and French was performed in collaboration with Aliou Sissako, Lansana Sangare, Ayouba Diarra and Ousmane Koita at the University of Bamako, Jules Gomis and Daouda Ndiaye at the University Cheikh Anta Diop in Dakar, Abdullahi Ahmad and Davis Nwakanma at the MRC in The Gambia, Clarissa Valim at the T.H. Chan Harvard School of Public Health, and Mary Lukowski at Study TRAX, and was supported by a Fulbright Scholar Award to DJK from 2009 to 2011.
Abstract: Background: Developing and sustaining a data collection and management system (DCMS) is difficult in malaria-endemic countries because of limitations in internet bandwidth, computer resources and numbers of trained personnel. The premise of this paper is that development of a DCMS in West Africa was a critically important outcome of the West African International Centers of Excellence for Malaria Research. The purposes of this paper are to make that information available to other investigators and to encourage the linkage of DCMSs to international research and Ministry of Health data systems and repositories. Methods: We designed and implemented a DCMS to link study sites in Mali, Senegal and The Gambia. This system was based on case report forms for epidemiologic, entomologic, clinical and laboratory aspects of plasmodial infection and malarial disease for a longitudinal cohort study and included on-site training for Principal Investigators and Data Managers. Based on this experience, we propose guidelines for the design and sustainability of DCMSs in environments with limited resources and personnel. Results: From 2012 to 2017, we performed biannual thick smear surveys for plasmodial infection, mosquito collections for anopheline biting rates and sporozoite rates, and year-round passive case detection for malarial disease in four longitudinal cohorts with 7708 individuals and 918 households in Senegal, The Gambia and Mali. Major challenges included the development of uniform definitions and reporting, assessment of data entry error rates, unstable and limited internet access, and software and technology maintenance. Strengths included entomologic collections linked to longitudinal cohort studies, on-site data centres and a cloud-based data repository. Conclusions: At a time when research on diseases of poverty in low and middle-income countries is a global priority, the resources available to ensure accurate data collection and the electronic availability of those data remain severely limited. Based on our experience, we suggest the development of a regional DCMS. This approach is more economical than separate data centres and has the potential to improve data quality by encouraging shared case definitions, data validation strategies and analytic approaches including the molecular analysis of treatment successes and failures.
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 2012AA012609).
Abstract: The rapid growth of structured data has presented new technological challenges in the research fields of big data and relational databases. In this paper, we present Banian, an efficient system for managing and analyzing petabyte-level (PB-level) structured data. Banian overcomes the storage structure limitation of relational databases and effectively integrates interactive query with large-scale storage management. It provides a uniform query interface for cross-platform datasets and thus shows favorable compatibility and scalability. Banian's system architecture mainly includes three layers: (1) a storage layer using HDFS for the distributed storage of massive data; (2) a scheduling and execution layer employing the splitting and scheduling technology of parallel databases; and (3) an application layer providing a cross-platform query interface and supporting standard SQL. We evaluate Banian using PB-level Internet data and the TPC-H benchmark. The results show that, compared with Hive, Banian improves query performance by up to 30 times and achieves better scalability and concurrency.
Funding: Supported by the National Natural Science Foundation of China (60673128).
Abstract: A partition checkpoint strategy based on data segment priority is presented to meet the timing constraints of the data and the transactions in embedded real-time main memory database systems (ERTMMDBS), as well as to reduce the number of transactions missing their deadlines and the recovery time. The partition checkpoint strategy takes into account the characteristics of the data and the transactions associated with it; moreover, it partitions the database according to the data segment priority and sets a corresponding checkpoint frequency for each partition for independent checkpoint operation. The simulation results show that the partition checkpoint strategy decreases the ratio of transactions missing their deadlines.
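The partitioning step described in this abstract can be sketched in Python as below. The abstract does not state how checkpoint frequency depends on priority, so the policy used here (interval = base interval divided by priority, meaning higher-priority partitions are checkpointed more often) is purely an assumption. The function names, segment names, and numbers are likewise hypothetical.

```python
from collections import defaultdict


def partition_by_priority(segments):
    """Group data segment names into partitions keyed by their priority level."""
    partitions = defaultdict(list)
    for name, priority in segments.items():
        partitions[priority].append(name)
    return dict(partitions)


def checkpoint_intervals(partitions, base_interval_ms=1000):
    """Assumed policy: a partition with priority p is checkpointed every base/p ms.

    Priorities must be positive integers in this sketch.
    """
    return {priority: base_interval_ms / priority for priority in partitions}


def due_partitions(intervals, last_checkpoint_ms, now_ms):
    """Return the partitions whose own checkpoint interval has elapsed."""
    return sorted(
        priority
        for priority, interval in intervals.items()
        if now_ms - last_checkpoint_ms[priority] >= interval
    )


# Hypothetical segments with priorities (larger number = higher priority).
segments = {"sensor": 4, "actuator": 4, "log": 1}
partitions = partition_by_priority(segments)
intervals = checkpoint_intervals(partitions)
last_checkpoint_ms = {priority: 0 for priority in partitions}

print(partitions)
print(intervals)
print(due_partitions(intervals, last_checkpoint_ms, now_ms=300))
```

Because each partition carries its own interval, a checkpoint of the high-priority partition can run without touching the low-priority one, which is the independence the abstract refers to.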