Abstract: CHDTEPDB (URL: http://chdtepdb.com/) is a manually integrated database for congenital heart disease (CHD) that stores CHD expression profiling data derived from published papers, aiming to provide rich resources for investigating the correlation between human CHD and aberrant transcriptome expression. The development of human diseases involves important regulatory roles of RNAs, and expression profiling data can reflect the underlying etiology of inherited diseases. Hence, collecting and compiling expression profiling data is of critical significance for a comprehensive understanding of the mechanisms and functions that underpin genetic diseases. CHDTEPDB stores expression profiles from over 200 sample sets spanning 7 types of CHD and provides users with convenient basic analytical functions. Because datasets differ in clinical indicators such as disease type and contain unavoidable detection errors, users can customize their selection of data for personalized analysis. Moreover, a submission page allows researchers to submit their own data, so that additional expression profiles and other histological data can be added to the database. CHDTEPDB offers a user-friendly interface that allows users to quickly browse, retrieve, download, and analyze their target samples. CHDTEPDB will significantly improve the current knowledge of expression profiling data in CHD and has the potential to become an important tool for future research on the disease.
Funding: This work was funded by King Saud University through Researchers Supporting Project Number (RSP-2021/387), King Saud University, Riyadh, Saudi Arabia.
Abstract: The effectiveness of a Business Intelligence (BI) system depends mainly on the quality of the knowledge it produces. The decision-making process is hindered, and the user's trust is lost, if the knowledge offered is undesired or of poor quality. A Data Warehouse (DW) is a huge collection of data gathered from many sources and an important part of any BI solution that assists management in making better decisions. The Extract, Transform, and Load (ETL) process is the backbone of a DW system, and it is responsible for moving data from source systems into the DW system. The more mature the ETL process, the more reliable the DW system. In this paper, we propose the ETL Maturity Model (EMM), which assists organizations in achieving a high-quality ETL system and thereby enhances the quality of the knowledge produced. The EMM is made up of five levels of maturity, i.e., Chaotic, Acceptable, Stable, Efficient, and Reliable. Each level of maturity contains Key Process Areas (KPAs) that have been endorsed by industry experts and include all critical features of a good ETL system. Quality Objectives (QOs) are defined procedures that, when implemented, result in a high-quality ETL process. Each KPA has its own set of QOs, the execution of which meets the requirements of that KPA. Multiple brainstorming sessions with relevant industry experts helped to refine the model. EMM was deployed in two key projects using multiple case studies to supplement the validation process and support our claim. This model can assist organizations in improving their current ETL process and transforming it into a more mature ETL system. It can also provide high-quality information to assist users in making better decisions and to gain their trust.
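As a concrete illustration of the ETL process that the EMM is meant to mature, the following is a minimal sketch in Python; the file name, column names, and SQLite target are assumptions for illustration only and are not part of the EMM paper.

```python
# Minimal ETL sketch: extract from a CSV source, transform, load into SQLite.
# All file, table, and column names are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean values and drop records failing basic quality checks."""
    cleaned = []
    for r in rows:
        try:
            cleaned.append((r["customer_id"].strip(), float(r["amount"])))
        except (KeyError, ValueError):
            continue  # a mature ETL process would log or quarantine bad rows
    return cleaned

def load(rows, db_path="dw.sqlite"):
    """Load: write transformed rows into a data-warehouse table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("sales_source.csv")))
```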
Abstract: Universities collect and generate a considerable amount of data on students throughout their academic career. Currently in South Kivu, most universities have an information system in the form of a database made up of several disparate files, which makes it difficult to use this data efficiently and profitably. The aim of this study is to develop this transactional, database-based information system into a data-warehouse-oriented system. This tool will be able to collect, organize, and archive data on each student's career path, year after year, and transform it for analysis purposes. In the age of Big Data, a number of artificial intelligence techniques have been developed that make it possible to extract useful information from large databases, and this extracted information is of paramount importance in decision-making. For example, it can be used to predict which stream a student should choose when applying to university. To develop our contribution, we analyzed the IT information systems used in the various universities and applied the bottom-up method to design our data warehouse model, using the relational model to design the data warehouse.
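To make the relational, bottom-up design concrete, here is a minimal sketch of a hypothetical star schema for student career data, written in Python with SQLite; the table and column names are illustrative assumptions, not the schema used in the study.

```python
# Hypothetical star schema for a student data warehouse (illustrative only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe students, programmes, and academic years.
CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, name TEXT, gender TEXT);
CREATE TABLE dim_program (program_id INTEGER PRIMARY KEY, faculty TEXT, stream TEXT);
CREATE TABLE dim_year    (year_id INTEGER PRIMARY KEY, academic_year TEXT);

-- Fact table records one row per student, programme, and year, with measures.
CREATE TABLE fact_enrollment (
    student_id INTEGER REFERENCES dim_student(student_id),
    program_id INTEGER REFERENCES dim_program(program_id),
    year_id    INTEGER REFERENCES dim_year(year_id),
    average_grade REAL,
    credits_earned INTEGER
);
""")

# Analytical query: mean grade per stream and year, the kind of aggregate
# a bottom-up data mart would serve to decision makers.
query = """
SELECT p.stream, y.academic_year, AVG(f.average_grade)
FROM fact_enrollment f
JOIN dim_program p ON p.program_id = f.program_id
JOIN dim_year y    ON y.year_id = f.year_id
GROUP BY p.stream, y.academic_year;
"""
print(conn.execute(query).fetchall())
```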
Abstract: This paper presents a methodology driven by database constraints for designing and developing (database) software applications. Much needed and yielding excellent results, this paradigm guarantees the highest possible quality of the managed data. The proposed methodology is illustrated with an easy-to-understand, yet complex, medium-sized genealogy software application driven by more than 200 database constraints, which fully meets such expectations.
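As a hedged illustration of what constraint-driven design means in practice, the sketch below declares a few plausible genealogy constraints in SQLite via Python; the table, columns, and rules are hypothetical examples, not the 200+ constraints of the paper's application.

```python
# Hypothetical genealogy table whose integrity is enforced by declared constraints.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE person (
    person_id   INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL,
    birth_year  INTEGER NOT NULL CHECK (birth_year > 0),
    death_year  INTEGER CHECK (death_year IS NULL OR death_year >= birth_year),
    mother_id   INTEGER REFERENCES person(person_id),
    father_id   INTEGER REFERENCES person(person_id),
    -- a person cannot be their own parent
    CHECK (mother_id IS NULL OR mother_id <> person_id),
    CHECK (father_id IS NULL OR father_id <> person_id)
);
""")

conn.execute("INSERT INTO person VALUES (1, 'Ada', 1815, 1852, NULL, NULL)")
try:
    # violates the death_year >= birth_year constraint and is rejected by the DBMS
    conn.execute("INSERT INTO person VALUES (2, 'Bob', 1900, 1890, NULL, NULL)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```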
Funding: Projects (41572317, 51374242) supported by the National Natural Science Foundation of China; Project (2015CX005) supported by the Innovation-Driven Plan of Central South University, China.
Abstract: Data organization requires high efficiency for the large amounts of data used in a digital mine system. A new method of storing massive block-model data is proposed to meet the required database characteristics, including ACID compliance, concurrency support, data sharing, and efficient access. Each block model is organized as a linear octree and stored in LMDB (Lightning Memory-Mapped Database). The geological attribute at any point of 3D space can be queried through a comparison algorithm on location codes and a conversion algorithm from the address code of geometry space to the location code used for storage. The performance and robustness of querying geological attributes over a 3D spatial region are greatly enhanced by transforming the problem from 3D to 2D and using 2D grid scanning to screen inner and outer points. Experimental results showed that this method can handle the massive block-model data while meeting the database characteristics. The method with LMDB is at least 3 times faster than that with etree, especially for reads; moreover, the larger the amount of data processed, the more efficient the method becomes.
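To illustrate the storage idea, the sketch below interleaves 3D block coordinates into a Morton-style location code and uses it as an LMDB key via the py-lmdb package; the encoding, key layout, and attribute payload are simplifying assumptions rather than the paper's exact algorithms.

```python
# Simplified sketch: Morton-style location codes as LMDB keys for block attributes.
# Requires the py-lmdb package: pip install lmdb
import lmdb

def location_code(x, y, z, bits=10):
    """Interleave the bits of (x, y, z) into a single integer location code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

env = lmdb.open("block_model_db", map_size=1 << 30)  # 1 GiB memory map

# Write: store a geological attribute (e.g. ore grade) keyed by location code.
with env.begin(write=True) as txn:
    key = location_code(12, 7, 3).to_bytes(8, "big")
    txn.put(key, b"grade=2.35")

# Read: convert a query point's address back to a location code and look it up.
with env.begin() as txn:
    value = txn.get(location_code(12, 7, 3).to_bytes(8, "big"))
    print(value)  # b'grade=2.35'
```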
Abstract: Expenditure on wells constitutes a significant part of the operational costs of a petroleum enterprise, and most of this cost results from drilling. This has prompted drilling departments to continuously look for ways to reduce their drilling costs and be as efficient as possible. A system called the Drilling Comprehensive Information Management and Application System (DCIMAS) is developed and presented here, with the aim of collecting, storing, and making full use of the valuable well data and information relating to all drilling activities and operations. The DCIMAS comprises three main parts: a data collection and transmission system, a data warehouse (DW) management system, and an integrated platform of core applications. With the support of the application platform, the DW management system is introduced, whereby operation data are captured at well sites and transmitted electronically to a data warehouse via transmission equipment and ETL (extract, transform, and load) tools. With the high quality of the data guaranteed, our central task is to make the best use of the operation data and information for drilling analysis and to provide further information to guide later production stages. Applications have been developed and integrated on a uniform platform to interface directly with different layers of the multi-tier DW. Now, engineers in every department spend less time on data handling and more time on applying technology in their real work with the system.
Abstract: A uniform metadata representation is introduced for heterogeneous databases, multimedia information, and other information sources. Some features of metadata are analyzed, and the limitations of existing metadata models are compared with the new one. The metadata model is described in XML, which is well suited to metadata denotation and exchange. Well-structured data, semi-structured data, and unstructured external file data are all described by the metadata model. The model provides feasibility and extensibility for constructing a uniform metadata model for a data warehouse.
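As a hedged illustration (the element and attribute names are invented for this sketch and are not taken from the paper's model), a uniform XML metadata record describing one heterogeneous source might be built like this:

```python
# Build a hypothetical uniform XML metadata record for a heterogeneous source.
import xml.etree.ElementTree as ET

record = ET.Element("metadata", source_type="relational_database")
ET.SubElement(record, "name").text = "sales_db.orders"
ET.SubElement(record, "location").text = "jdbc://warehouse-host/sales_db"
ET.SubElement(record, "structure").text = "well-structured"  # vs. semi-/unstructured

schema = ET.SubElement(record, "schema")
for col, typ in [("order_id", "integer"), ("amount", "decimal"), ("note", "text")]:
    ET.SubElement(schema, "field", name=col, type=typ)

# The serialized XML is what would be exchanged between warehouse components.
print(ET.tostring(record, encoding="unicode"))
```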
Abstract: This paper introduces in detail the work of building the Chinese national database for materials life cycle assessment (MLCA), Sinocenter, and of developing the environmental burden dataset of materials. The MLCA database was built in 2004, and its basic framework mainly includes the LCA methodology and materials environmental datasets covering energy consumption, resource input, and environmental emissions. The database now contains about fifty thousand records for the main materials industries, such as cement, iron and steel, and nonferrous metals, and also includes primary LCI data for fossil fuels and the electricity grid in China. At the same time, work on localizing the LCA method is ongoing; for instance, resource characterization factors have been calculated for 42 kinds of metal and 58 kinds of nonmetal, and impact factors for some heavy metals in water have also been obtained. Based on the database, the iron and steel dataset has been developed with data quality analysis, and some environmental burden data will be queryable on our website, www.cnmlca.com, in the future. Lastly, following the framework of the ISO 14040 series of standards, a prototype of the Chinese LCA evaluation system was developed to support LCA evaluation of materials and products in China.
Funding: Supported by the Open Research Fund Program of LIESMARS (WKL(00)0302).
Abstract: Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and it has many application fields, such as statistical data analysis, pattern recognition, and image processing. We combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and two sampling-based DBSCAN (SDBSCAN) algorithms are developed. One algorithm introduces sampling inside DBSCAN, and the other applies a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
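A minimal sketch of the "sampling outside DBSCAN" idea, assuming scikit-learn and a simple nearest-core-point assignment for the unsampled points; this is an illustrative variant, not the exact SDBSCAN algorithms of the paper.

```python
# Sampling-based DBSCAN sketch: cluster a sample, then assign the rest of the data.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5000, 2)), rng.normal(3, 0.3, (5000, 2))])

eps = 0.3
sample_idx = rng.choice(len(X), size=1000, replace=False)   # cluster only a sample
db = DBSCAN(eps=eps, min_samples=10).fit(X[sample_idx])

# Assign every remaining point to the cluster of its nearest core sample,
# or mark it as noise (-1) if that core sample is farther than eps away.
core_points = db.components_
core_labels = db.labels_[db.core_sample_indices_]
nn = NearestNeighbors(n_neighbors=1).fit(core_points)
dist, idx = nn.kneighbors(X)
labels = np.where(dist[:, 0] <= eps, core_labels[idx[:, 0]], -1)
print("clusters found:", set(labels) - {-1})
```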
Funding: Supported by the National Social Science Foundation of China (No. 16BGL183).
Abstract: Many high-quality studies have emerged from public databases, such as the Surveillance, Epidemiology, and End Results (SEER) program, the National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and the Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity, among other characteristics, so their value is not fully exploited. Data-mining technology has become a frontier field in medical research, as it performs well in evaluating patient risks and assisting clinical decision-making when building disease-prediction models. Therefore, data mining has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduces the main medical public databases and describes the steps, tasks, and models of data mining in simple language, together with data-mining methods and their practical applications. The goal of this work is to help clinical researchers gain a clear and intuitive understanding of how data-mining technology is applied to clinical big data, in order to promote research results that benefit doctors and patients.
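As a hedged, self-contained illustration of the disease-prediction modelling the article discusses (using synthetic data and scikit-learn, not any of the named public databases):

```python
# Toy disease-prediction workflow on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic cohort: 20 clinical features, binary outcome (disease / no disease).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = model.predict_proba(X_test)[:, 1]          # estimated patient risk
print("AUC:", round(roc_auc_score(y_test, risk), 3))
```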
Abstract: The data warehouse (DW), a technology invented in the 1990s, is more useful than the traditional database for integrating and analyzing massive data. Its application in the geology field can be divided into 3 phases: in 1992-1996, the commercial data warehouse (CDW) appeared; in 1996-1999, the geological data warehouse (GDW) appeared, and geologists and geographers realized the importance of the DW and began studying it, although practical DWs still followed the framework of the database; from 2000 to the present, the geological data warehouse has grown and the theory of the geo-spatial data warehouse (GSDW) has been developed, but research in the geological area remains deficient except in geography. Although some progress has been made on the GDW, its core still follows the CDW practice of organizing data by time, which brings about 3 problems: it is difficult to integrate geological data, because the data are characterized more by space than by time; it is hard to store the massive data at different levels, for the same reason; and it can hardly support spatial analysis if the data are organized by time as in a CDW. Therefore, the GDW should be redesigned: data should be organized by scale in order to store massive data at different levels and synthesize data at different granularities, and spatial control points should replace the former temporal control points so as to integrate different types of data by storing each type as one layer and then superposing the layers. In addition, the data cube, a widely used technology in the CDW, is of little use in the GDW, because causality among geological data is not as obvious as in commercial data; the data are the mixed result of many complex rules, and their analysis always requires special geological methods and software. On the other hand, a data cube for massive and complex geo-data would devour too much storage space to be practical. On this point, the main purpose of the GDW may be data integration, unlike the CDW, which is oriented toward data analysis.