Funding: This work is supported by the National Institute of Environmental Health Sciences (NIEHS) Award 0255-0236-4609/1U2CES026555-01, IBM Research AI through the AI Horizons Network, and the CAPES Foundation Senior Internship Program Award 88881.120772/2016-01.
Abstract: It is common practice for data providers to publish data dictionaries that include a text description of each column in a data set. While these documents help an end user properly interpret the meaning of a column, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse data sets. In this paper, we present our Semantic Data Dictionary work in the context of biomedical data; however, the approach can be, and has been, used in a wide range of domains. Rendering data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project.
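As a rough illustration of what a machine-readable data dictionary entry could look like, the following Python sketch maps column names to semantic annotations and emits simple subject-predicate-object statements. The class name `ColumnAnnotation`, its fields, the `ex:` prefix, and the column names are all hypothetical; none of them is taken from the Semantic Data Dictionary specification itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ColumnAnnotation:
    """One data dictionary row: a column and the concepts it denotes.

    All field names here are illustrative assumptions.
    """
    column: str                      # column name as it appears in the data set
    attribute: str                   # concept the column measures (prefixed term)
    unit: Optional[str] = None       # unit of measure, if any
    entity: str = "ex:Participant"   # what the measurement is about


def to_statements(ann: ColumnAnnotation) -> List[Tuple[str, str, str]]:
    """Turn one annotation into (subject, predicate, object) statements."""
    subject = f"ex:column/{ann.column}"
    statements = [
        (subject, "ex:hasAttribute", ann.attribute),
        (subject, "ex:attributeOf", ann.entity),
    ]
    # The unit statement is only emitted when a unit was declared.
    if ann.unit is not None:
        statements.append((subject, "ex:hasUnit", ann.unit))
    return statements


# Hypothetical annotations for two illustrative columns.
dictionary = [
    ColumnAnnotation("age_years", "ex:Age", unit="ex:Year"),
    ColumnAnnotation("sex", "ex:BiologicalSex"),
]

for annotation in dictionary:
    for s, p, o in to_statements(annotation):
        print(s, p, o)
```

Because every column is described by the same small set of fields, two data sets annotated this way can be compared or merged by matching on the annotated concepts rather than on column names.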
Abstract: Data governance is becoming increasingly important in business and government. Well-governed data improves interactions between the employees of one or more organizations. Data quality is a major challenge because the cost of poor quality can be very high, so managing data quality becomes an absolute necessity within an organization. To improve data quality in a Big Data source, this paper adds semantics to the data and helps the user recognize the Big Data schema. The originality of the approach lies in its semantic aspect: it detects issues in the data and proposes a data schema by applying semantic data profiling.
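The abstract does not describe the profiling procedure, so the following Python sketch shows only a much simpler, purely syntactic form of column profiling: it proposes a type for a column from the majority of its values and flags the rows that deviate. The function names and type labels are hypothetical; a semantic profiler would additionally map columns to domain concepts.

```python
from collections import Counter
from typing import List, Tuple


def classify(value: str) -> str:
    """Assign a coarse type label to a raw string value."""
    text = value.strip()
    if text == "":
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "string"


def profile_column(values: List[str]) -> Tuple[str, List[int]]:
    """Return the dominant type and the row indexes that deviate from it."""
    labels = [classify(v) for v in values]
    present = [label for label in labels if label != "missing"]
    if not present:
        # Nothing to infer from: every row is an issue.
        return "unknown", list(range(len(values)))
    dominant = Counter(present).most_common(1)[0][0]
    # Missing values and values of another type both count as issues.
    issues = [i for i, label in enumerate(labels) if label != dominant]
    return dominant, issues


proposed_type, issue_rows = profile_column(["1", "2", "x", ""])
print(proposed_type, issue_rows)
```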
Funding: Supported by the National Key R&D Program of China (No. 2018YFB1003905), the National Natural Science Foundation of China under Grant (No. 61971032), and the Fundamental Research Funds for the Central Universities (No. FRF-TP-18-008A3).
Abstract: With the extensive application of collaborative software development technology, the processing of code data generated in programming scenarios has become a research hotspot. In collaborative programming, different users can submit code in a distributed way. Consistency of code grammar can be achieved through syntax constraints. However, when different users work on the same code in semantic development programming practices, differences between users inevitably lead to semantic conflicts in the data. This paper considers the characteristics of code segment data in a programming scenario. A code sequence is obtained by disassembling a code segment with lexical analysis. Following the traditional formulation of the data conflict problem, the code sequence is taken as the declared value object in conflict resolution. Through similarity analysis of code sequence objects, the concept of the deviation degree between the declared value object and the truth value object is proposed. A multi-truth discovery algorithm, called the multiple truth discovery algorithm based on deviation (MTDD), is proposed. Baseline methods such as Conflict Resolution on Heterogeneous Data, Voting-K, and MTRuths_Greedy are compared with it to verify the performance and precision of the proposed MTDD algorithm.
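The two preparatory steps named in the abstract, lexical analysis into a code sequence and similarity analysis between sequences, can be sketched in Python as follows. This is not the MTDD algorithm: the deviation used here (one minus the mean similarity to the other submissions) is an assumption made for illustration, and the user names and code snippets are hypothetical.

```python
import io
import tokenize
from difflib import SequenceMatcher
from typing import Dict, List

# Layout-only tokens that carry no lexical content.
SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}


def code_sequence(source: str) -> List[str]:
    """Disassemble a Python code segment into a sequence of token strings."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in SKIPPED]


def deviations(claims: Dict[str, str]) -> Dict[str, float]:
    """Score each submission by how far it sits from all the others.

    Assumed definition: 1 - mean SequenceMatcher ratio to the other sequences.
    """
    sequences = {user: code_sequence(src) for user, src in claims.items()}
    result = {}
    for user, seq in sequences.items():
        others = [s for u, s in sequences.items() if u != user]
        if not others:
            result[user] = 0.0
            continue
        mean_sim = sum(SequenceMatcher(None, seq, o).ratio()
                       for o in others) / len(others)
        result[user] = 1.0 - mean_sim
    return result


# Hypothetical conflicting submissions for the same code segment.
claims = {
    "alice": "total = a + b",
    "bob": "total = a+b",
    "carol": "print('hello')",
}
for user, score in deviations(claims).items():
    print(user, round(score, 3))
```

Comparing token sequences rather than raw text means that formatting differences, such as the spacing in the first two submissions, do not count as conflicts.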
Funding: Supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2016S1A5A2A03927725).
Abstract: Purpose: This research project aims to organize the archival information of traditional Korean performing arts in a semantic web environment. Key requirements that the archival records manager should consider when publishing and distributing gugak performing arts archival information in a semantic web environment are presented from the perspective of linked data. Design/methodology/approach: This study analyzes the metadata provided by the National Gugak Center's Gugak Archive, the search and browse menus of the Gugak Archive's website, and K-PAAN, the performing arts portal site. Findings: The importance of consistency, continuity, and systematicity, which are crucial qualities in traditional record management practices, is undiminished in a semantic web environment. However, a semantic web environment also requires new tools such as web identifiers (URIs), data models (RDF), and link information (interlinking). Research limitations: The scope of this study does not include practical implementation strategies for the archival records management system and website services. The suggestions also do not discuss issues related to copyright or policy coordination between related organizations. Practical implications: The findings of this study can assist records managers in converting a traditional performing arts information archive into a semantic web environment-based online archival service and system. They can also be useful for collaboration with records managers who are unfamiliar with relational or triple database systems. Originality/value: This study analyzed the metadata of the Gugak Archive and its online services to present practical requirements for managing and disseminating gugak performing arts information in a semantic web environment. By applying the principles and methods of semantic web services to the Gugak Archive, this study can contribute to the improvement of information organization and services in the field of Korean traditional music.
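The three tools listed in the findings (URIs, RDF, and interlinking) can be illustrated with a minimal Python sketch that writes two statements in N-Triples syntax. The predicates `dcterms:title` and `owl:sameAs` are standard vocabulary terms; every `example.org` identifier and the record title are hypothetical and do not come from the Gugak Archive.

```python
# Standard vocabulary terms (Dublin Core Terms and OWL).
DCT_TITLE = "http://purl.org/dc/terms/title"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def iri(value: str) -> str:
    """Wrap an identifier in N-Triples IRI brackets."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslash, quote, and newline."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one statement as an N-Triples line."""
    return f"{subject} {predicate} {obj} ."


# Hypothetical identifiers: one archive record and one external resource.
record = iri("http://example.org/gugak/performance/1")
lines = [
    # Data model (RDF): a descriptive statement about the record.
    triple(record, iri(DCT_TITLE), literal("Sample performance record")),
    # Interlinking: the record is declared identical to an external resource.
    triple(record, iri(OWL_SAME_AS), iri("http://example.org/external/item/1")),
]
print("\n".join(lines))
```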
Abstract: In this paper, a Graph-based semantic Data Model (GDM) is proposed with the primary objective of bridging the gap between the human perception of an enterprise and the need of the computing infrastructure to organize information in a particular manner for efficient storage and retrieval. GDM is proposed as an alternative data model that combines the advantages of the relational model with the positive features of semantic data models. It offers the designer a structural representation to interact with, making the complex relations among basic data items easy to comprehend. GDM allows an entire database to be viewed as a graph (V, E) in a layered organization. The graph is created in a bottom-up fashion, where V represents either the basic instances of data or a functionally abstracted module, called a primary semantic group (PSG) or a secondary semantic group (SSG). An edge in the model denotes a relationship among the secondary semantic groups. The lowest layer contains the semantically grouped data values in the form of primary semantic groups. SSGs are higher-level abstractions created by encapsulating various PSGs, SSGs, and basic data elements. This encapsulation continues to generate secondary semantic groups until the designer considers them sufficient to describe the actual problem domain. GDM thus uses the standard abstractions available in a semantic data model, with a structural representation in terms of a graph. The operations on the data model are formalized in the proposed graph algebra. A Graph Query Language (GQL) is also developed, maintaining similarity with the widely accepted, user-friendly SQL. Finally, the paper presents a methodology to make GDM compatible with a distributed environment and, for completeness, suggests a corresponding query processing technique for that environment.
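The layered organization described above can be sketched in Python as a pair of node types plus labeled edges. This is only an illustration of the encapsulation idea under assumptions: the class layout, the `layer` and `flatten` helpers, and the sample groups are hypothetical and are not the paper's graph algebra or GQL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class PSG:
    """Primary semantic group: semantically grouped basic data values."""
    name: str
    values: Dict[str, object]


@dataclass
class SSG:
    """Secondary semantic group: encapsulates PSGs and other SSGs."""
    name: str
    members: List[Union["PSG", "SSG"]] = field(default_factory=list)


def layer(node: Union[PSG, SSG]) -> int:
    """PSGs sit on layer 0; an SSG sits one layer above its highest member."""
    if isinstance(node, PSG):
        return 0
    return 1 + max((layer(m) for m in node.members), default=0)


def flatten(node: Union[PSG, SSG]) -> Dict[str, object]:
    """Collect all basic data values reachable from a node."""
    if isinstance(node, PSG):
        return {f"{node.name}.{key}": val for key, val in node.values.items()}
    collected: Dict[str, object] = {}
    for member in node.members:
        collected.update(flatten(member))
    return collected


# Hypothetical groups, built bottom-up.
name = PSG("Name", {"first": "Ada", "last": "Lovelace"})
address = PSG("Address", {"city": "London"})
customer = SSG("Customer", [name, address])
order = SSG("Order", [PSG("OrderLine", {"item": "book", "qty": 2})])
account = SSG("Account", [customer])

# Edges relate secondary semantic groups: (source, label, target).
edges: List[Tuple[str, str, str]] = [("Customer", "places", "Order")]

print(layer(customer), layer(account))
print(flatten(customer))
```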
Funding: VODAN-Africa, the Philips Foundation, the Dutch Development Bank FMO, CORDAID, and the GO FAIR Foundation supported this research.
Abstract: Research and development are gradually becoming data-driven, and the implementation of the FAIR Guidelines (that data should be Findable, Accessible, Interoperable, and Reusable) for scientific data administration and stewardship has the potential to remarkably enhance the framework for the reuse of research data. In this way, FAIR is aiding digital transformation. The ‘FAIRification’ of data increases the interoperability and (re)usability of data, so that new and robust analytical tools, such as machine learning (ML) models, can access the data to deduce meaningful insights, extract actionable information, and identify hidden patterns. This article aims to build a FAIR ML model pipeline using the generic FAIRification workflow to make the whole ML analytics process FAIR. Accordingly, FAIR input data was modelled using a FAIR ML model, and the output data from the FAIR ML model was also made FAIR. For this, a hybrid hierarchical k-means (HHK) clustering ML algorithm was applied to group the data into homogeneous subgroups and ascertain the underlying structure of the data, using a Nigerian-based FAIR dataset that contains data on economic factors, healthcare facilities, and coronavirus occurrences in all 36 states of Nigeria. The model showed that research data and the ML pipeline can be FAIRified, shared, and reused by following the proposed FAIRification workflow and implementing the technical architecture.
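The abstract does not specify its HHK variant. One common formulation of hybrid hierarchical k-means first runs agglomerative clustering to obtain k groups and then uses their centroids to seed ordinary k-means. The Python sketch below follows that formulation under stated assumptions: centroid linkage, Euclidean distance, and toy two-dimensional points rather than the Nigerian dataset.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def hierarchical_seeds(points: Sequence[Point], k: int) -> List[Point]:
    """Agglomerative step: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def kmeans(points: Sequence[Point], centers: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Lloyd iterations starting from the given centers."""
    centers = list(centers)
    labels: List[int] = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster loses all its members.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def hybrid_hierarchical_kmeans(points: Sequence[Point], k: int):
    """Seed k-means with centroids from the hierarchical step."""
    return kmeans(points, hierarchical_seeds(points, k))


# Toy data: two well-separated groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = hybrid_hierarchical_kmeans(toy, 2)
print(labels)
```

Seeding from the hierarchical step makes the result deterministic, which avoids the sensitivity of plain k-means to random initial centers.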
Funding: Supported by Discovery grant No. RGPIN 2014-05254 from the Natural Sciences and Engineering Research Council (NSERC), Canada.
Abstract: Cloud monitoring is a source of big data that is constantly produced from traces of infrastructures, platforms, and applications. Analysis of monitoring data delivers insights into the system's workload and usage patterns and helps ensure that workloads operate at optimum levels. The analysis process involves data query and extraction, data analysis, and result visualization. Since the volume of monitoring data is large, these operations require a scalable and reliable architecture to extract, aggregate, and analyze data at an arbitrary range of granularity. Ultimately, the results of analysis become knowledge about the system and should be shared and communicated. This paper presents our cloud service architecture that explores a search cluster for data indexing and query. We develop REST APIs through which different analysis modules can access the data. This architecture enables extensions that integrate with software frameworks for both batch processing (such as Hadoop) and stream processing (such as Spark) of big data. The analysis results are structured in Semantic MediaWiki pages in the context of the monitoring data source and the analysis process. This cloud architecture is empirically assessed to evaluate its responsiveness when processing a large set of data records under node failures.
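Aggregation at an arbitrary granularity amounts to grouping timestamped records into fixed-width time buckets. The Python sketch below shows the idea in memory, as a stand-in for what an aggregation query against the search cluster would compute; the record layout (epoch seconds, metric value) and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def aggregate(records: Iterable[Tuple[int, float]],
              granularity_s: int) -> Dict[int, float]:
    """Average metric values per time bucket of the given width in seconds."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in records:
        # Round the timestamp down to the start of its bucket.
        bucket_start = timestamp - timestamp % granularity_s
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical CPU-usage samples: (epoch seconds, percent).
samples = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0)]

print(aggregate(samples, 60))    # one-minute buckets
print(aggregate(samples, 120))   # two-minute buckets
```

The same records answer both queries, so the granularity can be chosen at analysis time rather than fixed at collection time.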
Abstract: The benefits of modeling and simulation in engineering are widely acknowledged. It has proven its advantages in, for example, virtual prototyping (that is, simulation-aided design and testing) as well as in training and R&D. It is recognized as a tool for modern decision making. However, several factors still slow the wider use of modeling and simulation in companies. Modeling and simulation tools are separate from, and not an integrated part of, the other engineering information management in company networks. They do not integrate well enough with the CAD, PLM/PDM, and control systems in use. Co-use of the simulation tools themselves is poor, and the whole modeling process is often considered too laborious. In this article, we introduce an integration solution for modeling and simulation based on the semantic data modeling approach. Semantic data modeling and ontology mapping techniques have been used in database system integration, but the novelty of this work lies in applying these techniques in the domain of modeling and simulation. The benefits and drawbacks of the chosen approach are discussed. Furthermore, we describe real industrial project cases where this new approach has been applied.