Abstract: Characterized by self-monitoring and agile adaptation to fast-changing dynamics in complex production environments, smart manufacturing as envisioned under Industry 4.0 aims to improve the throughput and reliability of production beyond the state of the art. While the widespread application of deep learning (DL) has opened up new opportunities to accomplish this goal, data quality and model interpretability continue to present a roadblock to the widespread acceptance of DL in real-world applications. This has motivated research on two fronts: data curation, which aims to provide quality data as input for meaningful DL-based analysis, and model interpretation, which intends to reveal the physical reasoning underlying DL model outputs and promote trust from users. This paper summarizes several key techniques in data curation, where breakthroughs in data denoising, outlier detection, imputation, balancing, and semantic annotation have demonstrated effectiveness in extracting information from noisy, incomplete, insufficient, and/or unannotated data. Also highlighted are model interpretation methods that address the "black-box" nature of DL and move toward model transparency.
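Two of the curation steps this abstract names, outlier detection and imputation, can be illustrated together. The sketch below is a minimal, generic example (not the paper's method), assuming a hypothetical sensor trace; it flags spikes with the common 1.5×IQR heuristic and fills flagged or missing readings with the median of the remaining data.

```python
import statistics

def iqr_outlier_mask(values):
    """Flag readings outside 1.5*IQR of the observed (non-missing) data."""
    obs = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(obs, n=4)  # quartiles of the observed data
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v is not None and (v < lo or v > hi) for v in values]

def impute_with_median(values, mask):
    """Replace flagged or missing readings with the median of the rest."""
    keep = [v for v, bad in zip(values, mask) if v is not None and not bad]
    med = statistics.median(keep)
    return [med if (v is None or bad) else v for v, bad in zip(values, mask)]

# Hypothetical sensor trace with one spike (9.5) and one missing reading (None)
signal = [1.0, 1.1, 0.9, 1.2, 9.5, 1.0, None, 1.1]
mask = iqr_outlier_mask(signal)
clean = impute_with_median(signal, mask)
```

Real pipelines would use model-based denoising or learned imputation rather than a single robust statistic, but the flag-then-fill structure is the same.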
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences [XDB38030201, XDB38030400, XDB38050300] and the Youth Innovation Promotion Association of the Chinese Academy of Sciences [2019104].
Abstract: Genome data of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are essential for virus diagnosis, vaccine development, and variant surveillance. To archive and integrate worldwide SARS-CoV-2 genome data, a series of resources have been constructed, serving as a fundamental infrastructure for SARS-CoV-2 research, pandemic prevention and control, and coronavirus disease 2019 (COVID-19) therapy. Here we present an overview of extant SARS-CoV-2 resources devoted to genome data deposition and integration. We review deposition resources in terms of data accessibility, metadata standardization, and data curation and annotation, and integrative resources in terms of data sources, de-redundancy processing, data curation and quality assessment, and variant annotation. Moreover, we address issues that impede SARS-CoV-2 genome data integration, including low-complexity, inconsistent, or absent isolate names; sequence inconsistency; asynchronous updates of genome data; and mismatched metadata. We finally provide insights into data standardization consensus and data submission guidelines to promote SARS-CoV-2 genome data sharing and integration.
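One de-redundancy step such integrative resources perform is collapsing byte-identical genome sequences deposited under different isolate names. The sketch below is a generic illustration (not any specific resource's pipeline), using hypothetical isolate records: sequences are normalized to upper case, hashed, and duplicates mapped back to a first-seen representative.

```python
import hashlib

def dedupe_exact(records):
    """Collapse records whose normalized sequences are identical,
    keeping the first-seen isolate and recording duplicate names."""
    seen = {}     # sequence digest -> representative isolate name
    aliases = {}  # representative  -> names of its exact duplicates
    for name, seq in records:
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest in seen:
            aliases.setdefault(seen[digest], []).append(name)
        else:
            seen[digest] = name
    return list(seen.values()), aliases

# Hypothetical isolate records: (isolate name, sequence)
records = [
    ("isolate/A", "ATGGTTAAC"),
    ("isolate/B", "atggttaac"),  # same sequence, different case
    ("isolate/C", "ATGGTTAAT"),  # one-base difference, kept
]
reps, aliases = dedupe_exact(records)
```

Production pipelines additionally handle near-duplicates (e.g. sequences differing only in ambiguous bases or trimmed ends), which exact hashing cannot catch.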
Abstract: The FAIR data guiding principles have recently been developed and widely adopted to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets in the face of an exponential increase in data volume and complexity. The FAIR data principles are formulated at a general level, and their technological implementation remains up to the industries and organizations working to maximize the value of their data. Here, we describe the data management and curation methodologies and best practices developed for FAIRification of clinical exploratory biomarker data collected from over 250 clinical studies. We discuss the data curation effort involved, the resulting output, and the business and scientific impact of our work. Finally, we propose prospective planning for FAIR data to optimize data management efforts and maximize data value.
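FAIRification in practice often starts with enforcing a minimal, machine-readable metadata record per dataset: a persistent identifier (Findable), an access URL and license (Accessible, Reusable), and controlled keywords (Interoperable). The field names and the DOI below are hypothetical, not the schema used in the work described; this is only a sketch of the validate-before-deposit pattern.

```python
# Hypothetical minimal metadata requirements for a FAIRified dataset.
REQUIRED_FIELDS = {"identifier", "title", "access_url", "license", "keywords"}

def missing_fields(record):
    """Return the set of required metadata fields absent from a record."""
    return REQUIRED_FIELDS - record.keys()

record = {
    "identifier": "doi:10.0000/example-dataset",  # hypothetical DOI
    "title": "Exploratory biomarker measurements, study 001",
    "access_url": "https://repository.example.org/datasets/001",
    "license": "CC-BY-4.0",
    "keywords": ["biomarker", "clinical-study"],
}
gaps = missing_fields(record)  # empty set when the record is complete
```

Rejecting deposits with non-empty `gaps` keeps curation effort front-loaded, which is broadly the prospective-planning point the abstract makes.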
Funding: Supported by the "100-Talent Program" of the Chinese Academy of Sciences, the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB13040500), and the National High-tech R&D Program (863 Program; Grant No. 2012AA020409) of the Ministry of Science and Technology of China, awarded to ZZ.
Abstract: The completion of the Human Genome Project lays a foundation for systematically studying the human genome, from evolutionary history to precision medicine against diseases. With the explosive growth of biological data, an increasing number of biological databases have been developed in aid of human-related research. Here we present a collection of human-related biological databases and provide a mini-review by classifying them into different categories according to their data types. As human-related databases continue to grow not only in count but also in volume, challenges lie ahead in big-data storage, processing, exchange, and curation.
Abstract: Data repository infrastructures for academics have appeared in waves since the dawn of Web technology. These waves are driven by changes in societal needs, archiving needs, and the development of cloud computing resources. As such, the data repository landscape has many flavors when it comes to sustainability models, target audiences, and feature sets. One thing that links all data repositories is a desire to make the content they host reusable, building on the core principles of cataloging content for economy and research speed. The FAIR principles are a common goal for all repository infrastructures to aim for. No matter the discipline or infrastructure, the goal of reusable content, for both humans and machines, is a shared one. This is the first time that repositories can work toward a common goal that ultimately lends itself to interoperability. The idea that research can move further and faster as we un-silo these resources is an achievable one. This paper investigates the steps that existing repositories need to take in order to remain useful and relevant in a FAIR research world.