Data-intensive science is a reality in large scientific organizations such as the Max Planck Society, but because our data practices are inefficient when it comes to integrating data from different sources, many projects cannot be carried out and many researchers are excluded. Since surveys indicate that about 80% of the time in data-intensive projects is wasted, we must conclude that we are not fit for the challenges that will come with billions of smart devices producing continuous streams of data: our methods do not scale. Experts worldwide are therefore looking for strategies and methods with potential for the future. The first steps have been made: there is now wide agreement, from the Research Data Alliance to the FAIR principles, that data should be associated with persistent identifiers (PIDs) and metadata (MD). In fact, after 20 years of experience we can claim that trustworthy PID systems are already in broad use. It is argued, however, that assigning PIDs is just the first step. If we agree to assign PIDs and also use the PID record to store important relationships, such as pointers to the locations where the bit sequences or various metadata can be accessed, we are close to defining Digital Objects (DOs), which could indeed point to a solution for some of the basic problems in data management and processing. In addition to standardizing the way we assign PIDs, metadata, and other state information, we could define a Digital Object Access Protocol as a universal exchange protocol for DOs stored in repositories that use different data models and data organizations. We could also associate a type with each DO, together with a set of operations allowed to work on its content, which would pave the way to automatic processing, identified as the major step toward scalability in data science and data industry. A globally connected group of experts is now working on establishing testbeds for a DO-based data infrastructure.
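The combination described above (a PID, pointers to bit sequences and metadata, and a registered type that determines the allowed operations) can be sketched as a small data structure. This is an illustrative sketch only; the field names, type names, and operations are hypothetical and are not taken from any published Digital Object specification.

```python
from dataclasses import dataclass, field

# Hypothetical model of a Digital Object: a PID plus pointers to where
# the bit sequences and metadata live, and a type governing operations.
@dataclass
class DigitalObject:
    pid: str                                            # persistent identifier, e.g. Handle-style
    do_type: str                                        # registered type of the object
    bit_locations: list = field(default_factory=list)   # where the bit sequences can be accessed
    metadata_locations: list = field(default_factory=list)

# A type registry maps each DO type to the operations permitted on its
# content; this typing is what makes automatic processing feasible.
OPERATIONS = {
    "text/corpus": ["retrieve", "tokenize", "index"],
    "image/scan": ["retrieve", "thumbnail"],
}

def allowed_operations(obj: DigitalObject) -> list:
    """Look up which operations a processor may invoke on this DO."""
    return OPERATIONS.get(obj.do_type, ["retrieve"])

do = DigitalObject(
    pid="21.T11148/example",   # hypothetical Handle-style PID
    do_type="text/corpus",
    bit_locations=["https://repo.example.org/data/42"],
)
print(allowed_operations(do))  # ['retrieve', 'tokenize', 'index']
```

A machine agent that only knows the PID can fetch the DO record, read its type, and decide automatically which operations apply, without knowing the repository's internal data model.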
Persistent identifiers for research objects, researchers, organizations, and funders are the key to creating unambiguous and persistent connections across the global research infrastructure (GRI). Many repositories are implementing mechanisms to collect and integrate these identifiers into their submission and record-curation processes. This bodes well for a well-connected future, but metadata for resources submitted in the past are missing these identifiers, and thus the connections required for inclusion in the connected infrastructure. Re-curation of these metadata is required to make those connections. This paper introduces the global research infrastructure and demonstrates how repositories, and their user communities, can contribute to and benefit from connections to it. The Dryad Data Repository has existed since 2008 and has successfully re-curated its metadata several times, adding identifiers for research organizations, funders, and researchers. Understanding and quantifying these successes depends on measuring repository and identifier connectivity; metrics are described and applied to the entire repository here. Identifiers (Digital Object Identifiers, DOIs) for papers connected to datasets in Dryad have long been a critical part of Dryad's metadata creation and curation processes. Since 2019, the portion of datasets with connected papers has decreased from 100% to less than 40%. This decrease has significant ramifications for the re-curation efforts described above, as connected papers have been an important source of metadata. In addition, missing connections to papers make understanding and reusing datasets more difficult. Connections between datasets and papers can be difficult to make because of time lags between submission and publication, the lack of clear mechanisms for citing datasets and other research objects from papers, the changing focus of researchers, and other obstacles. The Dryad community of members, i.e. users, research institutions, publishers, and funders, has vested interests in identifying these connections and critical roles in the curation and re-curation efforts. Their engagement will be critical in building on the successes Dryad has already achieved and in ensuring sustainable connectivity in the future.
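The kind of connectivity metric described above can be sketched as the share of dataset records whose metadata carries a given identifier type (paper DOI, ORCID, ROR, funder ID). This is a minimal sketch under the assumption that records are flat dictionaries; the field names and sample values are hypothetical, not Dryad's actual schema.

```python
# Fraction of records with a non-empty value for one identifier field.
def connectivity(records, id_field):
    """Share of records connected via id_field, in [0.0, 1.0]."""
    if not records:
        return 0.0
    connected = sum(1 for r in records if r.get(id_field))
    return connected / len(records)

# Fabricated sample records for illustration only.
datasets = [
    {"doi": "10.5061/dryad.x1", "paper_doi": "10.1000/j.1", "orcid": "0000-0002-0000-0001"},
    {"doi": "10.5061/dryad.x2", "paper_doi": None,          "orcid": "0000-0001-0000-0002"},
    {"doi": "10.5061/dryad.x3", "paper_doi": "10.1000/j.2", "orcid": None},
]

print(connectivity(datasets, "paper_doi"))  # 2 of 3 records connected, ≈ 0.67
```

Applied per identifier type across the whole repository, such a ratio is what lets a trend like "connected papers fell from 100% to below 40% since 2019" be tracked over time.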
Domain repositories, i.e. repositories that store, manage, and persist data pertaining to a specific scientific domain, are common and growing in the research landscape. Many of these repositories develop close, long-term communities made up of individuals and organizations that collect, analyze, and publish results based on the data in the repositories. Connections between these datasets, papers, people, and organizations are an important part of the knowledge infrastructure surrounding the repository. All of these research objects, people, and organizations can now be identified using unique and persistent identifiers (PIDs), and domain repositories can build on their existing communities to facilitate and accelerate the identifier adoption process. As community members contribute to multiple datasets and articles, identifiers for them, once found, can be reused many times. We explore this idea by defining a connectivity metric and applying it to datasets collected and papers published by members of the UNAVCO community. Finding identifiers in DataCite and Crossref metadata and spreading those identifiers through the UNAVCO DataCite metadata can increase connectivity from less than 10% to close to 50% for people and organizations.
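The "spreading" step, where an identifier found once is reused across every record naming the same contributor, can be sketched as below. Matching by plain name string is a deliberate simplification for illustration; real re-curation needs name disambiguation, and the record layout here is hypothetical rather than the actual DataCite schema.

```python
# Spread identifiers through repository metadata: once an ORCID is found
# for a contributor on any record, apply it to every other record that
# names the same contributor but lacks the identifier.
def spread_identifiers(records):
    known = {}  # name -> ORCID learned from any record
    for rec in records:
        for creator in rec["creators"]:
            if creator.get("orcid"):
                known[creator["name"]] = creator["orcid"]
    for rec in records:
        for creator in rec["creators"]:
            if not creator.get("orcid") and creator["name"] in known:
                creator["orcid"] = known[creator["name"]]
    return records

# Fabricated sample: the second record gains the ORCID by spreading.
recs = [
    {"creators": [{"name": "A. Doe", "orcid": "0000-0001-0000-0002"}]},
    {"creators": [{"name": "A. Doe"}]},
]
spread_identifiers(recs)
print(recs[1]["creators"][0]["orcid"])  # 0000-0001-0000-0002
```

Because community members contribute to many datasets, one found identifier can raise connectivity across dozens of records at once, which is how the jump from under 10% to close to 50% becomes possible.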
This paper gives a brief introduction to the workflow-management platform Flowable and how it is used for textual-data management. Flowable is relatively new, with its first release on 13 October 2016; despite its short time on the market, it has been quickly noticed, with 4.6 thousand stars on GitHub at the time of writing. The focus of our project is to build a platform for large-scale text analysis by including many different text resources. Currently, we have successfully connected to four different text resources and obtained more than one million works. Some resources are dynamic, meaning they may add new data or modify their current data; it is therefore necessary to keep our data, both the metadata and the raw data, up to date with the resources. In addition, to comply with the FAIR principles, each work is assigned a persistent identifier (PID) and indexed for search. In the last step, we perform some standard analyses on the data to enhance our search engine and to generate a knowledge graph. End users can use our platform to search our data or access the knowledge graph. Furthermore, they can submit code for their own analyses to the system; the code is executed on a High-Performance Cluster (HPC), and users receive the results later. Here, Flowable can take advantage of PIDs for digital-object identification and management to facilitate communication with the HPC system. As one may already have noticed, the whole process can be expressed as a workflow. A workflow, including error handling and notification, has been created and deployed; its execution can be triggered manually or at predefined time intervals. According to our evaluation, the Flowable platform proves to be powerful and flexible. Further usage of the platform is already planned or implemented for many of our projects.
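The harvest-and-sync cycle described above (pull works from a resource, detect additions or changes, assign a PID, index for search) can be rendered schematically. All function bodies below are placeholders, the PID prefix is hypothetical, and in the actual system these steps run as tasks inside a deployed Flowable workflow rather than as plain Python.

```python
# Schematic rendition of one sync pass over a dynamic text resource.
def sync_resource(resource, store):
    """Bring the local copy of one text resource up to date."""
    for work in resource["works"]:
        wid = work["id"]
        # New work, or the resource has modified an existing one.
        if wid not in store or store[wid]["version"] != work["version"]:
            record = dict(work)
            record["pid"] = mint_pid(wid)   # assign a PID, per the FAIR principles
            store[wid] = record
            index_for_search(record)        # keep the search engine current
    return store

def mint_pid(local_id):
    # Hypothetical prefix; real PIDs would come from a Handle/ePIC service.
    return f"21.11101/{local_id}"

def index_for_search(record):
    pass  # stand-in for submission to the search index

store = {}
resource = {"works": [{"id": "w1", "version": 1}, {"id": "w2", "version": 1}]}
sync_resource(resource, store)
print(sorted(store))  # ['w1', 'w2']
```

Triggering this pass on a timer or on demand, with error handling and notification around it, is exactly the part the deployed workflow takes over.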
Collaboration and the sharing of knowledge are at the heart of Open Science (OS). However, we need to know that the knowledge we find and share is really what it purports to be, and we need to know that the authors we hope to collaborate with are really the people they claim to be. In this paper, the author argues that a prerequisite for OS is trust, and that persistent identifiers help to build that trust. The persistent-identifier systems must themselves be trustworthy, and they must be able to connect the user, or their machine, to the information they need now and into the future. Infrastructure is rather like plumbing: it goes unnoticed and unappreciated until it fails. This paper puts infrastructure for persistent identifiers in the spotlight as a core component of OS.
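Connecting a user or machine to the information behind an identifier means resolution: the PID is handed to a resolver, which returns the current location(s). The sketch below parses a response shaped like the JSON a Handle-style resolver proxy returns; the PID, URL, and payload are fabricated for illustration, and no network request is made.

```python
import json

def resolver_url(pid: str) -> str:
    """Build a proxy resolution URL for a DOI-style PID (illustrative)."""
    return f"https://doi.org/api/handles/{pid}"

# Fabricated sample response in the general shape of a handle record:
# a list of typed values, one of which carries the current URL.
sample_response = json.dumps({
    "responseCode": 1,
    "handle": "10.5061/dryad.example",
    "values": [
        {"type": "URL", "data": {"value": "https://datadryad.org/stash/dataset/example"}},
    ],
})

def extract_location(body: str) -> str:
    """Pull the current landing URL out of a resolver response."""
    record = json.loads(body)
    for value in record["values"]:
        if value["type"] == "URL":
            return value["data"]["value"]
    raise LookupError("no URL value in handle record")

print(extract_location(sample_response))  # https://datadryad.org/stash/dataset/example
```

The trust argument is precisely that this indirection must keep working: the PID stays stable while the location behind it is maintained, so the resolver, not the user's bookmark, absorbs every move of the resource.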
Funding: funded by the U.S. National Science Foundation (Crossref Funder ID: 100000001, ROR: https://ror.org/021nxhr62), Award 2134956.