As the Internet of Things (IoT) and mobile devices have rapidly proliferated, their computationally intensive applications have developed into complex, concurrent IoT-based workflows involving multiple interdependent tasks. By exploiting its low latency and high bandwidth, mobile edge computing (MEC) has emerged to achieve high-performance computation offloading of these applications and satisfy the quality-of-service requirements of workflows and devices. In this study, we propose an offloading strategy for IoT-based workflows in a high-performance MEC environment. The proposed task-based offloading strategy is formulated as an optimization problem that accounts for task dependency, communication costs, workflow constraints, device energy consumption, and the heterogeneous characteristics of the edge environment. The optimal placement of workflow tasks is then found using a discrete teaching-learning-based optimization (DTLBO) metaheuristic. Extensive experimental evaluations demonstrate that the proposed offloading strategy is effective at minimizing the energy consumption of mobile devices and reducing the execution times of workflows compared to offloading strategies based on other metaheuristics, including particle swarm optimization and ant colony optimization.
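The discrete teaching-learning-based placement this abstract describes can be illustrated with a toy sketch. This is a generic discrete TLBO loop under assumed simplifications, not the authors' DTLBO: the cost function, mutation rate, and greedy acceptance rule here are illustrative choices.

```python
import random

def dtlbo(num_tasks, num_nodes, cost, iters=100, pop_size=20, seed=0):
    """Minimal discrete teaching-learning-based optimization sketch.

    A solution assigns each task to a node; `cost` maps a solution
    (list of node indices) to a scalar to be minimized, e.g. a weighted
    sum of device energy and workflow execution time.
    """
    rng = random.Random(seed)
    pop = [[rng.randrange(num_nodes) for _ in range(num_tasks)]
           for _ in range(pop_size)]

    def evolve(sol, guide):
        # Build a candidate that partially adopts the guide's assignments.
        cand = []
        for s, g in zip(sol, guide):
            r = rng.random()
            if r < 0.1:                       # small mutation keeps diversity
                cand.append(rng.randrange(num_nodes))
            elif r < 0.6:                     # learn the guide's assignment
                cand.append(g)
            else:                             # keep the current assignment
                cand.append(s)
        return cand if cost(cand) < cost(sol) else sol

    for _ in range(iters):
        teacher = min(pop, key=cost)          # teacher phase: learn from best
        pop = [evolve(s, teacher) for s in pop]
        for i in range(pop_size):             # learner phase: pairwise learning
            j = rng.randrange(pop_size)
            if j != i and cost(pop[j]) < cost(pop[i]):
                pop[i] = evolve(pop[i], pop[j])
    return min(pop, key=cost)
```

With a cost that simply sums per-task node indices (so node 0 is always cheapest), the loop converges to the all-zeros placement.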
Workflow-based systems are typically said to lead to better use of staff and better management and productivity. The first phase in building a workflow-based system is capturing the real-world process in a conceptual representation suitable for the subsequent phases of formalization and implementation. The specification may be in text or diagram form, or written in a formal language. This paper proposes a flow-based diagrammatic methodology as a tool for workflow specification. The expressiveness of the method is appraised through its ability to capture a workflow-based application. We show that the proposed conceptual diagrams can express situations arising in practice, offering an alternative to tools currently used in workflow systems. This is demonstrated by using the proposed methodology to partially build demo systems for two government agencies.
The current application of digital workflows for the understanding, promotion and participation in the conservation of heritage sites involves several technical challenges and should be governed by serious ethical engagement. Recording consists of capturing (or mapping) the physical characteristics of the character-defining elements that give cultural heritage sites their significance. Usually, the outcome of this work represents the cornerstone information serving their conservation, whether it is used actively for maintaining them or for ensuring a posterity record in case of destruction. The records produced can guide the decision-making process at different levels by property owners, site managers, public officials, and conservators around the world, as well as present the historical knowledge and values of these resources. Rigorous documentation may also serve a broader purpose: over time, it becomes the primary means by which scholars and the public apprehend a site that has since changed radically or disappeared. This contribution provides an overview of the potential applications and threats of the technology utilised by heritage recording professionals, and addresses the need to develop ethical principles that can improve heritage recording practice at large.
Research data infrastructures currently face a huge increase in the number of data objects, with an increasing variety of types (data types, formats) and of workflows by which objects need to be managed across their lifecycle. Researchers want to shorten the workflows from data generation to analysis and publication, and the full workflow needs to become transparent to multiple stakeholders, including research administrators and funders. This poses challenges for research infrastructures and user-oriented data services: not only making data and workflows findable, accessible, interoperable and reusable, but doing so in a way that leverages machine support for better efficiency. One primary need is findability, and achieving better findability benefits other aspects of data and workflow management as well. In this article, we describe how machine capabilities can be extended to make workflows more findable, in particular by leveraging the Digital Object Architecture, common object operations and machine learning techniques.
In Geographic Information Systems (GIS), geoprocessing workflows allow analysts to organize their methods on spatial data in complex chains. We propose a method for expressing workflows as linked data, and for semi-automatically enriching them with semantics on the level of their operations and datasets. Linked workflows can be easily published on the Web and queried for types of inputs, results, or tools. Thus, GIS analysts can reuse their workflows in a modular way, selecting, adapting, and recommending resources based on compatible semantic types. Our typing approach starts from minimal annotations of workflow operations with classes of GIS tools, and then propagates data types and implicit semantic structures through the workflow using an OWL typing scheme and SPARQL rules by backtracking over GIS operations. The method is implemented in Python and is evaluated on two real-world geoprocessing workflows, generated with Esri's ArcGIS. To illustrate the potential applications of our typing method, we formulate and execute competency questions over these workflows.
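The idea of querying a linked workflow for the types of its inputs, results, or tools can be sketched with plain triples and a hand-rolled competency question. The `ex:` vocabulary and the buffer example are invented for illustration; the paper's actual approach uses an OWL typing scheme and SPARQL rules.

```python
# Triples (subject, predicate, object) describing one hypothetical workflow
# step: a buffer tool reading a vector layer and producing a polygon layer.
triples = {
    ("ex:buffer1", "rdf:type", "ex:BufferTool"),
    ("ex:buffer1", "ex:input", "ex:roads"),
    ("ex:buffer1", "ex:output", "ex:roadZones"),
    ("ex:roads", "rdf:type", "ex:VectorLayer"),
    ("ex:roadZones", "rdf:type", "ex:PolygonLayer"),
}

def ops_producing(data_type):
    """Answer a competency question: which operations output data of `data_type`?

    Equivalent SPARQL over a real triple store:
        SELECT ?op WHERE { ?op ex:output ?d . ?d rdf:type <data_type> }
    """
    outputs = {(s, o) for s, p, o in triples if p == "ex:output"}
    typed = {s for s, p, o in triples if p == "rdf:type" and o == data_type}
    return sorted(op for op, d in outputs if d in typed)
```

Asking which operations produce a polygon layer returns the buffer step.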
When defining indicators on the environment, existing initiatives should be reused as a priority rather than redefining indicators each time. From an Information, Communication and Technology perspective, data interoperability and standardization are critical to improving data access and exchange, as promoted by the Group on Earth Observations. GEOEssential follows an end-user-driven approach by defining Essential Variables (EVs) as an intermediate layer between environmental policy indicators and their appropriate data sources. Environmental policies and indicators are increasingly percolating down from global to local agendas. The scientific business processes for the generation of EVs and related indicators can be formalized in workflows specifying the necessary logical steps. To this aim, GEOEssential is developing a Virtual Laboratory whose main objective is to instantiate conceptual workflows, which are stored in a dedicated knowledge base, generating executable workflows. To interpret and present the relevant outputs/results produced by the different thematic workflows considered in GEOEssential (i.e., biodiversity, ecosystems, extractives, night light, and the food-water-energy nexus), a Dashboard is built as a visual front-end. This is a valuable instrument for tracking progress towards environmental policies.
This paper focuses on the scheduling problem of workflow tasks that exhibit interdependencies. Unlike independent batch tasks, workflows typically consist of multiple subtasks with intrinsic correlations and dependencies. This necessitates distributing the various computational tasks to appropriate computing node resources in accordance with task dependencies to ensure the smooth completion of the entire workflow. Workflow scheduling must consider an array of factors, including task dependencies, the availability of computational resources, and the schedulability of tasks. Therefore, this paper delves into the distributed graph database workflow task scheduling problem and proposes a workflow scheduling methodology based on deep reinforcement learning (DRL). The method optimizes the maximum completion time (makespan) and response time of workflow tasks, aiming to enhance the responsiveness of workflow tasks while minimizing the makespan. The experimental results indicate that the Q-learning Deep Reinforcement Learning (Q-DRL) algorithm markedly diminishes the makespan and refines the average response time within distributed graph database environments. In makespan, Q-DRL achieves mean reductions of 12.4% and 11.9% over the established First-fit and Random scheduling strategies, respectively. Additionally, Q-DRL surpasses both the DRL-Cloud and Improved Deep Q-learning Network (IDQN) algorithms, with improvements of 4.4% and 2.6%, respectively. With reference to average response time, Q-DRL exhibits significantly enhanced performance in the scheduling of workflow tasks, decreasing the average by 2.27% and 4.71% compared to IDQN and DRL-Cloud, respectively. The Q-DRL algorithm also demonstrates a notable increase in the efficiency of system resource utilization, reducing the average idle rate by 5.02% and 9.30% in comparison to IDQN and DRL-Cloud, respectively. These findings support the assertion that Q-DRL not only upholds a lower average idle rate but also effectively curtails the average response time, thereby substantially improving processing efficiency and optimizing resource utilization within distributed graph database systems.
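The Q-learning core of such a scheduler can be illustrated with a toy: a chain of dependent tasks placed on heterogeneous nodes. The state, action, and reward definitions below are assumptions for illustration, not the paper's Q-DRL formulation.

```python
import random

def q_schedule(task_times, num_nodes, episodes=300, alpha=0.5, gamma=0.9,
               eps=0.2, seed=0):
    """Tabular Q-learning sketch for placing a chain of dependent tasks on
    heterogeneous nodes: state = task index, action = node choice, reward =
    negative runtime, so the greedy policy learns to minimize the makespan.

    task_times[t][n] is the runtime of task t on node n.
    """
    rng = random.Random(seed)
    n_tasks = len(task_times)
    Q = [[0.0] * num_nodes for _ in range(n_tasks)]
    for _ in range(episodes):
        for t in range(n_tasks):
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(num_nodes)
            else:
                a = max(range(num_nodes), key=lambda n: Q[t][n])
            reward = -task_times[t][a]
            nxt = max(Q[t + 1]) if t + 1 < n_tasks else 0.0
            Q[t][a] += alpha * (reward + gamma * nxt - Q[t][a])
    # Extract the learned greedy placement: one node per task.
    return [max(range(num_nodes), key=lambda n: Q[t][n]) for t in range(n_tasks)]
```

On a three-task chain where each task has one clearly faster node, the learned policy picks the faster node for every task.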
There is growing recognition of the interdependencies among the supply systems for food, water and energy. Billions of people lack safe and sufficient access to these systems, while global demand grows rapidly and resource constraints increase. Modeling frameworks are considered one of the few means available to understand the complex interrelationships among the sectors; however, the development of nexus-related frameworks has been limited. We describe three open-source models well known in their respective domains (i.e., TerrSysMP, WOFOST and SWAT) whose components, if combined, could help decision-makers address the nexus issue. As a first step, we propose the development of simple workflows utilizing essential variables and addressing components of the above-mentioned models, which can act as building blocks for an eventual comprehensive nexus model framework. The outputs of the workflows and the model framework are designed to address the SDGs.
Since their introduction by James Dixon in 2010, data lakes have attracted more and more attention, driven by the promise of high reusability of the stored data due to schema-on-read semantics. Building on this idea, several additional requirements have been discussed in the literature to improve the general usability of the concept, such as a central metadata catalog including all provenance information, overarching data governance, or integration with (high-performance) processing capabilities. Although the necessity of a logical and a physical organisation of data lakes in order to meet those requirements is widely recognized, no concrete guidelines have yet been provided. The most common architecture implementing this conceptual organisation is the zone architecture, where data is assigned to a certain zone depending on its degree of processing. This paper discusses how FAIR Digital Objects can be used in a novel approach to organize a data lake based on data types instead of zones, how they can be used to abstract the physical implementation, and how they empower generic and portable processing capabilities through a provenance-based approach.
We discuss the problem of accountability when multiple parties cooperate towards an end result, such as multiple companies in a supply chain or departments of a government service under different authorities. In cases where a fully trusted central point does not exist, it is difficult to obtain a trusted audit trail of a workflow when each individual participant is unaccountable to all others. We propose AudiWFlow, an auditing architecture that makes participants accountable for their contributions in a distributed workflow. Our scheme provides confidentiality in most cases, collusion detection, and availability of evidence after the workflow terminates. AudiWFlow is based on verifiable secret sharing and real-time peer-to-peer verification of records; it further supports multiple levels of assurance to meet a desired trade-off between the availability of evidence and the overhead resulting from the auditing approach. We propose and evaluate two implementation approaches for AudiWFlow. The first one is fully distributed except for a central auxiliary point that, nevertheless, needs only a low level of trust. The second one is based on smart contracts running on a public blockchain, which is able to remove the need for any central point but requires integration with a blockchain.
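The secret-sharing building block underlying such an architecture can be illustrated with plain Shamir sharing. Note this sketch omits the commitments that make the scheme *verifiable*, which AudiWFlow relies on; the prime and function names are illustrative.

```python
import random

P = 2**127 - 1  # a Mersenne prime; all arithmetic is modulo P

def split_secret(secret, n, k, seed=None):
    """Shamir (k, n) threshold sharing: any k of the n shares recover the
    secret, while fewer than k reveal nothing. The secret is the constant
    term of a random degree-(k-1) polynomial over GF(P)."""
    rng = random.Random(seed)
    coeffs = [secret % P] + [rng.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 over GF(P)."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total
```

Any three of five shares reconstruct the secret exactly.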
With the rapid development of grid techniques and the growing complexity of grid applications, reasoning about the temporal properties of grid workflows to probe potential pitfalls and errors is becoming critical for ensuring reliability and trustworthiness at the initial design phase. A state Pi calculus is proposed and implemented in this work, which not only enables flexible abstraction and management of historical grid system events, but also facilitates the modeling and temporal verification of grid workflows. Furthermore, a relaxed region analysis (RRA) approach is proposed to decompose large-scale grid workflows into sequentially composed regions with relaxation of parallel workflow branches, and the corresponding verification strategies are also decomposed following modular verification principles. Performance evaluation results show that the RRA approach can dramatically reduce the CPU time and memory usage of formal verification.
Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They can inherently contribute to the FAIR data principles: by processing data according to established metadata; by creating metadata themselves during the processing of data; and by tracking and recording data provenance. These properties aid data quality assessment and contribute to secondary data usage. Moreover, workflows are digital objects in their own right. This paper argues that FAIR principles for workflows need to address their specific nature in terms of their composition of executable software steps, their provenance, and their development.
The FAIR principles have been accepted globally as guidelines for improving data-driven science and data management practices, yet the incentives for researchers to change their practices are presently weak. In addition, data-driven science has been slow to embrace workflow technology despite clear evidence of recurring practices. To overcome these challenges, the Canonical Workflow Frameworks for Research (CWFR) initiative suggests a large-scale introduction of self-documenting workflow scripts to automate recurring processes or fragments thereof. This standardised approach, with FAIR Digital Objects as anchors, will be a significant milestone in the transition to FAIR data without adding additional load onto the researchers who stand to benefit most from it. This paper describes the CWFR approach and the activities of the CWFR initiative over the course of the last year or so, highlights several projects that hold promise for the CWFR approach, including Galaxy, Jupyter Notebook, and RO-Crate, and concludes with an assessment of the state of the field and the challenges ahead.
Data streaming applications, usually composed of sequential/parallel data processing tasks organized as a workflow, bring new challenges to workflow scheduling and resource allocation in grid environments. Due to the high volumes of data and the relatively limited storage capability, resource allocation and data streaming have to be storage-aware. Also, to improve system performance, data streaming and processing have to be concurrent. This study used a genetic algorithm (GA) for workflow scheduling, combined with on-line measurements and predictions from a gray model (GM). On-demand data streaming is used to avoid data overflow through repertory strategies. Tests show that tasks with on-demand data streaming must be balanced to improve overall performance, to avoid system bottlenecks and backlogs of intermediate data, and to increase data throughput for the data processing workflows as a whole.
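The gray-model prediction component mentioned above can be sketched as a standard GM(1,1) forecaster; the on-line integration with the GA scheduler is not reproduced here, and the formulation below is the textbook one rather than this paper's exact variant.

```python
import math

def gm11_forecast(series, steps=1):
    """GM(1,1) gray-model forecast sketch, as used for on-line load
    prediction: fit x0[k] + a*z1[k] = b by least squares on the
    accumulated series x1, then extrapolate the whitened solution."""
    n = len(series)
    x1 = [sum(series[:i + 1]) for i in range(n)]            # accumulation
    z = [0.5 * (x1[i] + x1[i - 1]) for i in range(1, n)]    # mean sequence
    y = series[1:]
    # Least squares for [a, b] in y[k] = -a*z[k] + b (normal equations).
    m = n - 1
    sz, szz = sum(z), sum(v * v for v in z)
    sy, szy = sum(y), sum(v * w for v, w in zip(z, y))
    det = m * szz - sz * sz
    a = (sz * sy - m * szy) / det
    b = (szz * sy - sz * szy) / det
    # Time response: x1_hat(k) = (x0(0) - b/a) * exp(-a*k) + b/a.
    preds = []
    for k in range(n, n + steps):
        x1_next = (series[0] - b / a) * math.exp(-a * k) + b / a
        x1_curr = (series[0] - b / a) * math.exp(-a * (k - 1)) + b / a
        preds.append(x1_next - x1_curr)                     # de-accumulate
    return preds
```

On a series growing 10% per step, the one-step forecast lands close to the true continuation.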
Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power pave the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities have been slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss FAIR Digital Objects (FDOs) and Research Objects (ROs) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast, scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to researchers via an easy-to-use graphical interface.
With the rapid growth of the Industrial Internet of Things (IIoT), Mobile Edge Computing (MEC) has become widely used in many emerging scenarios. In MEC, each workflow task can be executed locally or offloaded to the edge to help improve Quality of Service (QoS) and reduce energy consumption. However, most existing offloading strategies focus on independent applications and cannot be applied efficiently to workflow applications with a series of dependent tasks. To address this issue, this paper proposes an energy-efficient task offloading strategy for large-scale workflow applications in MEC. First, we formulate the task offloading problem as an optimization problem with the goal of minimizing the utility cost, which is the trade-off between energy consumption and the total execution time. Then, a novel heuristic algorithm named Green DVFS-GA is proposed, which includes a task offloading step based on a genetic algorithm and a further step that reduces energy consumption using the Dynamic Voltage and Frequency Scaling (DVFS) technique. Experimental results show that our proposed strategy can significantly reduce energy consumption and achieves the best trade-off compared with other strategies.
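The energy/time trade-off that DVFS exploits can be sketched with a toy utility-cost model. The quadratic energy term is a common CMOS approximation (dynamic energy roughly scales with frequency squared at scaled voltage, while execution time scales with 1/frequency); `kappa`, the weight `w`, and the frequency levels are illustrative, not the paper's parameters.

```python
def utility_cost(cycles, freq, w=0.5, kappa=0.1):
    """Weighted trade-off between energy and execution time for running
    `cycles` work units at frequency `freq` (arbitrary units).

    time   = cycles / freq          (slower clock -> longer runtime)
    energy ~ kappa * cycles * freq^2 (faster clock -> superlinear energy)
    """
    t = cycles / freq
    e = kappa * cycles * freq ** 2
    return w * e + (1 - w) * t

def best_frequency(cycles, levels, w=0.5):
    """Pick the discrete frequency level minimizing the utility cost,
    mimicking the per-task frequency selection a DVFS step performs."""
    return min(levels, key=lambda f: utility_cost(cycles, f, w))
```

Neither the lowest nor the highest level wins: an intermediate frequency balances the two terms.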
Literate computing environments, such as Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of their experiments in an integrated document. To scale up scientific analyses, many implemented Jupyter environment architectures encapsulate whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). The existing solutions are still limited in many ways: 1) the workflow (or pipeline) is implicit in a notebook, and some steps could be generically reused by different code and executed in parallel, but because of the tight cell structure, all steps in a Jupyter notebook have to be executed sequentially and the core code fragments cannot be flexibly reused; and 2) there are performance bottlenecks in parallelism and scalability when handling extensive input data and complex computation. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portal components, 2) provide a composition tool for describing the workflow logic of those reusable components, and 3) automate the execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible, but that it needs further improvement, especially on integrating distributed workflow scheduling, automatic deployment, and execution, to develop into a mature approach.
In this paper we present the derivation of Canonical Workflow Modules from current workflows in simulation-based climate science, in support of the elaboration of a corresponding framework for simulation-based research. We first identified the different users and user groups in simulation-based climate science based on their reasons for using the resources provided at the German Climate Computing Center (DKRZ). What is special here is that the DKRZ provides the climate science community with resources like high-performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC); therefore, users can perform their entire research workflows, up to the publication of the data, on the same infrastructure. Our analysis shows that the resources are used by two primary user types: those who require the HPC system to perform resource-intensive simulations and subsequently analyse them, and those who reuse, build on and analyse existing data. We then further subdivided these top-level user categories based on their specific goals and analysed the typical, idealised workflows applied to achieve the respective project goals. We find that, due to the subdivision and further granulation of the user groups, the workflows show apparent differences. Nevertheless, similar "Canonical Workflow Modules" can clearly be made out. These modules are "Data and Software (Re)use", "Compute", "Data and Software Storing", "Data and Software Publication" and "Generating Knowledge", and in their entirety they form the basis for a Canonical Workflow Framework for Research (CWFR). It is desirable that parts of the workflows in a CWFR act as FDOs, but we view this aspect critically. We also reflect on whether the derivation of Canonical Workflow Modules from the analysis of current user behaviour still holds for future systems and work processes.
In the Canonical Workflow Framework for Research (CWFR), "packages" are relevant in two different directions. In data science, workflows are in general executed on a set of files which have been aggregated for specific purposes, such as training a model in deep learning. We call this type of "package" a data collection; its aggregation and metadata description are motivated by research interests. The other type of "package" relevant for CWFR is supposed to represent workflows in a self-describing and self-contained way for later execution. In this paper, we review different packaging technologies and investigate their usability in the context of CWFR. For this purpose, we draw on an exemplary use case and show how packaging technologies can support its realization. We conclude that packaging technologies of different flavors help provide inputs and outputs for workflow steps in a machine-readable way, as well as represent a workflow and all its artifacts in a self-describing and self-contained way.
文摘As the Internet of Things(IoT)and mobile devices have rapidly proliferated,their computationally intensive applications have developed into complex,concurrent IoT-based workflows involving multiple interdependent tasks.By exploiting its low latency and high bandwidth,mobile edge computing(MEC)has emerged to achieve the high-performance computation offloading of these applications to satisfy the quality-of-service requirements of workflows and devices.In this study,we propose an offloading strategy for IoT-based workflows in a high-performance MEC environment.The proposed task-based offloading strategy consists of an optimization problem that includes task dependency,communication costs,workflow constraints,device energy consumption,and the heterogeneous characteristics of the edge environment.In addition,the optimal placement of workflow tasks is optimized using a discrete teaching learning-based optimization(DTLBO)metaheuristic.Extensive experimental evaluations demonstrate that the proposed offloading strategy is effective at minimizing the energy consumption of mobile devices and reducing the execution times of workflows compared to offloading strategies using different metaheuristics,including particle swarm optimization and ant colony optimization.
文摘Workflow-based systems are typically said to lead to better use of staff and better management and productivity. The first phase in building a workflow-based system is capturing the real-world process in a conceptual representation suitable for the following phases of formalization and implementation. The specification may be in text or diagram form or written in a formal language. This paper proposes a flow-based diagrammatic methodology as a tool for workflow specification. The expressiveness of the method is appraised though its ability to capture a workflow-based application. Here we show that the proposed conceptual diagrams are able to express situations arising in practice as an alternative to tools currently used in workflow systems. This is demonstrated by using the proposed methodology to partial build demo systems for two government agencies.
文摘The current application of digital workflows for the understanding,promotion and participation in the conservation of heritage sites involves several technical challenges and should be governed by serious ethical engagement.Recording consists of capturing(or mapping)the physical characteristics of character-defining elements that provide the significance of cultural heritage sites.Usually,the outcome of this work represents the cornerstone information serving for their conservation,whatever it uses actively for maintaining them or for ensuring a posterity record in case of destruction.The records produced could guide the decision-making process at different levels by property owners,site managers,public officials,and conservators around the world,as well as to present historical knowledge and values of these resources.Rigorous documentation may also serve a broader purpose:over time,it becomes the primary means by which scholars and the public apprehends a site that has since changed radically or disappeared.This contribution is aimed at providing an overview of the potential application and threats of technology utilised by a heritage recording professional by addressing the need to develop ethical principles that can improve the heritage recording practice at large.
文摘Research data currently face a huge increase of data objects with an increasing variety of types(data types,formats)and variety of workflows by which objects need to be managed across their lifecycle by data infrastructures.Researchers desire to shorten the workflows from data generation to analysis and publication,and the full workflow needs to become transparent to multiple stakeholders,including research administrators and funders.This poses challenges for research infrastructures and user-oriented data services in terms of not only making data and workflows findable,accessible,interoperable and reusable,but also doing so in a way that leverages machine support for better efficiency.One primary need to be addressed is that of findability,and achieving better findability has benefits for other aspects of data and workflow management.In this article,we describe how machine capabilities can be extended to make workflows more findable,in particular by leveraging the Digital Object Architecture,common object operations and machine learning techniques.
Abstract: In Geographic Information Systems (GIS), geoprocessing workflows allow analysts to organize their methods on spatial data into complex chains. We propose a method for expressing workflows as linked data and for semi-automatically enriching them with semantics at the level of their operations and datasets. Linked workflows can easily be published on the Web and queried for types of inputs, results, or tools. Thus, GIS analysts can reuse their workflows in a modular way, selecting, adapting, and recommending resources based on compatible semantic types. Our typing approach starts from minimal annotations of workflow operations with classes of GIS tools, and then propagates data types and implicit semantic structures through the workflow using an OWL typing scheme and SPARQL rules by backtracking over GIS operations. The method is implemented in Python and is evaluated on two real-world geoprocessing workflows generated with Esri's ArcGIS. To illustrate the potential applications of our typing method, we formulate and execute competency questions over these workflows.
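The core idea of the abstract above, propagating semantic types from minimal tool-class annotations through a workflow, can be illustrated with a small sketch. This is not the paper's implementation (which uses OWL and SPARQL); the tool classes and type names here are invented for the example.

```python
# Minimal sketch: infer a semantic type for every dataset in a
# geoprocessing workflow from tool-class annotations alone.
# Tool classes and type names below are hypothetical.

# Tool class -> (accepted input types, produced output type)
TOOL_SIGNATURES = {
    "Buffer":    ({"VectorLayer"},  "PolygonLayer"),
    "Intersect": ({"PolygonLayer"}, "PolygonLayer"),
    "Rasterize": ({"PolygonLayer"}, "RasterLayer"),
}

def propagate_types(workflow, initial_types):
    """Propagate dataset types through the workflow.

    workflow: list of (tool_class, input_ids, output_id) in execution order.
    initial_types: dict mapping source dataset ids to semantic types.
    """
    types = dict(initial_types)
    for tool, inputs, output in workflow:
        accepted, produced = TOOL_SIGNATURES[tool]
        for ds in inputs:
            if types[ds] not in accepted:
                raise TypeError(f"{tool} cannot consume {types[ds]} ({ds})")
        types[output] = produced          # enrich the workflow annotation
    return types

workflow = [
    ("Buffer",    ["roads"],     "roads_buf"),
    ("Intersect", ["roads_buf"], "zones"),
    ("Rasterize", ["zones"],     "zones_raster"),
]
types = propagate_types(workflow, {"roads": "VectorLayer"})
```

Once every dataset carries a type, competency questions such as "which workflows produce a raster from vector input?" reduce to simple lookups over the enriched annotations.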
Funding: This work was supported by the European Commission [grant number H2020 ERA-PLANET project No. 689443].
Abstract: When defining indicators on the environment, the use of existing initiatives should be a priority rather than redefining indicators each time. From an Information, Communication and Technology perspective, data interoperability and standardization are critical to improving data access and exchange, as promoted by the Group on Earth Observations. GEOEssential follows an end-user-driven approach by defining Essential Variables (EVs) as an intermediate layer between environmental policy indicators and their appropriate data sources. Environmental policies and indicators are increasingly percolating down from the global to the local agenda. The scientific business processes for the generation of EVs and related indicators can be formalized in workflows specifying the necessary logical steps. To this aim, GEOEssential is developing a Virtual Laboratory whose main objective is to instantiate conceptual workflows, stored in a dedicated knowledge base, into executable workflows. To interpret and present the relevant outputs produced by the different thematic workflows considered in GEOEssential (i.e. biodiversity, ecosystems, extractives, night light, and the food-water-energy nexus), a Dashboard is built as a visual front-end. This is a valuable instrument for tracking progress towards environmental policy goals.
Funding: Funded by the Science and Technology Foundation of State Grid Corporation of China (Grant No. 5108-202218280A-2-397-XG).
Abstract: This paper focuses on the scheduling problem of workflow tasks that exhibit interdependencies. Unlike independent batch tasks, workflows typically consist of multiple subtasks with intrinsic correlations and dependencies. Scheduling them necessitates distributing the various computational tasks to appropriate computing node resources in accordance with task dependencies, to ensure the smooth completion of the entire workflow. Workflow scheduling must consider an array of factors, including task dependencies, availability of computational resources, and the schedulability of tasks. This paper therefore examines the workflow task scheduling problem for distributed graph databases and proposes a scheduling method based on deep reinforcement learning (DRL). The method optimizes the maximum completion time (makespan) and the response time of workflow tasks, aiming to enhance responsiveness while ensuring the makespan is minimized. The experimental results indicate that the Q-learning Deep Reinforcement Learning (Q-DRL) algorithm markedly reduces the makespan and improves the average response time in distributed graph database environments. In terms of makespan, Q-DRL achieves mean reductions of 12.4% and 11.9% over the established First-fit and Random scheduling strategies, respectively. Q-DRL also surpasses the DRL-Cloud and Improved Deep Q-learning Network (IDQN) algorithms, with improvements of 4.4% and 2.6%, respectively. With respect to average response time, Q-DRL performs significantly better, decreasing the average by 2.27% and 4.71% compared to IDQN and DRL-Cloud, respectively. The Q-DRL algorithm also demonstrates a notable increase in the efficiency of system resource utilization, reducing the average idle rate by 5.02% and 9.30% in comparison to IDQN and DRL-Cloud, respectively. These findings support the assertion that Q-DRL not only maintains a lower average idle rate but also effectively curtails the average response time, thereby substantially improving processing efficiency and optimizing resource utilization within distributed graph database systems.
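To make the Q-learning core of such a scheduler concrete, the following toy sketch assigns a short chain of workflow tasks to compute nodes using a tabular Q update. It is far simpler than the paper's Q-DRL (no neural network, no graph-database environment); the task/node execution times and hyperparameters are assumptions for the demonstration.

```python
import random

# Toy tabular Q-learning scheduler: state = index of the next workflow
# task, action = which node to run it on, reward = negative execution
# time (so shorter runs are preferred). All numbers are illustrative.
random.seed(0)
N_TASKS, N_NODES = 4, 2
exec_time = [[3, 1], [2, 2], [4, 1], [1, 3]]   # assumed cost of task t on node a

Q = [[0.0] * N_NODES for _ in range(N_TASKS)]
alpha, gamma, eps = 0.5, 0.9, 0.2              # learning rate, discount, exploration

for episode in range(500):
    for task in range(N_TASKS):                # tasks run in dependency order
        if random.random() < eps:              # epsilon-greedy exploration
            node = random.randrange(N_NODES)
        else:
            node = max(range(N_NODES), key=lambda a: Q[task][a])
        reward = -exec_time[task][node]
        future = max(Q[task + 1]) if task + 1 < N_TASKS else 0.0
        Q[task][node] += alpha * (reward + gamma * future - Q[task][node])

# Greedy policy after training: preferred node per task
policy = [max(range(N_NODES), key=lambda a: Q[t][a]) for t in range(N_TASKS)]
```

In the learned policy, each task ends up on its faster node; a DRL variant replaces the Q table with a network so the state can encode resource load and task dependencies.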
Funding: The authors would like to acknowledge the European Commission Horizon 2020 Program, which funded both the ERA-PLANET/GEOEssential (Grant Agreement No. 689443) and ConnectinGEO (Grant Agreement No. 641538) projects.
Abstract: There is a growing recognition of the interdependencies among the supply systems that provide food, water and energy. Billions of people lack safe and sufficient access to these systems, a situation compounded by rapidly growing global demand and increasing resource constraints. Modeling frameworks are considered one of the few means available to understand the complex interrelationships among the sectors; however, the development of nexus-related frameworks has been limited. We describe three open-source models well known in their respective domains (i.e. TerrSysMP, WOFOST and SWAT) whose components, if combined, could help decision-makers address the nexus issue. As a first step, we propose the development of simple workflows utilizing essential variables and addressing components of the above-mentioned models, which can act as building blocks to be used ultimately in a comprehensive nexus model framework. The outputs of the workflows and the model framework are designed to address the SDGs.
Funding: Funded by the "Niedersächsisches Vorab" funding line of the Volkswagen Foundation.
Abstract: Since their introduction by James Dixon in 2010, data lakes have received more and more attention, driven by the promise of high reusability of the stored data due to schema-on-read semantics. Building on this idea, several additional requirements have been discussed in the literature to improve the general usability of the concept, such as a central metadata catalog including all provenance information, overarching data governance, or integration with (high-performance) processing capabilities. Although the necessity of a logical and a physical organisation of data lakes in order to meet those requirements is widely recognized, no concrete guidelines have yet been provided. The most common architecture implementing this conceptual organisation is the zone architecture, where data is assigned to a certain zone depending on its degree of processing. This paper discusses how FAIR Digital Objects can be used in a novel approach to organize a data lake based on data types instead of zones, how they can be used to abstract the physical implementation, and how they empower generic and portable processing capabilities based on a provenance-based approach.
Abstract: We discuss the problem of accountability when multiple parties cooperate towards an end result, such as multiple companies in a supply chain or departments of a government service under different authorities. In cases where a fully trusted central point does not exist, it is difficult to obtain a trusted audit trail of a workflow when each individual participant is unaccountable to all others. We propose AudiWFlow, an auditing architecture that makes participants accountable for their contributions in a distributed workflow. Our scheme provides confidentiality in most cases, collusion detection, and availability of evidence after the workflow terminates. AudiWFlow is based on verifiable secret sharing and real-time peer-to-peer verification of records; it further supports multiple levels of assurance to meet a desired trade-off between the availability of evidence and the overhead resulting from the auditing approach. We propose and evaluate two implementation approaches for AudiWFlow. The first is fully distributed except for a central auxiliary point that, nevertheless, needs only a low level of trust. The second is based on smart contracts running on a public blockchain, which removes the need for any central point but requires integration with a blockchain.
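The secret-sharing building block that schemes like AudiWFlow rely on can be shown in a few lines. The sketch below is the classic Shamir (k, n) threshold scheme over a small prime field; the field size and parameters are chosen for the demo and are not those of the paper, which additionally uses *verifiable* secret sharing with peer verification.

```python
import random

# Shamir secret sharing over GF(P): split a secret into n shares such
# that any k of them reconstruct it, but k-1 reveal nothing.
P = 2087                      # small prime modulus, demo-sized only
random.seed(1)

def split(secret, n, k):
    """Return n points of a random degree-(k-1) polynomial with f(0)=secret."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = split(secret=1234, n=5, k=3)
recovered = reconstruct(shares[:3])
```

In an auditing setting, each participant's record fragment can be shared this way so that evidence survives the workflow even if some participants later refuse to cooperate, while no single participant can read another's record alone.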
Funding: Supported by the National Basic Research 973 Program of China under Grant Nos. 2011CB302805 and 2011CB302505, the National High Technology Research and Development 863 Program of China under Grant No. 2011AA040501, and the National Natural Science Foundation of China under Grant No. 60803017. Fan Zhang is supported by an IBM 2011-2012 Ph.D. Fellowship.
Abstract: With the rapid development of grid techniques and the growing complexity of grid applications, reasoning about the temporal properties of grid workflows to probe potential pitfalls and errors is becoming critical for ensuring reliability and trustworthiness at the initial design phase. A state Pi calculus is proposed and implemented in this work, which not only enables flexible abstraction and management of historical grid system events, but also facilitates modeling and temporal verification of grid workflows. Furthermore, a relaxed region analysis (RRA) approach is proposed to decompose large-scale grid workflows into sequentially composed regions, with relaxation of parallel workflow branches, and the corresponding verification strategies are also decomposed following modular verification principles. Performance evaluation results show that the RRA approach can dramatically reduce the CPU time and memory usage of formal verification.
Funding: Carole Goble acknowledges funding by BioExcel2 (H2020 823830), IBISBA1.0 (H2020 730976) and EOSCLife (H2020 824087). Daniel Schober's work was financed by PhenoMeNal (H2020 654241) at the initiation phase of this effort; current work is an in-kind contribution. Kristian Peters is funded by the German Network for Bioinformatics Infrastructure (de.NBI) and acknowledges BMBF funding under grant number 031L0107. Stian Soiland-Reyes is funded by BioExcel2 (H2020 823830). Daniel Garijo and Yolanda Gil gratefully acknowledge support from DARPA award W911NF-18-1-0027, NIH award 1R01AG059874-01, and NSF award ICER-1740683.
Abstract: Computational workflows describe the complex multi-step methods used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They can inherently contribute to the FAIR data principles: by processing data according to established metadata; by creating metadata themselves during the processing of data; and by tracking and recording data provenance. These properties aid data quality assessment and contribute to secondary data usage. Moreover, workflows are digital objects in their own right. This paper argues that FAIR principles for workflows need to address their specific nature in terms of their composition of executable software steps, their provenance, and their development.
Abstract: The FAIR principles have been accepted globally as guidelines for improving data-driven science and data management practices, yet the incentives for researchers to change their practices are presently weak. In addition, data-driven science has been slow to embrace workflow technology despite clear evidence of recurring practices. To overcome these challenges, the Canonical Workflow Frameworks for Research (CWFR) initiative suggests a large-scale introduction of self-documenting workflow scripts to automate recurring processes or fragments thereof. This standardised approach, with FAIR Digital Objects as anchors, will be a significant milestone in the transition to FAIR data without adding additional load onto the researchers who stand to benefit most from it. This paper describes the CWFR approach and the activities of the CWFR initiative over the course of the last year or so, highlights several projects that hold promise for CWFR approaches, including Galaxy, Jupyter Notebook, and RO-Crate, and concludes with an assessment of the state of the field and the challenges ahead.
Funding: Supported by the National Natural Science Foundation of China (No. 60803017), the National High-Tech Research and Development (863) Program of China (Nos. 2006AA10Z237, 2007AA01Z179, and 2008AA01Z118), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, and the FIT Foundation of Tsinghua University.
Abstract: Data streaming applications, usually composed of sequential/parallel data processing tasks organized as a workflow, bring new challenges to workflow scheduling and resource allocation in grid environments. Due to the high volumes of data and relatively limited storage capability, resource allocation and data streaming have to be storage-aware. Also, to improve system performance, data streaming and processing have to be concurrent. This study used a genetic algorithm (GA) for workflow scheduling, using on-line measurements and predictions with a gray model (GM). On-demand data streaming is used to avoid data overflow through repertory strategies. Tests show that tasks with on-demand data streaming must be balanced to improve overall performance, to avoid system bottlenecks and backlogs of intermediate data, and to increase data throughput for the data processing workflows as a whole.
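The gray-model predictor mentioned above is concrete enough to sketch: GM(1,1) fits a first-order gray differential equation to a short measurement series and extrapolates the next value, which a scheduler can use as an on-line load or throughput forecast. This is a textbook GM(1,1) implementation, not the paper's code; the input series is invented (a 10% growth sequence).

```python
import math

# GM(1,1) gray model: fit x0[k] = -a*z1[k] + b on the accumulated
# series and predict the next raw value by differencing the
# time-response function.
def gm11_predict(x0):
    """Fit GM(1,1) to the sequence x0 and predict its next value."""
    n = len(x0)
    x1 = [sum(x0[:i + 1]) for i in range(n)]            # accumulated (AGO) series
    z1 = [0.5 * (x1[i] + x1[i - 1]) for i in range(1, n)]  # mean-generated series
    # Least-squares solution of [a, b] via the 2x2 normal equations
    m = n - 1
    szz = sum(z * z for z in z1)
    sz = sum(z1)
    sy = sum(x0[1:])
    szy = sum(z * y for z, y in zip(z1, x0[1:]))
    det = szz * m - sz * sz
    a = -(m * szy - sz * sy) / det
    b = (szz * sy - sz * szy) / det
    def x1_hat(k):                                      # time-response, k 0-indexed
        return (x0[0] - b / a) * math.exp(-a * k) + b / a
    return x1_hat(n) - x1_hat(n - 1)                    # next x0 by differencing

pred = gm11_predict([2.0, 2.2, 2.42, 2.662])            # assumed 10%-growth series
```

For this near-exponential input the prediction lands close to the true continuation (about 2.93), which is the property that makes GM(1,1) attractive for short, trend-dominated measurement windows.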
Funding: The authors thank the German Bundesministerium für Bildung und Forschung (BMBF) for funding the DeepRain project under grant agreement 01 IS18047A-E.
Abstract: Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power pave the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities have been slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss FAIR Digital Objects (FDOs) and Research Objects (ROs) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast, scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to researchers via an easy-to-use graphical interface.
Funding: Supported by the National Natural Science Foundation of China (62102292), the Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology) of China (HBIRL202103, HBIRL202204), the Science Foundation Research Project of Wuhan Institute of Technology of China (K202035), and the Graduate Innovative Fund of Wuhan Institute of Technology of China (CX2021265).
Abstract: With the rapid growth of the Industrial Internet of Things (IIoT), Mobile Edge Computing (MEC) has become widely used in many emerging scenarios. In MEC, each workflow task can be executed locally or offloaded to the edge to help improve Quality of Service (QoS) and reduce energy consumption. However, most existing offloading strategies focus on independent applications and cannot be applied efficiently to workflow applications with a series of dependent tasks. To address this issue, this paper proposes an energy-efficient task offloading strategy for large-scale workflow applications in MEC. First, we formulate task offloading as an optimization problem with the goal of minimizing the utility cost, which is the trade-off between energy consumption and total execution time. Then, a novel heuristic algorithm named Green DVFS-GA is proposed, which includes a task offloading step based on the genetic algorithm and a further step to reduce energy consumption using the Dynamic Voltage and Frequency Scaling (DVFS) technique. Experimental results show that the proposed strategy can significantly reduce energy consumption and achieves the best trade-off compared with other strategies.
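The utility-cost formulation described above, a weighted trade-off between energy and execution time, can be illustrated with a toy model. All power, bandwidth, and speedup numbers below are assumptions for the example, the time model is naively sequential, and exhaustive search stands in for the paper's genetic algorithm and DVFS step.

```python
from itertools import product

# Toy offloading model: each task runs locally (device energy, device
# time) or is offloaded (transmit data, then run faster on the edge).
# The plan minimizing U = w*energy + (1-w)*time is the "best trade-off".
tasks = [("t1", 4.0, 1.0), ("t2", 2.0, 10.0), ("t3", 6.0, 1.5)]
# (name, local execution time in s, data to transmit in MB) - assumed

LOCAL_POWER, TX_POWER = 0.9, 0.3      # device watts: compute vs radio (assumed)
EDGE_SPEEDUP, BANDWIDTH = 3.0, 2.0    # edge is 3x faster; 2 MB/s uplink (assumed)

def utility(plan, w=0.5):
    """Weighted utility cost of an offloading plan ('local'/'edge' per task)."""
    energy = time = 0.0
    for (name, t_local, data), where in zip(tasks, plan):
        if where == "local":
            energy += LOCAL_POWER * t_local
            time += t_local
        else:                          # offload: transmit, then run on edge
            tx = data / BANDWIDTH
            energy += TX_POWER * tx    # device only pays for transmission
            time += tx + t_local / EDGE_SPEEDUP
    return w * energy + (1 - w) * time

# With 3 tasks, exhaustive search over 2^3 plans replaces the GA
best = min(product(["local", "edge"], repeat=len(tasks)), key=utility)
```

Note how the data-heavy task t2 stays local: its transmission cost outweighs the edge speedup, which is exactly the kind of dependency-and-communication trade-off that makes workflow offloading harder than offloading independent tasks.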
Funding: Partially funded by the European Union's Horizon 2020 research and innovation programme through the CLARIFY project under the Marie Skłodowska-Curie grant agreement No. 860627, the ARTICONF project (grant agreement No. 825134), the ENVRI-FAIR project (grant agreement No. 824068), the BLUECLOUD project (grant agreement No. 862409), and LifeWatch ERIC.
Abstract: Literate computing environments, such as Jupyter (i.e., Jupyter Notebook, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of their experiments in an integrated document. To scale up scientific analyses, many Jupyter environment architectures encapsulate whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). The existing solutions are still limited in many ways: 1) the workflow (or pipeline) is implicit in a notebook, and although some steps could be reused generically by different code and executed in parallel, the tight cell structure forces all steps in a Jupyter notebook to be executed sequentially and makes it hard to reuse core code fragments; and 2) there are performance bottlenecks in parallelism and scalability when handling extensive input data and complex computation. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portable components, 2) provide a composition tool for describing the workflow logic of those reusable components, and 3) automate execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible but needs further improvement, especially in integrating distributed workflow scheduling, automatic deployment, and execution, to develop into a mature approach.
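The composition idea in the abstract above, reusable notebook cells wired together by a declarative workflow description, can be sketched in-process. In a real deployment each component would be a containerized RESTful service; here plain functions stand in for the services, and all component names, the `$step` reference syntax, and the stand-in LiDAR numbers are invented for the example.

```python
# Minimal sketch of cell-as-component composition: register reusable
# "cells" under names, then execute a declarative workflow that wires
# step outputs ("$step") into later steps' inputs.
COMPONENTS = {}

def component(name):
    """Decorator registering a function as a named, reusable component."""
    def register(fn):
        COMPONENTS[name] = fn
        return fn
    return register

@component("load")
def load(path):
    return [1.0, 4.0, 2.0, 8.0]        # stand-in for reading a LiDAR tile

@component("filter")
def filter_points(points, threshold):
    return [p for p in points if p <= threshold]

@component("stats")
def stats(points):
    return {"count": len(points), "mean": sum(points) / len(points)}

def run(workflow, inputs):
    """Execute steps in order, resolving '$step' references to prior results."""
    results = dict(inputs)
    for step, comp, args in workflow:
        resolved = [results[a[1:]] if isinstance(a, str) and a.startswith("$")
                    else a for a in args]
        results[step] = COMPONENTS[comp](*resolved)
    return results

out = run(
    [("pts", "load", ["tile_001.laz"]),
     ("low", "filter", ["$pts", 5.0]),
     ("summary", "stats", ["$low"])],
    inputs={},
)
```

Because the workflow is explicit data rather than cell order, independent steps can in principle be dispatched in parallel and individual components swapped or reused, which is what the tight sequential cell structure of a plain notebook prevents.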
Funding: Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2037 "CLICCS - Climate, Climatic Change, and Society" - Project No. 390683824.
Abstract: In this paper we present the derivation of Canonical Workflow Modules from current workflows in simulation-based climate science, in support of the elaboration of a corresponding framework for simulation-based research. We first identified the different users and user groups in simulation-based climate science based on their reasons for using the resources provided at the German Climate Computing Center (DKRZ). What is special here is that the DKRZ provides the climate science community with resources such as high-performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC); users can therefore perform their entire research workflows, up to the publication of the data, on the same infrastructure. Our analysis shows that the resources are used by two primary user types: those who require the HPC system to perform resource-intensive simulations and subsequently analyse them, and those who reuse, build on and analyse existing data. We then further subdivided these top-level user categories based on their specific goals and analysed the typical, idealised workflows applied to achieve the respective project goals. We find that, due to the subdivision and further granulation of the user groups, the workflows show apparent differences. Nevertheless, similar "Canonical Workflow Modules" can clearly be made out. These modules are "Data and Software (Re)use", "Compute", "Data and Software Storing", "Data and Software Publication" and "Generating Knowledge", and in their entirety they form the basis for a Canonical Workflow Framework for Research (CWFR). It is desirable that parts of the workflows in a CWFR act as FDOs, but we view this aspect critically. We also reflect on whether the derivation of Canonical Workflow Modules from the analysis of current user behaviour still holds for future systems and work processes.
Abstract: In the Canonical Workflow Framework for Research (CWFR), "packages" are relevant in two different directions. In data science, workflows are in general executed on a set of files which have been aggregated for specific purposes, such as training a model in deep learning. We call this type of "package" a data collection; its aggregation and metadata description are motivated by research interests. The other type of "package" relevant to CWFR is supposed to represent workflows in a self-describing and self-contained way for later execution. In this paper, we review different packaging technologies and investigate their usability in the context of CWFR. For this purpose, we draw on an exemplary use case and show how packaging technologies can support its realization. We conclude that packaging technologies of different flavours help provide inputs and outputs for workflow steps in a machine-readable way, as well as represent a workflow and all its artifacts in a self-describing and self-contained way.