Funding: This work was partially funded by the European Union's Horizon 2020 research and innovation programme through the CLARIFY project under the Marie Sklodowska-Curie grant agreement No 860627, the ARTICONF project (grant agreement No 825134), the ENVRI-FAIR project (grant agreement No 824068), and the BLUECLOUD project (grant agreement No 862409), and by the LifeWatch ERIC.
Abstract: Literate computing environments, such as Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of their experiments in an integrated document. To scale up scientific analyses, many deployed Jupyter environment architectures encapsulate whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). Existing solutions are still limited in many ways, e.g., 1) the workflow (or pipeline) is implicit in a notebook, and some steps could be reused by different code and executed in parallel, but because of the tight cell structure, all steps in a Jupyter notebook have to be executed sequentially, and the core code fragments cannot be flexibly reused; and 2) there are performance bottlenecks in parallelism and scalability when handling extensive input data and complex computation. In this work, we focus on seamlessly managing the workflow in a notebook. We 1) encapsulate the reusable cells as RESTful services and containerize them as portable components, 2) provide a composition tool for describing the workflow logic of those reusable components, and 3) automate the execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible, but that it needs further improvement, especially in integrating distributed workflow scheduling, automatic deployment, and execution, before it can develop into a mature approach.
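To make the first step concrete, the following is a minimal sketch, not the authors' implementation, of how a reusable notebook cell could be exposed as a RESTful service before being containerized. The function name `split_lidar_tiles`, the `/run` endpoint, the payload fields, and the use of Flask are all illustrative assumptions.

```python
# Minimal sketch: exposing a reusable notebook cell as a RESTful service.
# The cell body (here, a hypothetical LiDAR tiling step) is lifted into a
# function and served over HTTP so a workflow composer can invoke it remotely.
from flask import Flask, request, jsonify

app = Flask(__name__)

def split_lidar_tiles(input_path: str, tile_count: int) -> list:
    """Hypothetical cell logic: split a LiDAR point cloud into tiles.

    A real component would read the file and write tiles to shared
    storage; here we only return the planned tile identifiers.
    """
    return [f"{input_path}.tile-{i}" for i in range(tile_count)]

@app.route("/run", methods=["POST"])
def run():
    # The workflow composer POSTs the cell's parameters as JSON.
    params = request.get_json(force=True)
    tiles = split_lidar_tiles(params["input_path"], int(params["tile_count"]))
    return jsonify({"status": "finished", "outputs": tiles})

if __name__ == "__main__":
    # Inside a container, the service would listen on a fixed port.
    app.run(host="0.0.0.0", port=8080)
```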
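The composition and parallel-execution ideas (steps 2 and 3) can likewise be sketched in plain Python: a hypothetical driver calls the tiling service once, then fans the resulting tiles out in parallel to a second containerized component, e.g., a point-cloud feature extractor. The service URLs and payload shapes are assumptions for illustration; the paper's actual composition tool is not reproduced here.

```python
# Sketch of composing two containerized cell services into a small workflow:
# one sequential tiling step, then a parallel map over the produced tiles.
# Endpoints and payload fields are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
import requests

TILER_URL = "http://tiler:8080/run"          # hypothetical service endpoints
EXTRACTOR_URL = "http://extractor:8080/run"

def call_service(url: str, payload: dict) -> dict:
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()

def run_workflow(input_path: str, tile_count: int) -> list:
    # Step 1: sequential tiling of the raw LiDAR file.
    tiles = call_service(TILER_URL, {"input_path": input_path,
                                     "tile_count": tile_count})["outputs"]
    # Step 2: process every tile in parallel; each call could hit a
    # separately autoscaled container replica on the cloud backend.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(
            lambda t: call_service(EXTRACTOR_URL, {"tile": t}), tiles))
    return results

if __name__ == "__main__":
    print(run_workflow("survey.laz", tile_count=16))
```

The point of the sketch is the structural change the paper argues for: once cells become addressable services, the implicit notebook pipeline becomes an explicit graph whose independent steps can run concurrently instead of being forced through the notebook's sequential cell order.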