Datashim and Its Applications in Bioinformatics

Gkoufas, Yiannis; Yuan, David Yu; Pinto, Christian; Koutsovasilis, Panagiotis; Venugopal, Srikumar

doi:10.1007/978-3-030-90539-2_28

Yiannis Gkoufas¹²,
David Yu Yuan ORCID: orcid.org/0000-0003-1075-1628¹³,
Christian Pinto¹²,
Panagiotis Koutsovasilis¹² &
…
Srikumar Venugopal¹²

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12761)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Bioinformatics pipelines depend on shared POSIX filesystems for their input, output and intermediate data storage. Containerization makes it more difficult for workloads to access these shared filesystems. In our previous study, we were able to run both ML and non-ML pipelines on Kubeflow successfully. However, the storage solutions were complex and suboptimal.

In this article, we introduce the concept of a Dataset and its corresponding resource as a native Kubernetes object. We have implemented the concept in a new framework, Datashim, which takes care of the low-level details of data access in Kubernetes pods. Its pluggable architecture is designed to support the development of caching, scheduling and governance plugins. Together, they manage the entire lifecycle of the Dataset custom resource.
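
As a rough illustration of what treating a Dataset as a Kubernetes object could look like, the sketch below builds two manifests as plain Python dictionaries: one for a Dataset backed by an object-store bucket, and one for a pod that asks for that Dataset through labels. This is not taken from the paper. The API group/version, the `spec.local` field names and the `dataset.0.*` label keys are assumptions based on the public Datashim repository and should be checked against the installed version; the bucket, endpoint, secret and image names are hypothetical.

```python
import json

# Assumption: API group/version of the Dataset custom resource.
# Verify against the CRD installed in your cluster.
DATASET_API_VERSION = "datashim.io/v1alpha1"


def dataset_manifest(name, bucket, endpoint, secret_name):
    """Build a Dataset manifest as a plain dict (field names are assumptions)."""
    return {
        "apiVersion": DATASET_API_VERSION,
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {
            "local": {
                "type": "COS",               # assumed type tag for S3-style object storage
                "bucket": bucket,
                "endpoint": endpoint,
                "secret-name": secret_name,  # assumed: credentials kept in a Secret
                "readonly": "true",
            }
        },
    }


def pod_manifest(pod_name, image, dataset_name):
    """Build a pod manifest that requests a Dataset via labels (assumed keys)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": pod_name,
            "labels": {
                "dataset.0.id": dataset_name,  # which Dataset to attach
                "dataset.0.useas": "mount",    # expose it as a mounted volume
            },
        },
        "spec": {"containers": [{"name": "main", "image": image}]},
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    ds = dataset_manifest("genomes", "my-bucket", "https://s3.example.org", "genomes-creds")
    pod = pod_manifest("aligner", "example/aligner:latest", "genomes")
    print(json.dumps(ds, indent=2))
    print(json.dumps(pod, indent=2))
```

The point of the design is that the pod only names the Dataset; credentials, endpoints and mount details stay in the Dataset object and are handled by the framework.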

We use Datashim to serve data from object stores to both ML and non-ML pipelines on Kubeflow. We feed training data into ML models directly with Datashim instead of downloading it to local disks, which makes the input scalable. We have improved the durability of training metadata by storing it in a dataset, which also simplifies setting up TensorBoard independently of the notebook server. For the non-ML pipeline, we have simplified the 1000 Genomes Project pipeline with datasets injected into the pipeline dynamically. We have thus established a new resource type, Dataset, to represent the concept of a data source on Kubernetes, with our novel framework Datashim managing its lifecycle.
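
To make the idea of feeding training data directly from a dataset more concrete, here is a minimal sketch that reads files lazily, in small batches, from a directory where a dataset is assumed to be mounted. It is not the authors' code. The helper name `iter_batches` and the mount path in the usage note are hypothetical; only the standard library is used, so the same generator works on any POSIX-style path.

```python
from pathlib import Path
from typing import Iterator, List


def iter_batches(root: Path, batch_size: int, pattern: str = "*") -> Iterator[List[bytes]]:
    """Yield file contents under `root` in batches of at most `batch_size`.

    Only the list of paths is held in memory up front; file contents are
    read one batch at a time, so nothing is copied to local disk first.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[bytes] = []
    for path in sorted(root.rglob(pattern)):  # sorted for a reproducible order
        if path.is_file():
            batch.append(path.read_bytes())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

A training loop could then iterate over something like `iter_batches(Path("/mnt/datasets/images"), 32)`, where the path is an assumed mount point for a dataset named `images`, and decode each batch as it arrives.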

References

Yuan, D.Y., Wildish, T.: Bioinformatics application with Kubeflow for batch processing in clouds. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12321, pp. 355–367. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59851-8_24
Yuan, D.: Case study of porting a bioinformatics pipeline into clouds. In: RSEConUK 2019, University of Birmingham, 17–19 September 2019 (2019). https://sched.co/QSRc
Yuan, D.Y., Wildish, T.: Workflow platform for machine learning [version 1]. F1000Research 2020 9(ISCB Comm J), 822 (2020). https://doi.org/10.7490/f1000research.1118095.1
Kubernetes (2021). https://kubernetes.io/
Kubeflow (2021). https://www.kubeflow.org/docs/started/kubeflow-overview/
Persistent volume access modes in Kubernetes (2021). https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
Datashim (2021). https://github.com/datashim-io/datashim/
Kubernetes Container Storage Interface (CSI) Documentation (2021). https://kubernetes-csi.github.io/docs/
Kubeflow Pipelines SDK API reference (2021). https://kubeflow-pipelines.readthedocs.io/en/stable/
Notebook download microscopic images from IDR with Keras (2020). https://gitlab.ebi.ac.uk/TSI/kubeflow/-/blob/latest/notebooks/imgcls/gcp/IDR0042.classification.tf2.1.0.v3.timing.ipynb
OMERO 5.6.0 JSON API (2021). https://docs.openmicroscopy.org/omero/5.6.0/developers/json-api.html
Nirschl, J.J., et al.: A deep-learning classifier identifies patients with clinical heart failure using whole-slide images of H&E tissue (2018). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5882098/
IDR: Image Data Repository (2018). https://idr.openmicroscopy.org/webclient/?show=project-402
OneData (2021). https://onedata.org/
Persistent Volume Claim (2021). https://kubernetes.io/docs/concepts/storage/persistent-volumes/
Operator SDK (2020). https://sdk.operatorframework.io/
Ceph Pull Request (2020). https://github.com/ceph/ceph/pull/37212
Spectrum Scale (2021). https://www.ibm.com/products/spectrum-scale
NooBaa (2021). https://www.noobaa.io/

Acknowledgements

Datashim has received support as an incubation project of the Linux Foundation AI & Data Foundation. In addition, this project has received funding from the European Union’s Horizon 2020 research and innovation programme “evolve” under grant agreement No 825061. It is also supported by internal funding from the European Bioinformatics Institute, European Molecular Biology Laboratory. The authors would like to thank the funding agencies and organisations for their generous support.

Author information

Authors and Affiliations

IBM Research, Dublin, Ireland
Yiannis Gkoufas, Christian Pinto, Panagiotis Koutsovasilis & Srikumar Venugopal
Technology and Science Integration, European Bioinformatics Institute, European Molecular Biology Laboratory, Cambridge, United Kingdom
David Yu Yuan

Corresponding author

Correspondence to David Yu Yuan.

Editor information

Editors and Affiliations

University of Tennessee at Knoxville, Knoxville, TN, USA
Heike Jagode
Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
Hartwig Anzt
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Hatem Ltaief
University of Tennessee System, Knoxville, TN, USA
Piotr Luszczek

About this paper

Cite this paper

Gkoufas, Y., Yuan, D.Y., Pinto, C., Koutsovasilis, P., Venugopal, S. (2021). Datashim and Its Applications in Bioinformatics. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_28

DOI: https://doi.org/10.1007/978-3-030-90539-2_28
Published: 13 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2
eBook Packages: Computer Science, Computer Science (R0)
