Abstract
RNA sequencing has become an increasingly affordable way to profile gene expression. Here we introduce a scientific workflow that combines several open-source tools, executed with the Parsl parallel scripting library in a high-performance computing environment. We applied the workflow to a single-cardiomyocyte RNA-seq dataset retrieved from the Gene Expression Omnibus database. The workflow supports the analysis of raw RNA-seq data (alignment, quality control, read sorting and counting, and statistics generation) and the seamless integration of differential expression results into a configurable script. In this work, we compare executing the workflow on solid-state disks and on the Lustre file system, a critical decision for improving execution efficiency and resilience in current and upcoming RNA-Seq workflows. Based on profiling data collected for CPU and I/O, we demonstrate that anomalies in the performance of transcriptomics workflows can be correctly identified, which is essential for optimizing their use of high-performance computing systems. ParslRNA-Seq improved total execution time by up to 70% compared with its previous sequential implementation. Finally, the article discusses which workflow modeling modifications lead to improved computational performance and scalability, based on provenance data. ParslRNA-Seq is available at https://github.com/lucruzz/rna-seq.
Supported by CNPq.
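The stage structure named in the abstract (alignment, quality control, sorting, counting, statistics) can be illustrated with a minimal sketch. This is not the ParslRNA-Seq implementation: it uses Python's standard `concurrent.futures` as a stand-in for Parsl's futures-based execution. The stage order, function names, file-naming scheme, and sample names are all assumptions made for illustration, and each stage body is a placeholder rather than a call to a real tool.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stages: each returns the name of the file it would produce.
# The order of stages is assumed; the abstract only lists them.
def align(sample):
    return f"{sample}.bam"

def quality_control(bam):
    # Hypothetical QC step that passes the alignment through unchanged.
    return bam

def sort_reads(bam):
    return bam.replace(".bam", ".sorted.bam")

def count_reads(sorted_bam):
    return sorted_bam.replace(".sorted.bam", ".counts.txt")

def process_sample(sample):
    # Stages within one sample depend on each other, so they run in sequence.
    return count_reads(sort_reads(quality_control(align(sample))))

def run_workflow(samples, max_workers=4):
    # Samples are independent of each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {s: pool.submit(process_sample, s) for s in samples}
        counts = {s: f.result() for s, f in futures.items()}
    # Statistics step: runs only after every per-sample chain has finished.
    return {"n_samples": len(counts), "count_files": sorted(counts.values())}

if __name__ == "__main__":
    print(run_workflow(["sample_A", "sample_B"]))
```

The point of the sketch is the shape of the dependency graph: sequential stages per sample, parallelism across samples, and a final step that waits on all of them. In a Parsl-based workflow the same shape would be expressed with Parsl apps and the futures they return.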
Acknowledgement
The authors thank the National Laboratory for Scientific Computing (LNCC, Brazil) for providing resources on the Santos Dumont supercomputer, and the HPCProSol project (Next-generation HPC PROblems and SOLutions), a joint team (équipe associée) between Inria, in France, and LNCC, in Brazil.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ocaña, K. et al. (2022). ParslRNA-Seq: An Efficient and Scalable RNAseq Analysis Workflow for Studies of Differentiated Gene Expression. In: Navaux, P., Barrios H., C.J., Osthoff, C., Guerrero, G. (eds) High Performance Computing. CARLA 2022. Communications in Computer and Information Science, vol 1660. Springer, Cham. https://doi.org/10.1007/978-3-031-23821-5_13
DOI: https://doi.org/10.1007/978-3-031-23821-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23820-8
Online ISBN: 978-3-031-23821-5
eBook Packages: Computer Science, Computer Science (R0)