ParslRNA-Seq: An Efficient and Scalable RNAseq Analysis Workflow for Studies of Differentiated Gene Expression

Ocaña, Kary; Cruz, Lucas; Coelho, Micaella; Terra, Rafael; Galheigo, Marcelo; Carneiro, Andre; Carvalho, Diego; Gadelha, Luiz; Boito, Francieli; Navaux, Philippe; Osthoff, Carla

doi:10.1007/978-3-031-23821-5_13

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1660)

Included in the following conference series:

Latin American High Performance Computing Conference

Abstract

RNA sequencing has become an increasingly affordable way to profile gene expression. Here we introduce a scientific workflow that combines several open-source software tools, executed through the Parsl parallel scripting library in a high-performance computing environment. We applied the workflow to single-cardiomyocyte RNA-seq data retrieved from the Gene Expression Omnibus database. The workflow supports the analysis of raw RNA-seq data (alignment, quality control, read sorting and counting, and statistics generation) and the seamless integration of differential expression results into a configurable script. In this work, we compare executing the workflow on Solid State Disk and on Lustre, a critical decision for improving execution efficiency and resilience in current and upcoming RNA-Seq workflows. Based on profiling data collected for CPU and I/O, we demonstrate that anomalies in transcriptomics workflow performance can be correctly identified, which is essential for optimizing the workflow's use of high-performance computing systems. ParslRNA-Seq improved total execution time by up to 70% compared with its previous sequential implementation. Finally, the article discusses which workflow modeling modifications lead to improved computational performance and scalability, based on provenance data. ParslRNA-Seq is available at https://github.com/lucruzz/rna-seq.
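
The parallel pattern described in the abstract (independent per-sample stages followed by a step that needs all samples) can be illustrated with a minimal sketch. This is not the authors' code and it does not use Parsl: the standard-library `concurrent.futures` module stands in so that the example runs without extra dependencies. The function names `align`, `sort_reads`, `count_reads`, `process_sample`, and `run_workflow`, as well as the file-name conventions, are hypothetical placeholders; a real workflow would invoke external tools at each stage.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sample):
    # Hypothetical placeholder: a real stage would call an external aligner.
    return f"{sample}.bam"


def sort_reads(bam):
    # Hypothetical placeholder for sorting the aligned reads.
    return bam.replace(".bam", ".sorted.bam")


def count_reads(sorted_bam):
    # Hypothetical placeholder for counting reads per genomic feature.
    sample = sorted_bam.split(".")[0]  # assumes sample names contain no dots
    return {"sample": sample,
            "counts_file": sorted_bam.replace(".sorted.bam", ".counts.txt")}


def process_sample(sample):
    # Stages for one sample depend on each other, so they run in sequence.
    return count_reads(sort_reads(align(sample)))


def run_workflow(samples, max_workers=4):
    # Samples are independent of each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_sample = list(pool.map(process_sample, samples))  # keeps input order
    # Aggregation step: stands in for a downstream stage (for example,
    # differential expression) that needs the counts of every sample.
    return {r["sample"]: r["counts_file"] for r in per_sample}


if __name__ == "__main__":
    print(run_workflow(["control_1", "control_2", "treated_1", "treated_2"]))
```

In this sketch the per-sample chain is the unit of parallelism, and only the final aggregation waits for all samples to finish. How ParslRNA-Seq actually partitions its stages is described in the full chapter, not in this abstract.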

Supported by CNPq.

Acknowledgement

The authors thank the National Laboratory for Scientific Computing (LNCC), Brazil, for providing the resources of the Santos Dumont supercomputer, and the HPCProSol project (Next-generation HPC PROblems and SOLutions), a joint team (équipe associée) between Inria, in France, and LNCC, in Brazil.

Author information

Authors and Affiliations

National Laboratory of Scientific Computing, LNCC, Rio de Janeiro, Brazil
Kary Ocaña, Lucas Cruz, Micaella Coelho, Rafael Terra, Marcelo Galheigo, Andre Carneiro, Luiz Gadelha & Carla Osthoff
Federal Center for Technological Education Celso Suckow da Fonseca, CEFET-RJ, Rio de Janeiro, Brazil
Lucas Cruz & Diego Carvalho
Univ. Bordeaux, CNRS, Bordeaux INP, INRIA, LaBRI, Talence, France
Francieli Boito
Informatics Institute, Federal University of Rio Grande do Sul, UFRGS, Porto Alegre, Brazil
Philippe Navaux

Corresponding author

Correspondence to Kary Ocaña.

Editor information

Editors and Affiliations

Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Philippe Navaux
Universidad Industrial de Santander, Bucaramanga, Colombia
Carlos J. Barrios H.
Laboratório Nacional de Computação Científica, Petrópolis, Brazil
Carla Osthoff
Laboratorio Nacional de Computación de Alto Rendimiento, Santiago, Chile
Ginés Guerrero

About this paper

Cite this paper

Ocaña, K. et al. (2022). ParslRNA-Seq: An Efficient and Scalable RNAseq Analysis Workflow for Studies of Differentiated Gene Expression. In: Navaux, P., Barrios H., C.J., Osthoff, C., Guerrero, G. (eds) High Performance Computing. CARLA 2022. Communications in Computer and Information Science, vol 1660. Springer, Cham. https://doi.org/10.1007/978-3-031-23821-5_13

DOI: https://doi.org/10.1007/978-3-031-23821-5_13
Published: 21 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23820-8
Online ISBN: 978-3-031-23821-5
eBook Packages: Computer Science, Computer Science (R0)
