ParslRNA-Seq: An Efficient and Scalable RNAseq Analysis Workflow for Studies of Differentiated Gene Expression

  • Conference paper
High Performance Computing (CARLA 2022)

Abstract

RNA sequencing has become an increasingly affordable way to profile gene expression. Here we introduce a scientific workflow that combines several open-source software tools, executed with the Parsl parallel scripting library in a high-performance computing environment. We applied the workflow to a single-cardiomyocyte RNA-seq dataset retrieved from the Gene Expression Omnibus database. The workflow supports the analysis of raw RNA-seq data (alignment, quality control, read sorting and counting, and statistics generation) and the seamless integration of differential expression results into a configurable script. In this work, we compare executing the workflow on Solid State Disk and on Lustre, a choice that is critical for improving execution efficiency and resilience in current and upcoming RNA-Seq workflows. Based on profiling data collected for CPU and I/O, we demonstrate that anomalies in transcriptomics workflow performance can be correctly identified, which is essential for optimizing the workflow's use of high-performance computing systems. ParslRNA-Seq improved total execution time by up to 70% compared with its previous sequential implementation. Finally, the article discusses which workflow modeling modifications lead to improved computational performance and scalability, based on provenance data. ParslRNA-Seq is available at https://github.com/lucruzz/rna-seq.
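
The per-sample structure described in the abstract (alignment, sorting, quality control, and read counting, followed by a merge step before differential expression) can be illustrated with a minimal sketch. This is not the ParslRNA-Seq code: it uses Python's standard `concurrent.futures` as a stand-in for Parsl, which expresses the same pattern with decorated apps that return futures. The stage bodies are placeholders that only derive file names, and the function and sample names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stages: each only derives an output file name.
# In a real workflow each would invoke an external tool (assumption).

def align(sample: str) -> str:
    """Alignment stage: sample name -> alignment file name."""
    return f"{sample}.sam"

def sort_reads(sam: str) -> str:
    """Sorting stage: alignment file -> sorted BAM file name."""
    return sam.replace(".sam", ".sorted.bam")

def quality_check(bam: str) -> str:
    """QC stage: sorted BAM -> QC report file name."""
    return bam.replace(".sorted.bam", ".qc.txt")

def count_reads(bam: str) -> str:
    """Counting stage: sorted BAM -> count table file name."""
    return bam.replace(".sorted.bam", ".counts.txt")

def process_sample(sample: str) -> dict:
    """Run the dependent stages for one sample, in order."""
    sam = align(sample)
    bam = sort_reads(sam)
    return {
        "sample": sample,
        "qc": quality_check(bam),
        "counts": count_reads(bam),
    }

def run_workflow(samples, max_workers=4):
    """Process samples concurrently, then gather the count tables."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with samples.
        results = list(pool.map(process_sample, samples))
    # Merge point: differential expression needs every count table.
    count_tables = [r["counts"] for r in results]
    return results, count_tables

if __name__ == "__main__":
    # Hypothetical sample names, for illustration only.
    _, tables = run_workflow(["control_1", "control_2", "treated_1"])
    print(tables)
```

Samples are independent of one another until the merge step, so they can run concurrently, whereas the stages within one sample must run in order. How much this helps in practice depends on the storage backend, which is the comparison the paper makes between Solid State Disk and Lustre.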

Supported by CNPq.

Notes

  1. https://galaxyproject.org/

  2. https://sdumont.lncc.br/

  3. https://github.com/lucruzz/RNA-seq/blob/master/RNA-seq.py

  4. https://sfb1002.med.uni-goettingen.de/production/literature/publications/201

  5. https://www.ncbi.nlm.nih.gov/geo/

  6. http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

  7. http://www.htslib.org/doc/samtools.html

  8. http://broadinstitute.github.io/picard/

  9. https://htseq.readthedocs.io/

  10. https://bioconductor.org/packages/DESeq2/

  11. http://intel.ly/vtune-amplifier-xe

  12. https://www.mcs.anl.gov/research/projects/darshan/

  13. https://bioinfo.lncc.br/

References

  1. Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol. 11(R106) (2010). https://doi.org/10.1186/gb-2010-11-10-r106

  2. da Silva, R.F., Filgueira, R., Pietri, I., et al.: A characterization of workflow management systems for extreme-scale applications. Future Gener. Comput. Syst. 75, 228–238 (2017). https://doi.org/10.1016/j.future.2017.02.026

  3. Mattoso, M., Werner, C., Travassos, G., et al.: Towards supporting the life cycle of large-scale scientific experiments. Int. J. Bus. Process. Integr. Manag. 5, 79–92 (2010). https://doi.org/10.1504/IJBPIM.2010.033176

  4. Cruz, L., Coelho, M., Gadelha, L., et al.: Avaliação de Desempenho de um Workflow Científico para Experimentos de RNA-Seq no Supercomputador Santos Dumont. In: Anais Estendidos do XXI Simpósio em Sistemas Computacionais de Alto Desempenho, SBC 2020, pp. 86–93 (2020). https://doi.org/10.5753/wscad_estendido.2020.14093

  5. Liao, Y., Smyth, G., Shi, W.: featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30(7), 923–930 (2014). https://doi.org/10.1093/bioinformatics/btt656

  6. Anders, S., Pyl, P.T., Huber, W.: HTSeq-a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2), 166–169 (2014). https://doi.org/10.1093/bioinformatics/btu638

  7. Iyer, L., Nagarajan, S., Woelfer, M., et al.: A context-specific cardiac \(\beta \)-catenin and GATA4 interaction influences TCF7L2 occupancy and remodels chromatin driving disease progression in the adult heart. Nucleic Acids Res. 46(6), 2850–2867 (2018). https://doi.org/10.1093/nar/gky049

  8. Babuji, Y., Woodard, A., Li, Z., et al.: Parsl: pervasive parallel programming in Python. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing 2019, pp. 25–36 (2019). https://doi.org/10.48550/arXiv.1905.02158

  9. Cruz, L., Coelho, M., Galheigo, M., et al.: Parallel performance and I/O profiling of HPC RNA-Seq applications. Computación y Sistemas (2022, Submitted)

  10. Bez, J.L., Carneiro, A.R., Pavan, P., et al.: I/O performance of the Santos Dumont supercomputer. Int. J. High Perform. Comput. Appl. 34(2), 227–245 (2020). https://doi.org/10.1177/1094342019868526

  11. Mondelli, M.L., Magalhães, T., Loss, G., et al.: BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments. PeerJ 6, e5551 (2018). https://doi.org/10.7717/peerj.5551

  12. Wilde, M., Hategan, M., Wozniak, J.M., et al.: Swift: a language for distributed parallel scripting. Parallel Comput. 37(9), 633–652 (2011). https://doi.org/10.1016/j.parco.2011.05.005

  13. Goble, C., Soiland-Reyes, S., Bacall, F., et al.: Implementing FAIR digital objects in the EOSC-life workflow collaboratory. Zenodo 2(5), 99–110 (2021). https://doi.org/10.5281/zenodo.4605654

  14. Wratten, L., Wilm, A., Göke, J.: Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat. Methods 18, 1161–1168 (2021). https://doi.org/10.1038/s41592-021-01254-9

Acknowledgement

The authors thank the National Laboratory for Scientific Computing (LNCC, Brazil) for providing the resources of the Santos Dumont supercomputer, and the HPCProSol project (Next-generation HPC PROblems and SOLutions), a joint team (équipe associée) between Inria, in France, and LNCC, in Brazil.

Corresponding author

Correspondence to Kary Ocaña.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ocaña, K. et al. (2022). ParslRNA-Seq: An Efficient and Scalable RNAseq Analysis Workflow for Studies of Differentiated Gene Expression. In: Navaux, P., Barrios H., C.J., Osthoff, C., Guerrero, G. (eds) High Performance Computing. CARLA 2022. Communications in Computer and Information Science, vol 1660. Springer, Cham. https://doi.org/10.1007/978-3-031-23821-5_13

  • DOI: https://doi.org/10.1007/978-3-031-23821-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23820-8

  • Online ISBN: 978-3-031-23821-5

  • eBook Packages: Computer Science, Computer Science (R0)
