Abstract
Advances in massively parallel sequencing technologies (i.e., Next-Generation Sequencing) have made RNA sequencing (RNA-seq) possible. Analyzing RNA-seq data requires substantial computational resources and is very time-consuming. Processing is usually performed on a large set of samples, so it is convenient to design an automated pipeline that eliminates downtime. Such pipelines are an advantage; however, they are difficult to customize or to use outside the specific context for which they were tested.
In this paper, we propose FAPE (Flexible Automated Pipeline Engine), a software platform for configuring and deploying automated pipelines. FAPE models a pipeline from a given template, whose organization is easy to understand and manipulate, to meet the operator’s need for customization. In addition, a scientist can model a custom in-house pipeline that executes any tool offering a command-line interface (CLI). FAPE supports both parallel and iterative processes, so that whole datasets can be analyzed. We tested our solution on a pipeline for transcript-level quantification from RNA-seq, based on HISAT2, SAMtools, and StringTie. It exhibited high robustness as well as inherent flexibility in supporting any pipeline modeled to specification. Furthermore, it proved inexpensive in terms of memory and introduced no significant latency during execution, compared with a pipeline executed through a shell script. In addition, during the test, FAPE's parallel statement reduced the total elapsed time by \(\sim 6.5\%\).
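The idea of a template-driven pipeline that chains CLI tools and processes samples in parallel can be illustrated with a minimal Python sketch. It is not FAPE's template format or implementation, which the abstract does not specify: the `TEMPLATE` list, the placeholder names, the file-naming scheme, and the functions `build_commands`, `run_sample`, and `run_pipeline` are all assumptions made for illustration.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical template (not FAPE syntax): ordered CLI steps with placeholders.
# File names such as {sample}_1.fastq.gz are an assumed naming convention.
TEMPLATE = [
    "hisat2 --dta -x {index} -1 {sample}_1.fastq.gz -2 {sample}_2.fastq.gz -S {sample}.sam",
    "samtools sort -o {sample}.bam {sample}.sam",
    "stringtie {sample}.bam -G {annotation} -o {sample}.gtf",
]


def build_commands(sample, index, annotation):
    """Expand the template into one argument list per step for a single sample."""
    values = {"sample": sample, "index": index, "annotation": annotation}
    # Split each step into tokens first, then fill placeholders token by token,
    # so a value containing spaces stays a single argument.
    return [[tok.format(**values) for tok in shlex.split(step)] for step in TEMPLATE]


def run_sample(sample, index, annotation, runner=subprocess.run):
    """Run the steps of one sample sequentially; each step needs the previous output."""
    for argv in build_commands(sample, index, annotation):
        runner(argv, check=True)  # with subprocess.run, a failing tool raises
    return sample


def run_pipeline(samples, index, annotation, workers=2, runner=subprocess.run):
    """Iterate over the dataset, processing up to `workers` samples concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_sample, s, index, annotation, runner) for s in samples]
        return [f.result() for f in futures]
```

Threads are sufficient in this sketch because the work happens in the external processes. The `runner` parameter is there only so that the command sequence can be inspected without the tools installed, for example by passing a function that records each argument list instead of executing it.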
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cinaglia, P., Cannataro, M. (2022). A Flexible Automated Pipeline Engine for Transcript-Level Quantification from RNA-seq. In: Guizzardi, R., Neumayr, B. (eds) Advances in Conceptual Modeling. ER 2022. Lecture Notes in Computer Science, vol 13650. Springer, Cham. https://doi.org/10.1007/978-3-031-22036-4_5
DOI: https://doi.org/10.1007/978-3-031-22036-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22035-7
Online ISBN: 978-3-031-22036-4
eBook Packages: Computer Science, Computer Science (R0)