Abstract
Recent studies of evolution at the molecular level address two important issues: reconstructing the evolutionary relationships between species and investigating the forces that drive the evolutionary process. Both lines of research have grown explosively over the last two decades, driven by the massive generation of genomic data and by novel statistical methods and computational approaches for processing and analyzing it. Most experiments in molecular evolution rely on compute-intensive simulations that are preceded by other computational tools and post-processed by validation programs. These tools can be modeled as a scientific workflow, which improves experiment management and captures provenance data. However, evolutionary analysis experiments are complex and may run for weeks, so the workflows need to be executed in parallel in High Performance Computing (HPC) environments such as clouds. Clouds are increasingly adopted for bioinformatics experiments because of characteristics such as elasticity and availability, and they are evolving into HPC environments. In this paper, we introduce SciEvol, a bioinformatics scientific workflow for molecular evolution reconstruction that aims at inferring evolutionary relationships (i.e., detecting positive Darwinian selection) on genomic data. SciEvol is designed and implemented to execute in parallel in the cloud using the SciCumulus workflow engine. Our experiments show that SciEvol helps scientists by enabling the reconstruction of evolutionary relationships in a cloud environment. Compared with sequential execution, the parallel runs reduce execution time by up to 94.64%, from around 10 days to 12 hours.
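The reported figures can be related with a short calculation. The sketch below is only an illustration of the arithmetic, not part of SciEvol or SciCumulus: the function names are hypothetical, and "around 10 days" is assumed to mean exactly 240 hours. Under that assumption, a drop to 12 hours corresponds to roughly a 95% reduction, while the stated 94.64% corresponds to about 12.9 hours, so the two rounded figures are consistent.

```python
# Illustrative arithmetic only; function names are hypothetical and the
# 240-hour sequential baseline is an assumption ("around 10 days").

def improvement_pct(t_seq_hours: float, t_par_hours: float) -> float:
    """Percentage reduction in execution time relative to the sequential run."""
    return (1.0 - t_par_hours / t_seq_hours) * 100.0


def speedup(t_seq_hours: float, t_par_hours: float) -> float:
    """Ratio of sequential to parallel execution time."""
    return t_seq_hours / t_par_hours


def parallel_time_from_improvement(t_seq_hours: float, pct: float) -> float:
    """Parallel execution time implied by a given percentage reduction."""
    return t_seq_hours * (1.0 - pct / 100.0)


T_SEQ = 10 * 24.0  # assumed: "around 10 days" taken as exactly 240 hours
T_PAR = 12.0       # reported parallel execution time, in hours

if __name__ == "__main__":
    print(f"reduction for 240 h -> 12 h: {improvement_pct(T_SEQ, T_PAR):.2f}%")
    print(f"speedup for 240 h -> 12 h:   {speedup(T_SEQ, T_PAR):.2f}x")
    implied = parallel_time_from_improvement(T_SEQ, 94.64)
    print(f"time implied by 94.64%:      {implied:.2f} h")
```

The percentage reduction and the speedup describe the same result in two ways: a reduction of p percent is a speedup of 1 / (1 - p/100).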
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ocaña, K.A.C.S., de Oliveira, D., Horta, F., Dias, J., Ogasawara, E., Mattoso, M. (2012). Exploring Molecular Evolution Reconstruction Using a Parallel Cloud Based Scientific Workflow. In: de Souto, M.C., Kann, M.G. (eds) Advances in Bioinformatics and Computational Biology. BSB 2012. Lecture Notes in Computer Science, vol 7409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31927-3_16
DOI: https://doi.org/10.1007/978-3-642-31927-3_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31926-6
Online ISBN: 978-3-642-31927-3
eBook Packages: Computer Science, Computer Science (R0)