A Dynamic Cloud Dimensioning Approach for Parallel Scientific Workflows: a Case Study in the Comparative Genomics Domain

Journal of Grid Computing

Abstract

Scientists often need to run experiments that demand high-performance computing environments and parallel techniques. This is the case for many bioinformatics experiments modeled as scientific workflows, such as phylogenetic and phylogenomic analyses. To execute these experiments, scientists have adopted virtual machines (VMs) instantiated in clouds. Estimating the number of VMs to instantiate is a crucial task, because under- or overestimation harms execution performance and financial cost. In previous work, the number of VMs needed to execute bioinformatics workflows was estimated by a GRASP heuristic coupled to a cloud-based parallel Scientific Workflow Management System. Although that work was a step forward, it provided only a static dimensioning. If the characteristics of the environment change (e.g., processing capacity or network speed), the static dimensioning may no longer be suitable. It is therefore desirable to adjust the dimensioning at runtime. To achieve this, we developed a novel framework for monitoring and dynamically dimensioning resources during the execution of parallel scientific workflows in clouds, called Dynamic Dimensioning of Cloud Computing Framework (DDC-F). We evaluated DDC-F in real executions of bioinformatics workflows. The experiments showed that DDC-F efficiently calculates the number of VMs necessary to execute Comparative Genomics (CG) workflows and also reduces financial costs when compared with other approaches in the related literature.

Author information

Corresponding author

Correspondence to Rafaelli Coutinho.

About this article

Cite this article

Coutinho, R., Frota, Y., Ocaña, K. et al. A Dynamic Cloud Dimensioning Approach for Parallel Scientific Workflows: a Case Study in the Comparative Genomics Domain. J Grid Computing 14, 443–461 (2016). https://doi.org/10.1007/s10723-016-9367-x

  • DOI: https://doi.org/10.1007/s10723-016-9367-x
