Using Spark and GraphX to Parallelize Large-Scale Simulations of Bacterial Populations over Host Contact Networks

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2017)

Abstract

Large-scale population genetics studies are fundamental for the phylogenetic and epidemiological analysis of pathogens, and the validation of both the evolutionary models and the methods used in such studies depends on the analysis of large datasets. It is, however, unrealistic to work with large real datasets, as only rather small samples of the real pathogen population are available. Given the model complexity and the required population sizes, large-scale simulations are the only way to address this issue. In this paper we study how to efficiently parallelize such extensive simulations on top of Apache Spark, making use of both the MapReduce programming model and the GraphX API. We propose a simulation framework for large bacterial populations over host contact networks, implementing the Wright-Fisher model. The experimental evaluation shows that we can effectively speed up simulations. We also evaluate inherent parallelism limits, drawing conclusions on the relation between cluster computing power and simulation speedup.
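As a rough illustration of the Wright-Fisher model named in the abstract, the following single-machine sketch resamples a fixed-size population with replacement at each generation. It is not the paper's framework: it ignores the host contact network, Spark, and GraphX entirely, and the function names, the integer allele encoding, and the "every mutation creates a new allele" rule are assumptions made only for this example.

```python
import random
from collections import Counter


def wright_fisher_step(population, mutation_rate, rng, next_allele):
    """Build one non-overlapping generation of the same size as `population`.

    Each offspring copies a parent drawn uniformly at random (with
    replacement) and, with probability `mutation_rate`, receives a
    brand-new allele id instead (an assumption of this sketch).
    """
    offspring = []
    for _ in range(len(population)):
        allele = rng.choice(population)
        if rng.random() < mutation_rate:
            allele = next_allele
            next_allele += 1
        offspring.append(allele)
    return offspring, next_allele


def simulate(pop_size, generations, mutation_rate, seed=0):
    """Run the sketch from a population where everyone carries allele 0."""
    rng = random.Random(seed)
    population = [0] * pop_size
    next_allele = 1
    for _ in range(generations):
        population, next_allele = wright_fisher_step(
            population, mutation_rate, rng, next_allele
        )
    return population


if __name__ == "__main__":
    final = simulate(pop_size=1000, generations=50, mutation_rate=0.01, seed=42)
    # Show the most frequent alleles in the final generation.
    print(Counter(final).most_common(5))
```

Within one generation each offspring is drawn independently of the others, which is plausibly what makes this kind of simulation a candidate for map-style parallelization; how the paper actually distributes the work is not described in the abstract.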

Notes

  1. https://cloud.google.com/

Acknowledgments

This work was partly supported by DEI, IST, Universidade de Lisboa, and national funds through FCT – Fundação para a Ciência e Tecnologia, under projects TUBITACK/0004/2014, LISBOA-01-0145-FEDER-016394, PTDC/EEISII/5081/2014, PTDC/MAT/STA/3358/2014, and UID/CEC/500021/2013.

Author information

Correspondence to Andreia Sofia Teixeira.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Teixeira, A.S., Monteiro, P.T., Carriço, J.A., Santos, F.C., Francisco, A.P. (2017). Using Spark and GraphX to Parallelize Large-Scale Simulations of Bacterial Populations over Host Contact Networks. In: Ibrahim, S., Choo, KK., Yan, Z., Pedrycz, W. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2017. Lecture Notes in Computer Science, vol 10393. Springer, Cham. https://doi.org/10.1007/978-3-319-65482-9_44

  • DOI: https://doi.org/10.1007/978-3-319-65482-9_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-65481-2

  • Online ISBN: 978-3-319-65482-9

  • eBook Packages: Computer Science, Computer Science (R0)
