Using Spark and GraphX to Parallelize Large-Scale Simulations of Bacterial Populations over Host Contact Networks

Teixeira, Andreia Sofia; Monteiro, Pedro T.; Carriço, João A.; Santos, Francisco C.; Francisco, Alexandre P.

doi:10.1007/978-3-319-65482-9_44

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10393)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Large-scale population genetics studies are fundamental for phylogenetic and epidemiological analysis of pathogens, and the validation of both the evolutionary models and the methods used in such studies depends on the analysis of large datasets. Working with large real datasets is, however, unrealistic, as only rather small samples of the real pathogen population are available. Given model complexity and the required population sizes, large-scale simulations are the only way to address this issue. In this paper we study how to efficiently parallelize such extensive simulations on top of Apache Spark, making use of both the MapReduce programming model and the GraphX API. We propose a simulation framework for large bacterial populations over host contact networks, implementing the Wright-Fisher model. The experimental evaluation shows that we can effectively speed up simulations. We also evaluate inherent parallelism limits, drawing conclusions on the relation between cluster computing power and simulation speedup.
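To make the model named in the abstract concrete, the following is a minimal single-machine sketch of one neutral Wright-Fisher generation, not the framework described in the paper. It assumes a fixed population size, an infinite-alleles mutation scheme (every mutation creates a new integer allele), and no host contact network; the function names `wright_fisher_step` and `simulate` and all parameters are hypothetical and chosen only for illustration.

```python
import random


def wright_fisher_step(population, mutation_rate, rng, next_allele):
    """Produce the next generation of a neutral Wright-Fisher population.

    Each offspring picks its parent uniformly at random, with replacement,
    from the current generation. With probability `mutation_rate` the
    offspring receives a brand-new allele (infinite-alleles assumption).
    Returns the new population and the next unused allele identifier.
    """
    offspring = []
    for _ in range(len(population)):
        allele = rng.choice(population)  # sample a parent with replacement
        if rng.random() < mutation_rate:
            allele = next_allele  # mutation: assign a never-seen allele
            next_allele += 1
        offspring.append(allele)
    return offspring, next_allele


def simulate(size, generations, mutation_rate, seed=0):
    """Run the sketch for a number of non-overlapping generations.

    The population starts clonal (every individual carries allele 0) and
    keeps a constant size throughout, as in the Wright-Fisher model.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    population = [0] * size
    next_allele = 1
    for _ in range(generations):
        population, next_allele = wright_fisher_step(
            population, mutation_rate, rng, next_allele
        )
    return population
```

Calling `simulate(size=1000, generations=50, mutation_rate=0.01)` returns the list of alleles carried by the final generation. Within a generation each offspring is drawn independently of the others, which is what makes the step a natural candidate for a map-style parallel operation; how the paper actually distributes this work over Spark and GraphX is not shown here.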

Notes

1. https://cloud.google.com/

Acknowledgments

This work was partly supported by DEI, IST, Universidade de Lisboa, and national funds through FCT – Fundação para a Ciência e Tecnologia, under projects TUBITACK/0004/2014, LISBOA-01-0145-FEDER-016394, PTDC/EEISII/5081/2014, PTDC/MAT/STA/3358/2014, and UID/CEC/500021/2013.

Author information

Authors and Affiliations

INESC-ID Lisboa, Lisboa, Portugal
Andreia Sofia Teixeira, Pedro T. Monteiro, Francisco C. Santos & Alexandre P. Francisco
Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
Andreia Sofia Teixeira, Pedro T. Monteiro, Francisco C. Santos & Alexandre P. Francisco
Faculdade de Medicina, Instituto de Microbiologia and Instituto de Medicina Molecular, Universidade de Lisboa, Lisboa, Portugal
João A. Carriço

Corresponding author

Correspondence to Andreia Sofia Teixeira.

Editor information

Editors and Affiliations

Inria, Rennes, France
Shadi Ibrahim
University of Texas at San Antonio, San Antonio, Texas, USA
Kim-Kwang Raymond Choo
Aalto University, Espoo, Finland
Zheng Yan
University of Alberta, Edmonton, Alberta, Canada
Witold Pedrycz

About this paper

Cite this paper

Teixeira, A.S., Monteiro, P.T., Carriço, J.A., Santos, F.C., Francisco, A.P. (2017). Using Spark and GraphX to Parallelize Large-Scale Simulations of Bacterial Populations over Host Contact Networks. In: Ibrahim, S., Choo, KK., Yan, Z., Pedrycz, W. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2017. Lecture Notes in Computer Science, vol 10393. Springer, Cham. https://doi.org/10.1007/978-3-319-65482-9_44

DOI: https://doi.org/10.1007/978-3-319-65482-9_44
Published: 11 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65481-2
Online ISBN: 978-3-319-65482-9
eBook Packages: Computer Science, Computer Science (R0)
