Autonomous Orchestration of Distributed Discrete Event Simulations in the Presence of Resource Uncertainty

Published: 01 September 2015

Abstract

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of events and conditions provides a more nuanced model, but also increases its computational footprint. To manage these processing requirements in a scalable manner, discrete event simulations can be distributed across multiple computing resources. Orchestrating the simulations in a distributed setting involves coping with resource uncertainty. We consider three key aspects of resource uncertainty: resource failures, heterogeneity, and slowdowns. Each of these aspects is managed autonomously, which involves making accurate predictions of future execution times and latencies while also accounting for differences in hardware capabilities and dynamic resource consumption profiles. Further complicating matters, individual tasks within the simulation are stateful and stochastic, requiring inter-task communication and synchronization to produce accurate outcomes. We deal with these challenges through intelligent state collection and migration, active resource monitoring, and empirical evaluation of resource capabilities under changing conditions. To underscore the viability of our solution, we provide benchmarks using a production discrete event simulation that can simultaneously sustain failures, manage resource heterogeneity, and handle slowdowns while being orchestrated by our framework.
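The interplay of monitoring, prediction, and state migration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the author tags indicate that the framework predicts with neural networks, whereas this sketch substitutes a simple moving average of recent step times. The names `Resource`, `needs_migration`, `pick_target`, and `migrate`, the `speed` field, and the `slowdown_factor` threshold are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    speed: float                 # hypothetical relative capability (1.0 = baseline)
    alive: bool = True           # set to False when monitoring detects a failure
    history: list = field(default_factory=list)  # observed step times, in seconds

    def predicted_step_time(self, base_cost: float) -> float:
        """Predict the next step time.

        Assumption: a moving average over the last five observations stands in
        for the learned predictor; with no history, scale by hardware speed.
        """
        if self.history:
            recent = self.history[-5:]
            return sum(recent) / len(recent)
        return base_cost / self.speed


def needs_migration(resource, observed, base_cost, slowdown_factor=2.0):
    """Return True if the resource failed or ran far slower than predicted."""
    if not resource.alive:
        return True
    return observed > slowdown_factor * resource.predicted_step_time(base_cost)


def pick_target(resources, base_cost, exclude=None):
    """Choose the live resource with the lowest predicted step time."""
    candidates = [r for r in resources if r.alive and r is not exclude]
    if not candidates:
        raise RuntimeError("no live resource available")
    return min(candidates, key=lambda r: r.predicted_step_time(base_cost))


def migrate(checkpoint, target):
    """Copy a task's checkpointed state so it can resume on the target."""
    return {"resource": target.name, "state": dict(checkpoint)}
```

In this sketch, a task whose observed step time exceeds twice the prediction for its current resource would be checkpointed and handed to whichever live resource `pick_target` returns. A real orchestrator would also have to coordinate the inter-task synchronization that the abstract mentions, which is omitted here.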

Published In

ACM Transactions on Autonomous and Adaptive Systems, Volume 10, Issue 3
October 2015
204 pages
ISSN: 1556-4665
EISSN: 1556-4703
DOI: 10.1145/2819320

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2015
Accepted: 01 March 2015
Revised: 01 December 2014
Received: 01 July 2014
Published in TAAS Volume 10, Issue 3

Author Tags

  1. Fault tolerance
  2. checkpointing
  3. distributed discrete event simulation
  4. neural networks
  5. prediction

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • US Department of Homeland Security's Long Range program

Cited By

  • (2024) Enhancing the productivity of mould shop using continuous improvement tools and simulation. International Journal of Systems Science: Operations & Logistics 11:1. DOI: 10.1080/23302674.2024.2425014. Online publication date: 15-Nov-2024
  • (2021) Self-Adaptive Software Systems in Contested and Resource-Constrained Environments: Overview and Challenges. IEEE Access 9 (10711-10728). DOI: 10.1109/ACCESS.2020.3043440. Online publication date: 2021
  • (2019) Discrete-event simulation of a production process for increasing the efficiency of a newspaper production. IOP Conference Series: Materials Science and Engineering 495 (012026). DOI: 10.1088/1757-899X/495/1/012026. Online publication date: 7-Jun-2019
  • (2018) Scalable network analytics for characterization of outbreak influence in voluminous epidemiology datasets. Concurrency and Computation: Practice and Experience 31:7. DOI: 10.1002/cpe.4998. Online publication date: 22-Oct-2018
  • (2016) Network analysis for identifying and characterizing disease outbreak influence from voluminous epidemiology data. 2016 IEEE International Conference on Big Data (Big Data) (1222-1231). DOI: 10.1109/BigData.2016.7840726. Online publication date: Dec-2016
