Distributed Artificial Intelligent Model Training and Evaluation

  • Conference paper
  • In: High Performance Computing (CARLA 2021)

Abstract

Machine Learning (ML), and in particular Neural Networks (NNs), are currently used for image/video processing, speech recognition, and other tasks. The goal of a supervised NN is to classify raw input data according to the patterns learned from a training set. Training and validating an NN is very computationally intensive. In this paper we present an NN infrastructure that uses distributed systems techniques to accelerate model training, specifically the tuning of hyper-parameters, as well as model inference (prediction). By accelerating model training, we give researchers the ability to obtain and compare a large set of candidate models in a shorter amount of time. Automating this process not only reduces development time but also provides an easy means of comparing results across different classifiers and/or different hyper-parameters. Given a single set of training data, our application runs different classifiers on different servers, each training models with tweaked hyper-parameters. To give the user more control over the automation, the degree by which these hyper-parameters are tweaked can be set before running. The prediction step of most ML algorithms can also be very slow, especially in video prediction, where current systems compute inference over an entire input video and then evaluate accuracy against human annotations of the objects of interest within the video. To reduce this bottleneck, we also accelerate and distribute this important part of ML algorithm development. This process involves sending each server the data, the model weights, and the human annotations for its video segment. Our efficient distribution of input frames among the nodes greatly reduces the time taken to test and to generate accuracy metrics. To make our implementation robust to common distributed system failures (servers going down, loss of communication among nodes, and others), we use a heartbeat/gossip-style protocol for failure detection and recovery. We tested our infrastructure for fast testing and inference of ML on video, using data generated by a group of marine biologists researching the behavior of different deep-sea marine species. Results show that with our infrastructure, evaluation times improved by a factor of 15.
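The training-distribution scheme described above (a single shared training set; multiple servers, each training a model whose hyper-parameters are tweaked by a user-set degree) can be pictured with a short sketch. The Go snippet below is a minimal illustration under assumed names, not the paper's implementation: HyperParams, perturb, trainRemote, and the server list are all hypothetical, and a real deployment would replace trainRemote with an RPC to a training server.

package main

import (
	"fmt"
	"sync"
)

// HyperParams is one candidate configuration to train.
// (Hypothetical type; a real configuration would be richer.)
type HyperParams struct {
	LearningRate float64
	HiddenUnits  int
}

// perturb derives n variants of base, scaled by degree, the
// user-set knob mentioned in the abstract.
func perturb(base HyperParams, degree float64, n int) []HyperParams {
	variants := make([]HyperParams, 0, n)
	for i := 1; i <= n; i++ {
		variants = append(variants, HyperParams{
			LearningRate: base.LearningRate * (1 + degree*float64(i)),
			HiddenUnits:  base.HiddenUnits + i,
		})
	}
	return variants
}

// trainRemote stands in for the RPC that would ship the shared
// training set plus one configuration to a worker server and
// return that model's validation accuracy.
func trainRemote(worker string, hp HyperParams) float64 {
	fmt.Printf("%s: training lr=%.4f, hidden=%d\n", worker, hp.LearningRate, hp.HiddenUnits)
	return 0.0 // placeholder; a real worker reports measured accuracy
}

func main() {
	workers := []string{"server-0", "server-1", "server-2"}
	variants := perturb(HyperParams{LearningRate: 0.01, HiddenUnits: 64}, 0.25, 6)

	// Fan the candidate models out round-robin, one goroutine per
	// candidate, and wait for every server to finish.
	var wg sync.WaitGroup
	for i, hp := range variants {
		wg.Add(1)
		go func(w string, hp HyperParams) {
			defer wg.Done()
			trainRemote(w, hp)
		}(workers[i%len(workers)], hp)
	}
	wg.Wait()
}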
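The heartbeat/gossip-style failure detection the abstract mentions can be sketched the same way. The Go snippet below shows only the general shape of such a detector (per-node heartbeat counters merged during gossip rounds, with a timeout marking suspects); the types, field names, and 10-second timeout are assumptions for illustration, not taken from the paper.

package main

import (
	"fmt"
	"time"
)

// entry records the highest heartbeat counter seen for a node and
// the local time at which it last advanced.
type entry struct {
	counter  int
	lastSeen time.Time
}

// table is one node's local view of the cluster's heartbeats.
type table map[string]entry

// merge folds a gossiped peer table into the local one: a higher
// counter is fresher news about that node, so it refreshes the
// node's liveness timestamp.
func (t table) merge(remote table, now time.Time) {
	for node, r := range remote {
		if local, ok := t[node]; !ok || r.counter > local.counter {
			t[node] = entry{counter: r.counter, lastSeen: now}
		}
	}
}

// suspects returns nodes whose counter has not advanced within the
// timeout; a coordinator would reassign their frames or models.
func (t table) suspects(now time.Time, timeout time.Duration) []string {
	var down []string
	for node, e := range t {
		if now.Sub(e.lastSeen) > timeout {
			down = append(down, node)
		}
	}
	return down
}

func main() {
	now := time.Now()
	local := table{
		"node-a": {counter: 5, lastSeen: now.Add(-15 * time.Second)},
		"node-b": {counter: 3, lastSeen: now.Add(-2 * time.Second)},
	}
	// One gossip round: a peer reports a newer heartbeat for node-b
	// but has heard nothing newer from node-a.
	local.merge(table{"node-b": {counter: 4}}, now)
	fmt.Println("suspected failed:", local.suspects(now, 10*time.Second))
	// Prints: suspected failed: [node-a]
}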

Author information

Correspondence to Maria Pantoja.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Monahan, C., Garcia, A., Zhang, E., Timokhin, D., Egbert, H., Pantoja, M. (2022). Distributed Artificial Intelligent Model Training and Evaluation. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_12

  • DOI: https://doi.org/10.1007/978-3-031-04209-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04208-9

  • Online ISBN: 978-3-031-04209-6

  • eBook Packages: Computer Science, Computer Science (R0)
