Distributed Artificial Intelligent Model Training and Evaluation

  • Conference paper
  • In: High Performance Computing (CARLA 2021)

Abstract

Machine Learning (ML), and in particular Neural Networks (NNs), are currently used for image/video processing, speech recognition, and other tasks. The goal of a supervised NN is to classify raw input data according to the patterns learned from a training set. Training and validating an NN is very computationally intensive. In this paper we present an NN infrastructure that uses distributed systems techniques to accelerate model training, specifically the tuning of hyper-parameters, as well as model inference (prediction). By accelerating model training, we give researchers the ability to obtain and compare a large set of candidate models in a shorter amount of time. Automating this process not only reduces development time but also provides an easy means of comparing results across different classifiers and/or different hyper-parameters. Given a single set of training data, our application runs different classifiers on different servers, each training models with tweaked hyper-parameters. To give the user more control over the automation, the degree by which these hyper-parameters are tweaked can be set before running. The prediction step of most ML algorithms can also be very slow, especially in video prediction, where current systems compute inference over an entire input video and then evaluate accuracy against human annotations of the objects of interest within the video. To reduce this bottleneck, we also accelerate and distribute this important part of ML algorithm development. This process involves sending each server the data, the model weights, and the human annotations for its video segment. Our efficient distribution of input frames among the nodes greatly reduces the time taken to test and to generate accuracy metrics. To make our implementation robust to common distributed system failures (servers going down, loss of communication among nodes, and others), we use a heartbeat/gossip-style protocol for failure detection and recovery. We tested our infrastructure for fast testing and inference of ML on video, using data generated by a group of marine biologists researching the behavior of different deep-sea marine species. Results show that with our infrastructure, evaluation times improved by a factor of 15.
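The training-distribution scheme described above (a single shared training set; multiple servers, each training a model whose hyper-parameters are tweaked by a user-set degree) can be pictured with a short sketch. The Go snippet below is a minimal illustration under assumed names, not the paper's implementation: HyperParams, perturb, trainRemote, and the server list are all hypothetical, and a real deployment would replace trainRemote with an RPC to a training server.

package main

import (
	"fmt"
	"sync"
)

// HyperParams is one candidate configuration to train.
// (Hypothetical type; a real configuration would be richer.)
type HyperParams struct {
	LearningRate float64
	HiddenUnits  int
}

// perturb derives n variants of base, scaled by degree, the
// user-set knob mentioned in the abstract.
func perturb(base HyperParams, degree float64, n int) []HyperParams {
	variants := make([]HyperParams, 0, n)
	for i := 1; i <= n; i++ {
		variants = append(variants, HyperParams{
			LearningRate: base.LearningRate * (1 + degree*float64(i)),
			HiddenUnits:  base.HiddenUnits + i,
		})
	}
	return variants
}

// trainRemote stands in for the RPC that would ship the shared
// training set plus one configuration to a worker server and
// return that model's validation accuracy.
func trainRemote(worker string, hp HyperParams) float64 {
	fmt.Printf("%s: training lr=%.4f, hidden=%d\n", worker, hp.LearningRate, hp.HiddenUnits)
	return 0.0 // placeholder; a real worker reports measured accuracy
}

func main() {
	workers := []string{"server-0", "server-1", "server-2"}
	variants := perturb(HyperParams{LearningRate: 0.01, HiddenUnits: 64}, 0.25, 6)

	// Fan the candidate models out round-robin, one goroutine per
	// candidate, and wait for every server to finish.
	var wg sync.WaitGroup
	for i, hp := range variants {
		wg.Add(1)
		go func(w string, hp HyperParams) {
			defer wg.Done()
			trainRemote(w, hp)
		}(workers[i%len(workers)], hp)
	}
	wg.Wait()
}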
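The heartbeat/gossip-style failure detection the abstract mentions can be sketched the same way. The Go snippet below shows only the general shape of such a detector (per-node heartbeat counters merged during gossip rounds, with a timeout marking suspects); the types, field names, and 10-second timeout are assumptions for illustration, not taken from the paper.

package main

import (
	"fmt"
	"time"
)

// entry records the highest heartbeat counter seen for a node and
// the local time at which it last advanced.
type entry struct {
	counter  int
	lastSeen time.Time
}

// table is one node's local view of the cluster's heartbeats.
type table map[string]entry

// merge folds a gossiped peer table into the local one: a higher
// counter is fresher news about that node, so it refreshes the
// node's liveness timestamp.
func (t table) merge(remote table, now time.Time) {
	for node, r := range remote {
		if local, ok := t[node]; !ok || r.counter > local.counter {
			t[node] = entry{counter: r.counter, lastSeen: now}
		}
	}
}

// suspects returns nodes whose counter has not advanced within the
// timeout; a coordinator would reassign their frames or models.
func (t table) suspects(now time.Time, timeout time.Duration) []string {
	var down []string
	for node, e := range t {
		if now.Sub(e.lastSeen) > timeout {
			down = append(down, node)
		}
	}
	return down
}

func main() {
	now := time.Now()
	local := table{
		"node-a": {counter: 5, lastSeen: now.Add(-15 * time.Second)},
		"node-b": {counter: 3, lastSeen: now.Add(-2 * time.Second)},
	}
	// One gossip round: a peer reports a newer heartbeat for node-b
	// but has heard nothing newer from node-a.
	local.merge(table{"node-b": {counter: 4}}, now)
	fmt.Println("suspected failed:", local.suspects(now, 10*time.Second))
	// Prints: suspected failed: [node-a]
}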

Author information

Correspondence to Maria Pantoja.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Monahan, C., Garcia, A., Zhang, E., Timokhin, D., Egbert, H., Pantoja, M. (2022). Distributed Artificial Intelligent Model Training and Evaluation. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_12

  • DOI: https://doi.org/10.1007/978-3-031-04209-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04208-9

  • Online ISBN: 978-3-031-04209-6

  • eBook Packages: Computer Science, Computer Science (R0)
