Abstract
Machine learning (ML), and in particular neural networks (NNs), are currently used for image and video processing, speech recognition, and other tasks. The goal of a supervised NN is to classify raw input data according to patterns learned from a training set. Training and validating NNs is very computationally intensive. In this paper we present an NN infrastructure that uses distributed-systems techniques to accelerate model training, specifically the tuning of hyper-parameters, and model inference (prediction). By accelerating model training, we enable researchers to obtain and compare a large set of candidate models in a shorter amount of time. Automating this process not only reduces development time but also provides an easy means of comparing results across different classifiers and different hyper-parameter settings. Given a single set of training data, our application runs different classifiers on different servers, each training a model with tweaked hyper-parameters. To give users more control over the automation, the degree by which these hyper-parameters are tweaked can be set prior to running. The prediction step of most ML algorithms can also be very slow, especially for video, where current systems compute inference over an entire input video and then evaluate accuracy against human annotations of objects of interest within the video. To reduce this bottleneck, we also accelerate and distribute this important part of ML algorithm development. This process sends each server the data, the model weights, and the human annotations for its video segment. Our efficient distribution of input frames among the nodes greatly reduces the time needed for testing and for generating accuracy metrics.
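The hyper-parameter sweep described above can be sketched in Go (the language of the cited implementation). This is a minimal illustration, not the paper's actual code: the `Config` struct, the grid values, and the round-robin `assign` helper are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// Config is a hypothetical hyper-parameter setting to be trained on one server.
type Config struct {
	LearningRate float64
	HiddenUnits  int
}

// assign distributes configs across n workers round-robin, mirroring the idea
// of running each tweaked model on a different server.
func assign(configs []Config, n int) [][]Config {
	buckets := make([][]Config, n)
	for i, c := range configs {
		buckets[i%n] = append(buckets[i%n], c)
	}
	return buckets
}

func main() {
	var configs []Config
	// Sweep a small grid; the spacing of these values stands in for the
	// user-set degree by which hyper-parameters are tweaked.
	for _, lr := range []float64{0.001, 0.01, 0.1} {
		for _, h := range []int{64, 128} {
			configs = append(configs, Config{LearningRate: lr, HiddenUnits: h})
		}
	}
	for w, bucket := range assign(configs, 3) {
		fmt.Printf("worker %d trains %d configs\n", w, len(bucket))
	}
}
```

Each bucket would then be shipped to its server along with the shared training data, so all models train concurrently.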
To make our implementation robust to common distributed-system failures (servers going down, loss of communication between nodes, and others), we use a heartbeat/gossip-style protocol for failure detection and recovery. We tested our infrastructure for fast testing and inference of ML models on video data generated by a group of marine biologists researching the behavior of different marine species in the deep sea. Results show that with our infrastructure, processing times improved by a factor of 15.
© 2022 Springer Nature Switzerland AG
Cite this paper
Monahan, C., Garcia, A., Zhang, E., Timokhin, D., Egbert, H., Pantoja, M. (2022). Distributed Artificial Intelligent Model Training and Evaluation. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_12
Print ISBN: 978-3-031-04208-9
Online ISBN: 978-3-031-04209-6