Abstract:
Machine learning (ML) services are vital to providing timely, intelligent decision-making capabilities in Zero-touch Networks (ZTN). However, such services rely on efficient distributed ML model training, which adds both computing and network overhead. Distributed ML training jobs must compute a huge number of model parameters and exchange them over reliable network service function chains (SFCs), which requires balanced computing loads and increases bandwidth consumption. It is thus important to optimize training job placement and SFC orchestration to increase the efficiency of model training and reduce bandwidth consumption. Since the two problems are related, they need to be studied together. In this paper, an integer linear programming model is formulated to jointly optimize both problems. We then propose a heuristic algorithm to optimize the training job load balancing rate and network resource consumption. We evaluate the performance of our algorithm on realistic Google cluster-usage traces, with results showing that our algorithm is 18 times faster than the traditional algorithm. Our study shows that the proposed algorithm achieves near-optimal performance, deviating by no more than 3% from the optimum.
Published in: 2023 IEEE Globecom Workshops (GC Wkshps)
Date of Conference: 04-08 December 2023
Date Added to IEEE Xplore: 21 March 2024