EAIS: Energy-aware adaptive scheduling for CNN inference on high-performance GPUs
Introduction
Since AlexNet [1] won the ImageNet Large Scale Visual Recognition Challenge [2] in 2012 and reduced the error rate by roughly 10 percentage points, deep neural networks (DNNs) have achieved success in a wide variety of applications, including computer vision [3], [4], speech recognition [5], [6], natural language processing [7], [8], and autonomous driving [9], [10]. These successes have promoted the development of Machine-Learning-as-a-Service [11]. Deep learning has two phases: training and inference. The training phase builds DNN models from existing data, and the inference phase uses the pre-trained models to provide prediction services. The optimization goals of the two phases are very different. The goal of the training phase is to obtain higher accuracy, whereas the inference phase is closer to end-users and its goal is strongly tied to the actual application. For example, pedestrian detection in autonomous vehicles (e.g., Tesla Autopilot [12]) has a hard latency deadline, and missing this deadline may lead to serious traffic accidents. Interactive tasks such as translation tools and speech recognition (e.g., Google Translate and Cortana [13]) can tolerate some delay, but if the response time exceeds a certain threshold, the user experience degrades severely.
With their many-core architecture, Graphics Processing Units (GPUs) are very effective at reading, writing, and processing high-dimensional tensor data in parallel, and can easily achieve speedups of dozens of times over CPUs. However, GPUs consume substantial power, and the GPU clusters supporting the deep learning industry consume enormous amounts of energy. Facebook reports that deep learning inference services answer billions of queries per day and play a greater role than training in its data centers [14]. NVIDIA estimates that 80%–90% of the cost in the artificial intelligence business lies in inference processing [15]. A more immediate concern is that data centers already use more than 200 terawatt-hours of electricity each year, and this figure is still growing [16]. There is therefore a pressing need to decrease the energy consumption of deep learning inference services.
To decrease the energy consumption of convolutional neural network (CNN) inference services on high-performance GPUs, batching and dynamic voltage and frequency scaling (DVFS) are two commonly used techniques. Fig. 1 shows TensorRT [17] performance (i.e., batch execution time and energy per request) under different batch sizes and GPU core frequencies for ResNet-50 running on a Tesla M40 GPU. On the one hand, batching effectively improves throughput and decreases energy consumption. Intuitively, a single inference request cannot fully utilize GPU resources, whereas batching improves GPU memory utilization and the parallelization efficiency of matrix multiplication, thereby improving the energy efficiency of inference services [18], [19]. However, batching comes at the cost of increased execution time, because running a batch of requests takes longer than running a single request, which may violate the latency Service-Level Objective (SLO).
On the other hand, DVFS is a technique that trades off execution time against power consumption [20], [21], [22]. The GPU core frequency drives the arithmetic and logic units and thus determines the execution speed of the streaming multiprocessors. Energy consumption follows a U-shaped trend as the GPU core frequency increases, reaching its lowest point at an intermediate frequency. This is because energy consumption is the product of execution time and power. Decreasing the GPU core frequency reduces the dynamic power, as shown in Eq. (1):

$P_{dynamic} = \alpha C V^{2} f$   (1)

where $f$ is the GPU core frequency, $V$ is the supply voltage, $C$ is the load capacitance, and $\alpha$ is the switching activity factor. At intermediate frequencies, power decreases faster than execution time increases, so energy consumption drops. However, decreasing the GPU core frequency also increases the execution time of GPU kernel functions, as shown in Eq. (2):

$t_{kernel} = \dfrac{N_{cycles}}{f}$   (2)

where $N_{cycles}$ is the number of cycles a kernel executes, which may violate the latency SLO.
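To make the U-shape concrete, Eqs. (1) and (2) can be combined with a static power term into a first-order energy model. The derivation below is a standard textbook approximation, not the paper's own formulation; the static power $P_{static}$ and the near-linear voltage-frequency scaling $V \propto f$ are assumptions.

$$
E(f) = \left(P_{static} + \alpha C V^{2} f\right)\frac{N_{cycles}}{f}
     = \frac{P_{static}\,N_{cycles}}{f} + \alpha C V^{2} N_{cycles}.
$$

With $V \propto f$, the dynamic term grows as $f^{2}$ while the static term shrinks as $1/f$:

$$
E(f) \approx \frac{P_{static}\,N_{cycles}}{f} + \beta N_{cycles} f^{2},
\qquad
\frac{dE}{df} = 0 \;\Rightarrow\; f^{*} = \left(\frac{P_{static}}{2\beta}\right)^{1/3},
$$

so energy is minimized at an intermediate frequency $f^{*}$, matching the U-shaped trend in Fig. 1.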
Most previous works [23], [24], [25], [26], [27] typically perform only batching or only DVFS, which is not enough to minimize energy consumption. A typical system that relies on batching is Clipper [23], a general-purpose low-latency online prediction serving system that adjusts the batch size dynamically with an additive-increase-multiplicative-decrease (AIMD) scheme. However, under burst workloads many requests miss the latency SLO. Nanily [24] and BatchSizer [25] adopt adaptive batching to schedule CNN inference requests, always applying the largest batch size possible. PIT [27] coordinates DVFS and precision to trade off accuracy against power consumption within a given response time. Although PIT decreases energy consumption, it does so at the expense of accuracy.
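The AIMD batch-size scheme can be summarized in a few lines. The sketch below is our own illustration of the general idea, not Clipper's code; `measure_latency` stands in for a profiled or measured per-batch latency probe and is hypothetical.

```python
def aimd_batch_size(measure_latency, slo_ms, max_batch=1024,
                    additive_step=1, backoff=0.9):
    """Additive-increase-multiplicative-decrease batch sizing (Clipper-style).

    measure_latency(batch_size) -> observed latency in ms (hypothetical probe).
    Grows the batch size by `additive_step` while the SLO holds, then scales
    it down by `backoff` on the first violation.
    """
    batch = 1
    while batch + additive_step <= max_batch:
        if measure_latency(batch + additive_step) <= slo_ms:
            batch += additive_step                 # additive increase
        else:
            batch = max(1, int(batch * backoff))   # multiplicative decrease
            break
    return batch
```

In a real serving loop this control step would be re-run continuously, which is exactly why a fixed feedback rule struggles on burst workloads: the batch size lags behind sudden rate changes.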
Besides, most batching systems maintain the request queue on the CPU side [23], [28]. Fig. 2 shows the detailed processing of batched inference requests: ① multiple inference requests are organized into a batch; ② the input data of the whole batch are uploaded to the GPU together via PCI-e; ③ the GPU processes the batch; ④ the results of all requests are returned to the CPU together. However, this batching mechanism increases the fixed energy consumption of the GPU and decreases the system processing capacity, because the GPU sits idle while data are transferred between the CPU and the GPU. Modern GPUs support asynchronous execution [29], which makes it possible to overlap data transmission with computation. Ideally, a well-tuned asynchronous execution strategy increases system throughput by 5%–40% and decreases energy consumption by 1%–10%, as shown in Fig. 6. However, simply overlapping data transmission and computation is not enough, because the request rate of real workloads changes dynamically. Batching and DVFS must be coordinated across adjacent batches according to the workload to improve the system processing capacity.
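As an illustration of step ② overlapping step ③, the following PyTorch sketch uploads batch i+1 on a dedicated CUDA stream while the GPU executes batch i. It is a minimal example of the general technique, not EAIS's implementation; the model, batch shape, and batch count are placeholders.

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()
copy_stream = torch.cuda.Stream()   # dedicated stream for HtoD copies
# Pinned host memory is required for truly asynchronous HtoD transfers.
batches = [torch.randn(16, 3, 224, 224).pin_memory() for _ in range(8)]

def upload_async(host_batch):
    """Issue an HtoD copy on the side stream; return the device tensor
    and an event that fires when the copy completes."""
    with torch.cuda.stream(copy_stream):
        dev = host_batch.to("cuda", non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)
    return dev, done

results = []
with torch.no_grad():
    dev, done = upload_async(batches[0])
    for i in range(len(batches)):
        torch.cuda.current_stream().wait_event(done)  # batch i's data ready
        cur = dev
        cur.record_stream(torch.cuda.current_stream())  # allocator safety
        if i + 1 < len(batches):
            dev, done = upload_async(batches[i + 1])  # overlap next upload
        out = model(cur)            # compute batch i while batch i+1 uploads
        results.append(out.cpu())   # DtoH (synchronous here for simplicity)
```

Because the copy runs on its own stream against pinned memory, the PCI-e transfer of the next batch proceeds while the streaming multiprocessors are busy with the current one, removing the idle gap shown in Fig. 3.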
Our goal is to minimize energy consumption while meeting latency SLO for CNN inference services. However, finding effective solutions is challenging for the following reasons.
(1) Providing valid information about performance characteristics is challenging. When coordinating batching and DVFS settings, the inference scheduling framework must consider the impact of both batch size and GPU core frequency on performance, so the configuration space is large. For example, a Tesla V100 GPU offers 187 core frequency levels, and the maximum batch size of ResNet-50 reaches 1024, yielding more than 190 thousand possible combinations. It is unrealistic to exhaustively search all configurations offline for the most energy-efficient one that still meets the response time requirement. A commonly used remedy is to profile only values that differ significantly from each other to shrink the configuration space, but this introduces large errors. The missing information must be complemented from sampled data, which poses significant challenges for profiling and modeling.
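A common way to tame this space is to profile a coarse grid of configurations and fill in the rest by modeling. The sketch below shows one hypothetical way to collect such a grid with pynvml; the grid spacing, the ResNet-50 probe, and the single power sample per point are our simplifying assumptions, and locking application clocks requires root privileges.

```python
import itertools
import time

import pynvml
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()

def run_batch(batch_size):
    """Probe: time one ResNet-50 forward pass (stand-in for the real service)."""
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0   # ms

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Profile a coarse grid instead of all ~190k (frequency, batch) pairs.
mem_clock = pynvml.nvmlDeviceGetSupportedMemoryClocks(handle)[0]
all_freqs = pynvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_clock)
freqs = sorted(all_freqs)[::20]                      # every 20th level
batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

profile = {}
for freq, batch in itertools.product(freqs, batch_sizes):
    # Lock GPU core clocks (needs root; persistence mode recommended).
    pynvml.nvmlDeviceSetApplicationsClocks(handle, mem_clock, freq)
    latency_ms = run_batch(batch)
    # Single sample for brevity; averaging during the run is more accurate.
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
    energy_per_req = power_w * (latency_ms / 1000.0) / batch    # J/request
    profile[(freq, batch)] = (latency_ms, energy_per_req)

pynvml.nvmlDeviceResetApplicationsClocks(handle)
pynvml.nvmlShutdown()
```

The resulting `profile` table is the kind of sampled data a performance model would then interpolate to cover the unprofiled configurations.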
(2) Overlapping data transmission and computation between adjacent batches is challenging. Fig. 3 shows the execution process of adjacent batches on the GPU, where "HtoD" (Host-to-Device) and "DtoH" (Device-to-Host) indicate moving data from CPU memory to GPU memory and from GPU memory to CPU memory, respectively. Different batches are processed sequentially, meaning the data of the next batch are not uploaded to the GPU until the results of the current batch have been returned to the CPU. The GPU is idle while data are transferred between the CPU and the GPU, which not only leaves the GPU underutilized but also increases its static energy consumption. In addition, the DVFS setting affects not only the GPU execution time and energy consumption of the current batch, but also the upload time of the next batch.
(3) Dealing with fluctuating workloads is challenging. In real inference services, requests usually arrive in a stochastic and bursty manner [30], [31], [32]. This means that the number of requests per unit time fluctuates with user demand, as shown in Fig. 8 and Fig. 9. Consider, for instance, a web-serving application that employs CNN models to generate query results: the arrival rate of queries is random and unpredictable, yet the application's response time must still meet the latency SLO. It is therefore challenging to deploy a scheduling strategy with adaptive ability.
In this paper, we propose EAIS (Energy-Aware Inference Scheduling), an energy-aware adaptive scheduling framework that addresses the above challenges. EAIS comprises a performance model, an asynchronous execution strategy, and an energy-aware scheduler. The performance model provides valid information about the performance characteristics of CNN inference services, which is a prerequisite for running EAIS. The asynchronous execution strategy overlaps data upload with GPU execution and builds a model to capture the relationship between request rate and latency. The energy-aware scheduler adapts an energy-efficiency greedy algorithm to coordinate batching and DVFS under fluctuating workloads. Our main contributions are as follows.
- A comprehensive analysis of performance characteristics. We build a performance model to shrink the feasible configuration space.
- An asynchronous execution strategy. We overlap data upload and GPU execution to improve the system processing capacity.
- An energy-aware adaptive scheduling policy. We design an energy-aware scheduler based on a greedy algorithm to minimize energy consumption while meeting the latency SLO (a minimal sketch of the selection step follows this list).
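To make the greedy selection concrete, the sketch below picks, from a profiled table like the one built earlier, the configuration with the lowest energy per request whose latency still fits the remaining SLO budget. It illustrates the general idea only, not EAIS's exact algorithm, which additionally coordinates adjacent batches; the `profile` table format and the queue-wait budget are assumptions carried over from the earlier sketch.

```python
def pick_config(profile, slo_ms, queue_wait_ms):
    """Greedy energy-aware selection (illustrative, not EAIS's exact algorithm).

    profile: {(core_freq_mhz, batch_size): (latency_ms, energy_per_request_j)}
    Returns the feasible (frequency, batch) pair with minimal energy/request.
    """
    budget = slo_ms - queue_wait_ms      # time left before the SLO expires
    feasible = [(energy, cfg)
                for cfg, (latency, energy) in profile.items()
                if latency <= budget]
    if not feasible:
        # No configuration fits: fall back to the fastest one to limit misses.
        return min(profile, key=lambda cfg: profile[cfg][0])
    return min(feasible)[1]              # lowest energy per request
```

For example, with a 200 ms SLO and 40 ms already spent queuing, only configurations whose batch latency fits within 160 ms are considered.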
Our experimental results show that, compared with state-of-the-art methods, EAIS decreases energy consumption by up to 28.02% and improves the system processing capacity by up to 7.22% while meeting the latency SLO. Moreover, EAIS shows good versatility under different latency SLO constraints.
CNN inference tasks
Previous studies [33], [34] indicate that inference applications can be divided into three categories based on their response time requirements: real-time tasks, interactive tasks, and background tasks, as illustrated in Fig. 4. As latency increases, user satisfaction passes through four regions: imperceptible, tolerable, unusable, and avoidable.
- Real-time tasks. They have strict latency requirements, which usually require requests to respond before the deadline (i.e.,
Methodology
In this section, we elaborate upon the design of EAIS, which aims to minimize the total energy consumption while meeting latency SLO. Table 1 summarizes the symbols used in EAIS.
We are concerned only with the energy consumption of GPUs in CNN inference services. Inference is a typical compute-intensive process that places GPUs under heavy load, and GPUs consume substantial power; for example, the peak power of a Tesla V100 GPU is as high as 300 W. The energy consumption of
Experimental environment
In this section, we introduce the experimental environment, including the experimental setup and the energy measurement method. In Section 4.1, we describe the platform and the related parameter settings. In Section 4.2, we collect the instantaneous power using the GPU's built-in sensor and measure it multiple times to reduce the error.
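As one concrete way to realize such a measurement, the sketch below samples the board power via NVML (which reads the GPU's built-in sensor) while a workload runs, integrates the samples into energy, and averages over repetitions. The sampling rate and repeat count are our assumptions, not the paper's stated settings.

```python
import threading
import time

import pynvml

def measure_energy(workload, samples_per_s=100, repeats=5):
    """Run `workload` while sampling instantaneous GPU power via NVML,
    integrate power over time into energy (J), and average over repeats."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    energies = []
    for _ in range(repeats):
        readings, stop = [], threading.Event()

        def sampler():
            while not stop.is_set():
                mw = pynvml.nvmlDeviceGetPowerUsage(handle)   # milliwatts
                readings.append((time.perf_counter(), mw / 1000.0))
                time.sleep(1.0 / samples_per_s)

        t = threading.Thread(target=sampler)
        t.start()
        workload()                 # the inference run being measured
        stop.set()
        t.join()
        # Trapezoidal integration of the power samples -> energy in joules.
        energy = sum((t2 - t1) * (p1 + p2) / 2.0
                     for (t1, p1), (t2, p2) in zip(readings, readings[1:]))
        energies.append(energy)
    pynvml.nvmlShutdown()
    return sum(energies) / len(energies)
```

Averaging over multiple runs smooths out sensor noise and the coarse update granularity of the on-board power sensor.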
Experimental evaluation
In this section, we evaluate the effectiveness of EAIS in decreasing energy consumption while meeting the latency SLO. We also observe the versatility of EAIS under different latency SLO constraints.
Conclusion
In this paper, we propose EAIS, an energy-aware adaptive scheduling framework comprised of a performance model, an asynchronous execution strategy, and an energy-aware scheduler. Extensive experiments show the effectiveness of coordinating dynamic batching and DVFS with the asynchronous execution strategy. EAIS decreases energy consumption by up to 28.02% and improves the system processing capacity by up to 7.22% while meeting the latency SLO, compared with state-of-the-art CNN inference scheduling methods.
CRediT authorship contribution statement
Chunrong Yao: Methodology, Software, Investigation, Visualization, Writing – original draft. Wantao Liu: Conceptualization, Methodology, Writing – review & editing. Weiqing Tang: Writing – review & editing, Supervision. Songlin Hu: Funding acquisition, Project administration, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This research is supported by the National Key Research and Development Program of China (No. 2017YFB1010000).
References (70)
- Energy and performance aware fog computing: A case of DVFS and green renewable energy, Future Gener. Comput. Syst. (2019).
- A smart energy and reliability aware scheduling algorithm for workflow execution in DVFS-enabled cloud environment, Future Gener. Comput. Syst. (2020).
- Experimental and quantitative analysis of server power model for cloud data centers, Future Gener. Comput. Syst. (2018).
- A hardware-aware CPU power measurement based on the power-exponent function model for cloud servers, Inform. Sci. (2021).
- Memory-aware resource management algorithm for low-energy cloud data centers, Future Gener. Comput. Syst. (2020).
- Queueing analysis of GPU-based inference servers with dynamic batching: A closed-form characterization, Perform. Eval. (2021).
- GPGPU power estimation with core and memory frequency scaling, SIGMETRICS Perform. Eval. Rev. (2017).
- Wikipedia workload analysis for decentralized hosting, Comput. Netw. (2009).
- ImageNet classification with deep convolutional neural networks.
- ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. (2015).
- Deep residual learning for image recognition.
- Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell.
- Deep Speech 2: End-to-end speech recognition in English and Mandarin.
- Achieving human parity in conversational speech recognition.
- Google's neural machine translation system: Bridging the gap between human and machine translation.
- DeepDriving: Learning affordance for direct perception in autonomous driving.
- An empirical evaluation of deep learning on highway driving.
- MLaaS: Machine learning as a service.
- Autopilot.
- Cortana.
- Applied machine learning at Facebook: A datacenter infrastructure perspective.
- Google Cloud doubles down on NVIDIA GPUs for inference.
- How to stop data centres from gobbling up the world's electricity.
- TensorRT.
- Deep learning at scale on NVIDIA V100 accelerators.
- Evaluating and analyzing the energy efficiency of CNN inference on high-performance GPU, Concurr. Comput.: Pract. Exper.
- The impact of GPU DVFS on the energy and performance of deep learning: An empirical study.
- Clipper: A low-latency online prediction serving system.
- Nanily: A QoS-aware scheduling for DNN inference workload in clouds.
- BatchSizer: Power-performance trade-off for DNN inference.
- Ebird: Elastic batch for improving responsiveness and throughput of deep learning services.
- Coordinated DVFS and precision control for deep neural networks, IEEE Comput. Archit. Lett.
- TensorFlow-Serving: Flexible, high-performance ML serving.
- A GPU inference system scheduling algorithm with asynchronous data transfer.
Chunrong Yao is a Ph.D. student in School of Computer Science and Engineering of Nanjing University of Science and Technology. He received his M.S. degree from the University of Shanghai for Science and Technology in 2016. His research interests focus on machine learning system and parallel computing.
Wantao Liu is a senior engineer in Institute of Information Engineering, Chinese Academy of Sciences. He received his Ph.D. degree in Computer Science from Beijing University of Aeronautics and Astronautics in 2013. His current research focuses on machine learning system and cloud computing.
Weiqing Tang is a professor in Institute of Computing Technology, Chinese Academy of Sciences. He is also the Secretary-General of China Computer Federation. He received his M.S. degree from the School of Computer Science and Technology, Nanjing University of Science and Technology in 1987. And then he received his Ph.D. from Institute of Computing Technology, Chinese Academy of Sciences in 1993. His research interests include CAD&CG, CSCW, and scientific visualization.
Songlin Hu is a professor in Institute of Information Engineering, Chinese Academy of Sciences. He is a senior member of China Computer Federation. He received his Ph.D. degree from Beijing University of Aeronautics and Astronautics in 2001. In 2005, with the support of the National Fund for Overseas Study, he joined the Middleware System Research Group at the University of Toronto, Canada for one year. His research interests include big data storage, intelligent processing, and knowledge graph.