EAIS: Energy-aware adaptive scheduling for CNN inference on high-performance GPUs
Introduction
Since AlexNet [1] won the ImageNet Large Scale Visual Recognition Challenge [2] in 2012 and reduced the error rate by roughly 10 percentage points, deep neural networks (DNNs) have achieved success in a wide variety of applications, including computer vision [3], [4], speech recognition [5], [6], natural language processing [7], [8], and autonomous driving [9], [10]. These successes have promoted the development of Machine-Learning-as-a-Service [11]. Deep learning has two phases: training and inference. The training phase builds DNN models from existing data, and the inference phase uses the pre-trained models to provide prediction services. The optimization goals of the two phases are very different. The goal of the training phase is to obtain higher accuracy, whereas the inference phase is closer to end-users and its goal is strongly tied to the actual application. For example, pedestrian detection in autonomous vehicles (e.g., Tesla Autopilot [12]) has a hard latency deadline, and missing this deadline may lead to serious traffic accidents. Interactive tasks such as translation tools and speech recognition (e.g., Google Translate and Cortana [13]) can tolerate some delay, but if the response time exceeds a certain threshold, the user experience degrades severely.
With their many-core architecture, Graphics Processing Units (GPUs) are very effective at reading, writing, and processing high-dimensional tensor data in parallel, and can easily achieve speedups of dozens of times over CPUs. However, GPUs consume substantial power, and the GPU clusters supporting the deep learning industry consume enormous amounts of energy. Facebook reports that deep learning inference services answer billions of queries per day and play a greater role than training in its data centers [14]. NVIDIA estimates that 80%–90% of the cost in the artificial intelligence business lies in inference processing [15]. A more immediate concern is that data centers already use more than 200 terawatt-hours of electricity each year, and this figure is still growing [16]. There is therefore a pressing need to decrease the energy consumption of deep learning inference services.
To decrease the energy consumption of convolutional neural network (CNN) inference services on high-performance GPUs, batching and dynamic voltage and frequency scaling (DVFS) are two commonly used techniques. Fig. 1 shows TensorRT [17] performance (i.e., batch execution time and energy per request) under different batch sizes and GPU core frequencies for ResNet-50 running on a Tesla M40 GPU. On the one hand, batching effectively improves throughput and decreases energy consumption. Intuitively, a single inference request cannot fully utilize GPU resources, whereas batching improves GPU memory utilization and the parallelization efficiency of matrix multiplication, thereby improving the energy efficiency of inference services [18], [19]. However, batching comes at the cost of increased execution time, because running a batch of requests takes longer than running a single request, which may violate the latency Service-Level Objective (SLO).
On the other hand, DVFS is a technique that trades off execution time against power consumption [20], [21], [22]. The GPU core frequency drives the arithmetic and logic units and thus determines the execution speed of the streaming multiprocessors. Energy consumption follows a U-shaped trend as the GPU core frequency increases, reaching its lowest point at an intermediate frequency. This is because energy consumption is the product of execution time and power. Decreasing the GPU core frequency reduces the dynamic power, as shown in Eq. (1):

$P_{dynamic} = \alpha C V^{2} f$   (1)

where $f$ is the GPU core frequency, $V$ is the supply voltage, $C$ is the load capacitance, and $\alpha$ is the switching activity factor. At intermediate frequencies, power decreases faster than execution time increases, so energy consumption drops. However, decreasing the GPU core frequency also increases the execution time of GPU kernel functions, as shown in Eq. (2):

$t_{kernel} = \dfrac{N_{cycles}}{f}$   (2)

where $N_{cycles}$ is the number of cycles a kernel executes, which may violate the latency SLO.
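To make the U-shape concrete, Eqs. (1) and (2) can be combined with a static power term into a first-order energy model. The derivation below is a standard textbook approximation, not the paper's own formulation; the static power $P_{static}$ and the near-linear voltage-frequency scaling $V \propto f$ are assumptions.

$$
E(f) = \left(P_{static} + \alpha C V^{2} f\right)\frac{N_{cycles}}{f}
     = \frac{P_{static}\,N_{cycles}}{f} + \alpha C V^{2} N_{cycles}.
$$

With $V \propto f$, the dynamic term grows as $f^{2}$ while the static term shrinks as $1/f$:

$$
E(f) \approx \frac{P_{static}\,N_{cycles}}{f} + \beta N_{cycles} f^{2},
\qquad
\frac{dE}{df} = 0 \;\Rightarrow\; f^{*} = \left(\frac{P_{static}}{2\beta}\right)^{1/3},
$$

so energy is minimized at an intermediate frequency $f^{*}$, matching the U-shaped trend in Fig. 1.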
Most previous works [23], [24], [25], [26], [27] typically perform only batching or only DVFS, which is not enough to minimize energy consumption. A typical system that relies on batching is Clipper [23], a general-purpose low-latency online prediction serving system that adjusts the batch size dynamically with an additive-increase-multiplicative-decrease (AIMD) scheme. However, under burst workloads many requests miss the latency SLO. Nanily [24] and BatchSizer [25] adopt adaptive batching to schedule CNN inference requests, always applying the largest batch size possible. PIT [27] coordinates DVFS and precision to trade off accuracy against power consumption within a given response time. Although PIT decreases energy consumption, it does so at the expense of accuracy.
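The AIMD batch-size scheme can be summarized in a few lines. The sketch below is our own illustration of the general idea, not Clipper's code; `measure_latency` stands in for a profiled or measured per-batch latency probe and is hypothetical.

```python
def aimd_batch_size(measure_latency, slo_ms, max_batch=1024,
                    additive_step=1, backoff=0.9):
    """Additive-increase-multiplicative-decrease batch sizing (Clipper-style).

    measure_latency(batch_size) -> observed latency in ms (hypothetical probe).
    Grows the batch size by `additive_step` while the SLO holds, then scales
    it down by `backoff` on the first violation.
    """
    batch = 1
    while batch + additive_step <= max_batch:
        if measure_latency(batch + additive_step) <= slo_ms:
            batch += additive_step                 # additive increase
        else:
            batch = max(1, int(batch * backoff))   # multiplicative decrease
            break
    return batch
```

In a real serving loop this control step would be re-run continuously, which is exactly why a fixed feedback rule struggles on burst workloads: the batch size lags behind sudden rate changes.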
Besides, most batching systems maintain the request queue on the CPU side [23], [28]. Fig. 2 shows the detailed processing of batched inference requests: ① multiple inference requests are organized into a batch; ② the input data of the whole batch are uploaded to the GPU together via PCI-e; ③ the GPU processes the batch; ④ the results of all requests are returned to the CPU together. However, this batching mechanism increases the fixed energy consumption of the GPU and decreases the system processing capacity, because the GPU sits idle while data are transferred between the CPU and the GPU. Modern GPUs support asynchronous execution [29], which makes it possible to overlap data transmission with computation. Ideally, a well-tuned asynchronous execution strategy increases system throughput by 5%–40% and decreases energy consumption by 1%–10%, as shown in Fig. 6. However, simply overlapping data transmission and computation is not enough, because the request rate of real workloads changes dynamically. Batching and DVFS must be coordinated across adjacent batches according to the workload to improve the system processing capacity.
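As an illustration of step ② overlapping step ③, the following PyTorch sketch uploads batch i+1 on a dedicated CUDA stream while the GPU executes batch i. It is a minimal example of the general technique, not EAIS's implementation; the model, batch shape, and batch count are placeholders.

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()
copy_stream = torch.cuda.Stream()   # dedicated stream for HtoD copies
# Pinned host memory is required for truly asynchronous HtoD transfers.
batches = [torch.randn(16, 3, 224, 224).pin_memory() for _ in range(8)]

def upload_async(host_batch):
    """Issue an HtoD copy on the side stream; return the device tensor
    and an event that fires when the copy completes."""
    with torch.cuda.stream(copy_stream):
        dev = host_batch.to("cuda", non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)
    return dev, done

results = []
with torch.no_grad():
    dev, done = upload_async(batches[0])
    for i in range(len(batches)):
        torch.cuda.current_stream().wait_event(done)  # batch i's data ready
        cur = dev
        cur.record_stream(torch.cuda.current_stream())  # allocator safety
        if i + 1 < len(batches):
            dev, done = upload_async(batches[i + 1])  # overlap next upload
        out = model(cur)            # compute batch i while batch i+1 uploads
        results.append(out.cpu())   # DtoH (synchronous here for simplicity)
```

Because the copy runs on its own stream against pinned memory, the PCI-e transfer of the next batch proceeds while the streaming multiprocessors are busy with the current one, removing the idle gap shown in Fig. 3.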
Our goal is to minimize energy consumption while meeting latency SLO for CNN inference services. However, finding effective solutions is challenging for the following reasons.
(1) Providing valid information about performance characteristics is challenging. When coordinating batching and DVFS settings, the inference scheduling framework must consider the impact of both batch size and GPU core frequency on performance, so the configuration space is large. For example, a Tesla V100 GPU offers 187 core frequency levels, and the maximum batch size of ResNet-50 reaches 1024, yielding more than 190 thousand possible combinations. It is unrealistic to exhaustively search all configurations offline for the most energy-efficient one that still meets the response time requirement. A commonly used remedy is to profile only values that differ significantly from each other to shrink the configuration space, but this introduces large errors. The missing information must be complemented from sampled data, which poses significant challenges for profiling and modeling.
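A common way to tame this space is to profile a coarse grid of configurations and fill in the rest by modeling. The sketch below shows one hypothetical way to collect such a grid with pynvml; the grid spacing, the ResNet-50 probe, and the single power sample per point are our simplifying assumptions, and locking application clocks requires root privileges.

```python
import itertools
import time

import pynvml
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()

def run_batch(batch_size):
    """Probe: time one ResNet-50 forward pass (stand-in for the real service)."""
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0   # ms

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Profile a coarse grid instead of all ~190k (frequency, batch) pairs.
mem_clock = pynvml.nvmlDeviceGetSupportedMemoryClocks(handle)[0]
all_freqs = pynvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_clock)
freqs = sorted(all_freqs)[::20]                      # every 20th level
batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

profile = {}
for freq, batch in itertools.product(freqs, batch_sizes):
    # Lock GPU core clocks (needs root; persistence mode recommended).
    pynvml.nvmlDeviceSetApplicationsClocks(handle, mem_clock, freq)
    latency_ms = run_batch(batch)
    # Single sample for brevity; averaging during the run is more accurate.
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # mW -> W
    energy_per_req = power_w * (latency_ms / 1000.0) / batch    # J/request
    profile[(freq, batch)] = (latency_ms, energy_per_req)

pynvml.nvmlDeviceResetApplicationsClocks(handle)
pynvml.nvmlShutdown()
```

The resulting `profile` table is the kind of sampled data a performance model would then interpolate to cover the unprofiled configurations.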
(2) Overlapping data transmission and computation between adjacent batches is challenging. Fig. 3 shows the execution process of adjacent batches on the GPU, where "HtoD" (Host-to-Device) and "DtoH" (Device-to-Host) indicate moving data from CPU memory to GPU memory and from GPU memory to CPU memory, respectively. Different batches are processed sequentially, meaning the data of the next batch are not uploaded to the GPU until the results of the current batch have been returned to the CPU. The GPU is idle while data are transferred between the CPU and the GPU, which not only leaves the GPU underutilized but also increases its static energy consumption. In addition, the DVFS setting affects not only the GPU execution time and energy consumption of the current batch, but also the upload time of the next batch.
(3) Dealing with fluctuating workloads is challenging. In real inference services, requests usually arrive in a stochastic and bursty manner [30], [31], [32]. This means that the number of requests per unit time fluctuates with user demand, as shown in Fig. 8 and Fig. 9. Consider, for instance, a web-serving application that employs CNN models to generate query results: the arrival rate of queries is random and unpredictable, yet the application's response time must still meet the latency SLO. It is therefore challenging to deploy a scheduling strategy with adaptive ability.
In this paper, we propose EAIS (Energy-Aware Inference Scheduling), an energy-aware adaptive scheduling framework that addresses the above challenges. EAIS comprises a performance model, an asynchronous execution strategy, and an energy-aware scheduler. The performance model provides valid information about the performance characteristics of CNN inference services, which is a prerequisite for running EAIS. The asynchronous execution strategy overlaps data upload with GPU execution and builds a model to capture the relationship between request rate and latency. The energy-aware scheduler adapts an energy-efficiency greedy algorithm to coordinate batching and DVFS under fluctuating workloads. Our main contributions are as follows.
- A comprehensive analysis of performance characteristics. We build a performance model to shrink the feasible configuration space.
- An asynchronous execution strategy. We overlap data upload and GPU execution to improve the system processing capacity.
- An energy-aware adaptive scheduling policy. We design an energy-aware scheduler based on a greedy algorithm to minimize energy consumption while meeting the latency SLO (a minimal sketch of the selection step follows this list).
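To make the greedy selection concrete, the sketch below picks, from a profiled table like the one built earlier, the configuration with the lowest energy per request whose latency still fits the remaining SLO budget. It illustrates the general idea only, not EAIS's exact algorithm, which additionally coordinates adjacent batches; the `profile` table format and the queue-wait budget are assumptions carried over from the earlier sketch.

```python
def pick_config(profile, slo_ms, queue_wait_ms):
    """Greedy energy-aware selection (illustrative, not EAIS's exact algorithm).

    profile: {(core_freq_mhz, batch_size): (latency_ms, energy_per_request_j)}
    Returns the feasible (frequency, batch) pair with minimal energy/request.
    """
    budget = slo_ms - queue_wait_ms      # time left before the SLO expires
    feasible = [(energy, cfg)
                for cfg, (latency, energy) in profile.items()
                if latency <= budget]
    if not feasible:
        # No configuration fits: fall back to the fastest one to limit misses.
        return min(profile, key=lambda cfg: profile[cfg][0])
    return min(feasible)[1]              # lowest energy per request
```

For example, with a 200 ms SLO and 40 ms already spent queuing, only configurations whose batch latency fits within 160 ms are considered.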
Our experimental results show that, compared with state-of-the-art methods, EAIS decreases energy consumption by up to 28.02% and improves the system processing capacity by up to 7.22% while meeting the latency SLO. Moreover, EAIS shows good versatility under different latency SLO constraints.
CNN inference tasks
Previous studies [33], [34] indicate that inference applications can be divided into three categories based on their response time requirements: real-time tasks, interactive tasks, and background tasks, as illustrated in Fig. 4. As latency increases, user satisfaction passes through four regions: imperceptible, tolerable, unusable, and avoidable.
- Real-time tasks. They have strict latency requirements, which usually require requests to respond before the deadline (i.e.,
Methodology
In this section, we elaborate upon the design of EAIS, which aims to minimize the total energy consumption while meeting latency SLO. Table 1 summarizes the symbols used in EAIS.
We are concerned only with the energy consumption of GPUs in CNN inference services. Inference is a typical compute-intensive process that places GPUs under heavy load, and GPUs consume substantial power; for example, the peak power of a Tesla V100 GPU is as high as 300 W. The energy consumption of
Experimental environment
In this section, we introduce the experimental environment, including the experimental setup and the energy measurement method. In Section 4.1, we describe the platform and the related parameter settings. In Section 4.2, we collect the instantaneous power using the GPU's built-in sensor and measure it multiple times to reduce the error.
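As one concrete way to realize such a measurement, the sketch below samples the board power via NVML (which reads the GPU's built-in sensor) while a workload runs, integrates the samples into energy, and averages over repetitions. The sampling rate and repeat count are our assumptions, not the paper's stated settings.

```python
import threading
import time

import pynvml

def measure_energy(workload, samples_per_s=100, repeats=5):
    """Run `workload` while sampling instantaneous GPU power via NVML,
    integrate power over time into energy (J), and average over repeats."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    energies = []
    for _ in range(repeats):
        readings, stop = [], threading.Event()

        def sampler():
            while not stop.is_set():
                mw = pynvml.nvmlDeviceGetPowerUsage(handle)   # milliwatts
                readings.append((time.perf_counter(), mw / 1000.0))
                time.sleep(1.0 / samples_per_s)

        t = threading.Thread(target=sampler)
        t.start()
        workload()                 # the inference run being measured
        stop.set()
        t.join()
        # Trapezoidal integration of the power samples -> energy in joules.
        energy = sum((t2 - t1) * (p1 + p2) / 2.0
                     for (t1, p1), (t2, p2) in zip(readings, readings[1:]))
        energies.append(energy)
    pynvml.nvmlShutdown()
    return sum(energies) / len(energies)
```

Averaging over multiple runs smooths out sensor noise and the coarse update granularity of the on-board power sensor.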
Experimental evaluation
In this section, we evaluate the effectiveness of EAIS in decreasing energy consumption while meeting the latency SLO. We also observe the versatility of EAIS under different latency SLO constraints.
Conclusion
In this paper, we propose EAIS, an energy-aware adaptive scheduling framework comprised of a performance model, an asynchronous execution strategy, and an energy-aware scheduler. Extensive experiments show the effectiveness of coordinating dynamic batching and DVFS with the asynchronous execution strategy. EAIS decreases energy consumption by up to 28.02% and improves the system processing capacity by up to 7.22% while meeting the latency SLO, compared with state-of-the-art CNN inference scheduling methods.
CRediT authorship contribution statement
Chunrong Yao: Methodology, Software, Investigation, Visualization, Writing – original draft. Wantao Liu: Conceptualization, Methodology, Writing – review & editing. Weiqing Tang: Writing – review & editing, Supervision. Songlin Hu: Funding acquisition, Project administration, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This research is supported by the National Key Research and Development Program of China (No. 2017YFB1010000).
References (70)
- Energy and performance aware fog computing: A case of DVFS and green renewable energy, Future Gener. Comput. Syst. (2019).
- A smart energy and reliability aware scheduling algorithm for workflow execution in DVFS-enabled cloud environment, Future Gener. Comput. Syst. (2020).
- Experimental and quantitative analysis of server power model for cloud data centers, Future Gener. Comput. Syst. (2018).
- A hardware-aware CPU power measurement based on the power-exponent function model for cloud servers, Inform. Sci. (2021).
- Memory-aware resource management algorithm for low-energy cloud data centers, Future Gener. Comput. Syst. (2020).
- Queueing analysis of GPU-based inference servers with dynamic batching: A closed-form characterization, Perform. Eval. (2021).
- GPGPU power estimation with core and memory frequency scaling, SIGMETRICS Perform. Eval. Rev. (2017).
- Wikipedia workload analysis for decentralized hosting, Comput. Netw. (2009).
- ImageNet classification with deep convolutional neural networks.
- ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. (2015).
- Deep residual learning for image recognition.
- Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell.
- Deep Speech 2: End-to-end speech recognition in English and Mandarin.
- Achieving human parity in conversational speech recognition.
- Google's neural machine translation system: Bridging the gap between human and machine translation.
- DeepDriving: Learning affordance for direct perception in autonomous driving.
- An empirical evaluation of deep learning on highway driving.
- MLaaS: Machine learning as a service.
- Autopilot.
- Cortana.
- Applied machine learning at Facebook: A datacenter infrastructure perspective.
- Google Cloud doubles down on NVIDIA GPUs for inference.
- How to stop data centres from gobbling up the world's electricity.
- TensorRT.
- Deep learning at scale on NVIDIA V100 accelerators.
- Evaluating and analyzing the energy efficiency of CNN inference on high-performance GPU, Concurr. Comput.: Pract. Exper.
- The impact of GPU DVFS on the energy and performance of deep learning: An empirical study.
- Clipper: A low-latency online prediction serving system.
- Nanily: A QoS-aware scheduling for DNN inference workload in clouds.
- BatchSizer: Power-performance trade-off for DNN inference.
- Ebird: Elastic batch for improving responsiveness and throughput of deep learning services.
- Coordinated DVFS and precision control for deep neural networks, IEEE Comput. Archit. Lett.
- TensorFlow-Serving: Flexible, high-performance ML serving.
- A GPU inference system scheduling algorithm with asynchronous data transfer.
Chunrong Yao is a Ph.D. student in School of Computer Science and Engineering of Nanjing University of Science and Technology. He received his M.S. degree from the University of Shanghai for Science and Technology in 2016. His research interests focus on machine learning system and parallel computing.
Wantao Liu is a senior engineer in Institute of Information Engineering, Chinese Academy of Sciences. He received his Ph.D. degree in Computer Science from Beijing University of Aeronautics and Astronautics in 2013. His current research focuses on machine learning system and cloud computing.
Weiqing Tang is a professor in Institute of Computing Technology, Chinese Academy of Sciences. He is also the Secretary-General of China Computer Federation. He received his M.S. degree from the School of Computer Science and Technology, Nanjing University of Science and Technology in 1987. And then he received his Ph.D. from Institute of Computing Technology, Chinese Academy of Sciences in 1993. His research interests include CAD&CG, CSCW, and scientific visualization.
Songlin Hu is a professor in Institute of Information Engineering, Chinese Academy of Sciences. He is a senior member of China Computer Federation. He received his Ph.D. degree from Beijing University of Aeronautics and Astronautics in 2001. In 2005, with the support of the National Fund for Overseas Study, he joined the Middleware System Research Group at the University of Toronto, Canada for one year. His research interests include big data storage, intelligent processing, and knowledge graph.