EdgeWise: Energy-efficient CNN Computation on Edge Devices under Stochastic Communication Delays

Authors:

Mehdi Ghasemi,

Daler Rakhmatov,

Carole-Jean Wu,

Sarma Vrudhula

ACM Transactions on Embedded Computing Systems (TECS), Volume 21, Issue 5

Article No.: 66, Pages 1 - 27

https://doi.org/10.1145/3530908

Published: 08 October 2022

Abstract

This article presents a framework to enable the energy-efficient execution of convolutional neural networks (CNNs) on edge devices. The framework consists of a pair of edge devices connected via a wireless network: a performance and energy-constrained device D as the first recipient of data and an energy-unconstrained device N as an accelerator for D. Device D decides on-the-fly how to distribute the workload with the objective of minimizing its energy consumption while accounting for the inherent uncertainty in network delay and the overheads involved in data transfer. These challenges are tackled by adopting the data-driven modeling framework of Markov Decision Processes, whereby an optimal policy is consulted by D in O(1) time to make layer-by-layer assignment decisions. As a special case, a linear-time dynamic programming algorithm is also presented for finding optimal layer assignment at once, under the assumption that the network delay is constant throughout the execution of the application. The proposed framework is demonstrated on a platform comprised of a Raspberry PI 3 as D and an NVIDIA Jetson TX2 as N. An average improvement of 31% and 23% in energy consumption is achieved compared to the alternatives of executing the CNNs entirely on D and N. Two state-of-the-art methods were also implemented and compared with the proposed methods.

1 Introduction

The large-scale and rapid pace of deployment of Internet of Things (IoT) makes it extremely cost sensitive. Consequently, there is an increasing reliance on the commercial-off-the-shelf (COTS) approach for the design of IoT devices, which are limited in their functionality, storage, performance, and energy capacity [9]. However, because the complexity of the tasks to be performed by IoT devices continues to grow [6, 30, 32, 38], the conventional approach is to perform all the data processing on remote cloud servers. This approach is not scalable due to the large number of competing devices, long communication delays, the energy overhead of reaching the cloud, and the limitations of the network bandwidth [12, 18].

For example, consider a drone in Figure 1 that needs to run state-of-the-art object recognition algorithms for its navigation [37]. These drones will use low-end COTS processors and, being battery powered, have limited energy capacity and consequently have to rely on cloud servers for most of the processing.

Fig. 1.

The aforementioned problem can be addressed by edge computing where some or all of the data processing takes place closer to the point of data acquisition [29]. The first recipient of the data is the edge device referred to here as the embedded IoT device D. Due to the limited capabilities of D, a practical approach is to make available another edge device, i.e., a nearby wireless network device N, in close proximity to D, that can share the computational burden of one or more IoT devices. The objective of sharing the computation between D and N is to reduce the energy consumption of D by having it transfer only the necessary data to N, with N performing the compute-intensive blocks and returning the results to D. This is referred to as embedded workload reduction (EWR), and its feasibility and utility depend on a number of factors including the granularity of the blocks, the performance and power consumption of the two devices, the communication network bandwidth and delay, and the data transfer scheme between the two devices. With WiFi, this is further complicated by the uncertainty in the communication delay, due to possible signal interference, variations in the distance between the devices, and so on.

1.1 Overview of the Article

This article presents a framework called EdgeWise, which provides an efficient and optimal solution to the EWR problem targeting the energy minimization of D assuming that the communication delay between D and N is stochastic. While the general approach presented here can be utilized for various applications, we chose to demonstrate our approach on convolutional neural networks (CNNs) not only due to their compute-intensive nature but also because of their increasing use on edge devices. More advanced networks such as Recurrent Neural Networks [35] used in natural language processing and recommender systems [17] require at least an order of magnitude more resources. An input to a CNN, which will be referred to as a frame, is processed by a sequence of CNN layers, where each layer is to be executed either on D or on N. A decision as to which device executes a given CNN layer is made by D at runtime.

Two EWR models are described. One, referred to as fine grained, involves sampling the network delay prior to the execution of each CNN layer. The other, referred to as coarse grained, involves sampling the network delay prior to the execution of an entire frame and assumes that this value remains unchanged during the execution of all layers. The communication delay is modeled as a sum of two components—a deterministic component that depends only on the amount of data to be transferred and a stochastic component, referred to as round-trip time (RTT).

For the fine-grained model, a general, non-parametric (i.e., no assumption about the distribution of RTT) and data-driven solution is presented, in which the system behavior over time is modeled as a Markov Decision Process (MDP) [7]. A solution to the MDP is a policy that specifies an action (i.e., selecting D or N to execute a given CNN layer) in each state of the system. The policy is implemented as a lookup table, constructed once offline, and then consulted during the runtime in \(O(1)\) time whenever a decision needs to be made. For the coarse-grained model, which is a special case of the more complex fine-grained model, a dynamic programming solution is presented. This optimal solution is constructed by D online in \(O(L)\) time, where L is the number of CNN layers. It provides a complete layer assignment in advance, assuming that the measured RTT does not change during the execution of a frame.

In addition to the stochastic nature of the communication delays, another important aspect of the EWR problem addressed in this work concerns the reliability of the communication channel, which is important in the case of WiFi. If a sequence of CNN layers is assigned to N, and the WiFi connection is suddenly lost, then D would have to recompute all those layers on its own. This situation would make the overall energy consumption of D (as well as the application execution latency) even worse than in the case of N not being used. Hence, the implication of three data transfer schemes on the energy consumption of D is explored in this article: (1) a conservative scheme in which the results after executing each layer on N are sent back to D, (2) an optimistic scheme in which N sends back results to D only when necessary (e.g., when the next computation is scheduled on D), and (3) a selective scheme in which either the conservative or optimistic scheme is selected before a frame is processed, based on a prediction of the network stability.

Extensive evaluation of the proposed solutions is performed on a real platform with a Raspberry PI 3 Model B (RPI) for D and an NVIDIA Jetson TX2 for N. The benchmark applications include several well-known compute-intensive CNN algorithms [19, 20, 25]. The computation delay and power consumption of each layer within these algorithms as well as that of the wireless unit were obtained by repeated direct measurements. These were used to determine the fine-grained and the coarse-grained solutions. The experimental results also include the two cases, corresponding to a CNN application being executed entirely on either D (All-D) or N (All-N). An average improvement of 31% and 23% in energy consumption was achieved compared to these cases. In addition, two state-of-the-art offloading methods [13, 22] were implemented and compared with the proposed method in terms of energy efficiency and the overhead involved when making the decisions.

1.2 Summary of Contributions

In summary, the novel contributions of this article include the following:

•

a fine-grained MDP model that accounts for the stochastic communication delays between an energy-constrained embedded IoT device D and a wirelessly accessible nearby accelerator device N, collaboratively executing CNN applications (hosted by D) with the goal of minimizing the energy consumption of D;

•

a lightweight use of the optimal fine-grained MDP solution (constructed offline in the form of a policy lookup table), that is consulted online in O(1) time to decide which device, D or N, will execute a current CNN layer, given a currently observed network delay;

•

a coarse-grained optimal layer assignment between D and N, constructed online in linear time under the assumption that the network delay measured at the beginning of a given frame stays fixed for the duration of that frame being processed by a CNN; and

•

a study of three different schemes for transferring the layer output data from N back to D (when N is responsible for the execution of one or more CNN layers), taking into account the stability of a wireless link between D and N.

1.3 Organization of the Article

The rest of this article is organized as follows: The survey of prior work in relation to the EWR problem appears in Section 2. Section 3 describes the system components, the communication and energy models, data transfer schemes, and a justification of the model assumptions. The fine-grained MDP formulation and its solution method are presented in Section 4. The coarse-grained model, which results in a much simpler, albeit less energy optimal, solution is presented in Section 5. An extensive discussion of the experimental setup and results appears in Sections 6 and 7. Conclusions are presented in Section 8.

2 Prior Work

The majority of the existing work related to this article focuses on offloading computation from a high-end mobile device to a cloud server, as opposed to our case that involves a highly constrained low-end IoT device coordinating with a wirelessly accessible accelerator nearby. Different frameworks [10, 14] and methods have been presented in the literature to implement the offloading of computationally intensive code fragments from a mobile device to a cloud server.

Kaya et al. [23] use the call graph of a Java program and bipartition the graph using the FM heuristic method [15]. The method was demonstrated on conventional object recognition algorithms (as opposed to Deep Neural Networks (DNNs)).

Recently, attention has shifted toward DNN algorithms. The computationally intensive nature of these algorithms and their high power consumption make offloading essential for mobile systems. Jeong et al. [21] present a way to implement DNNs by encapsulating the whole execution state as a web application, which is then transferred to an edge device or cloud server. This approach involves a high overhead of sending the whole model to the edge server.

There are some studies that present computation offloading algorithms where only the data is sent to the cloud server to get the result. Glimpse [8] performs real-time object recognition on mobile devices where the entire computation is done on the cloud server. However, when the delay of reaching the cloud server exceeds the frame time, the method uses an active cache of frames to estimate the object classes. This approach relies on the fact that consecutive frames typically do not exhibit sudden scene changes.

MoDNN [31] and DeepThings [43] are instances of frameworks that distribute the computation of input data among a set of mobile devices connected using a wireless connection. Autoscale [24] is a framework for mid-end mobile devices based on reinforcement learning that selects the execution target for DNNs with the options of CPU, co-processors, and an edge device connected in a wireless network. In this framework, the computation of all layers is offloaded to computing resources. Pipe-it framework [40] partitions the layers of DNNs across big and LITTLE cores on a mobile MPSoC to achieve higher throughput.

The idea of collaborative execution of DNNs at the granularity of a single layer between the user-end device and the server was demonstrated in frameworks called Neurosurgeon [22] and Edgnet [28]. The partitions produced by these methods may not be optimal for all cases. In JointDNN [13], the overhead of sending the data of different layers is first computed and then used to find an optimal assignment using 0-1 ILP. The above methods perform computation offloading based on the strong assumption that the communication delay is deterministic. In a shared wireless medium, this is most often not the case, and the communication time can vary substantially even over short intervals. Furthermore, the 0-1 ILP optimization approach used in JointDNN is generally not scalable and impractical for real-time processing on low-end IoT devices. A comparison of our approach with Reference [22] and Reference [13] is presented in Section 7.3.

Gao et al. [16] consider the variations in the execution times of applications during code-offloading from a mobile device to a cloud server. The stochastic execution time of an application is modeled using a semi-Markov chain. Estimates of energy savings in the different states of that semi-Markov model are used to decide whether or not a computation should be performed on the cloud server. No optimization is involved in this approach.

The problem of task placement in a distributed system with several users and edge servers, assuming stochastic communication delays, was first addressed in References [39, 41] and later extended in Reference [5] to maximize a measure of quality of service for the users subject to a constraint on the energy consumption of the edge servers. In these papers, it is assumed that the entire computation takes place on the edge servers (i.e., no collaborative execution between user-end device and edge server), and their focus is on the distributed systems with general workloads.

In this article, an efficient approach is proposed to address the variability of network delay during the execution of CNNs on a highly resource constrained embedded IoT device that is wirelessly linked to a nearby accelerator device. To minimize its energy consumption, the IoT device executes its CNN applications in tandem with the accelerator device at the granularity of individual CNN layers, while considering the stochastic overhead of communication.

In comparison to the existing literature, the novelty of this work arises mainly from the combination of the following important factors: (1) the primary focus of this work is on low-end embedded IoT devices, as opposed to less-constrained high-end mobile devices coupled with cloud servers; (2) the stochastic communication delay is explicitly taken into account using an effective MDP modeling framework; (3) optimal and lightweight solutions are presented to solve the energy minimization problems at hand, with extensive experimental validation; and (4) the performance of different data transfer schemes is explored for reducing the energy consumption of the IoT device.

3 System Model

In this section, the main components of the system model are identified, and relevant assumptions are described.

Application Model: An application is modeled as a directed acyclic graph (DAG) \(G=\lbrace V_{G},E_{G}\rbrace\), where the vertices \(V_G\) represent the computational blocks or functions and the edges \(E_G\) represent data dependencies. The nodes of a DAG can be topologically sorted, so that each node is assigned to a level. The level of a primary input is 0, and the level of any given node is one plus the maximum level of all its predecessors. A line graph is a DAG where each node has a unique level. A CNN can be modeled as a line graph [42], in which each layer (node) constitutes a single block of computation.
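The level definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the function names `node_levels` and `is_line_graph` and the three-layer example chain are assumptions made here.

```python
from collections import defaultdict, deque

def node_levels(nodes, edges):
    """Return {node: level}. Primary inputs have level 0; any other node has
    level one plus the maximum level of its predecessors."""
    preds = defaultdict(list)
    succs = defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
        indegree[v] += 1

    level = {}
    ready = deque(v for v in nodes if indegree[v] == 0)
    while ready:  # process nodes in topological order
        v = ready.popleft()
        level[v] = 1 + max((level[u] for u in preds[v]), default=-1)
        for w in succs[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(level) != len(nodes):
        raise ValueError("graph contains a cycle")
    return level

def is_line_graph(nodes, edges):
    """A DAG is a line graph when every node has a unique level."""
    levels = node_levels(nodes, edges)
    return len(set(levels.values())) == len(nodes)

# Hypothetical three-layer chain: each layer feeds the next one.
chain_nodes = ["conv1", "pool1", "fc1"]
chain_edges = [("conv1", "pool1"), ("pool1", "fc1")]
chain_levels = node_levels(chain_nodes, chain_edges)
```

For the chain, each layer receives its own level, so `is_line_graph` returns `True`; a graph with two parallel branches would place two nodes at the same level and return `False`.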

Processing Elements: As stated in Section 1, there are two main processing engines: an embedded IoT device D with limited energy capacity and limited performance and a nearby accelerator device N that can deliver much higher performance and is not energy constrained. It is assumed that the application code is available on both D and N, and only the data captured by D and block-specific parameters are transmitted to N as needed. When D decides to let N execute some computational block, D uses its WiFi port to transfer the input data of that block to N and then to receive the results from N. While N is performing its assigned computations, D listens to its WiFi port waiting to receive the data from N.

3.1 Communication Delay Model

The communication delay between D and N using a wireless link exhibits substantial variations. This delay is affected by a number of factors, such as the communication distance, packet size, the traffic load, interference by neighboring devices, and so on. It is commonly expressed as [4]

\begin{equation} \delta _{D,N} = S/B + RTT, \end{equation}

(1)

where RTT is the round-trip time or network delay (ms), B is the bandwidth (Mbps), and S is the size of the data (Kb). The bandwidth of the wireless communication channel (B) is assumed to be fixed, whereas RTT captures all the sources of network delay variations and consequently is assumed to be a non-negative random variable with an arbitrary distribution. If the bandwidth were also treated as a random variable, then the communication delay would be modeled as a sum of two random variables.
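As a quick check of the units in Equation (1), the following Python sketch evaluates the delay for one transfer. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
def communication_delay_ms(size_kb, bandwidth_mbps, rtt_ms):
    """Equation (1): delta = S / B + RTT.

    With S in kilobits and B in megabits per second, S / B is in milliseconds,
    so it can be added directly to an RTT given in milliseconds.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    if rtt_ms < 0:
        raise ValueError("RTT is modeled as a non-negative quantity")
    return size_kb / bandwidth_mbps + rtt_ms

# Hypothetical transfer: 800 Kb over a 20 Mbps link with a sampled RTT of 5 ms.
delay = communication_delay_ms(size_kb=800, bandwidth_mbps=20, rtt_ms=5)
```

Here the deterministic part contributes 40 ms and the sampled RTT contributes 5 ms, for a total of 45 ms.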

Figure 2 illustrates the significance of the wireless RTT, observed in our experimental setup consisting of a RPI, serving as D, and an NVIDIA Jetson TX2 (Jetson), serving as N. Note the substantial variations in communication delay when using WiFi as compared to wired LAN communication. Although the round-trip time varies for a wired connection as well, the magnitude of those variations is negligible in comparison with the execution time of computationally heavy nodes in the CNN flow graph.

Fig. 2.

3.2 Energy Model

Our objective is to determine the best assignment of computation blocks (represented by nodes in \(V_G\)) to the devices D and N such that the energy consumption of D is minimized during the application execution. Given \(G=(V_{G},E_{G})\), let \(m_i = \text{0 or 1}\) if node \(i \in V_G\) is to be executed on D or N, respectively. The energy consumption of D, denoted by \(E_D\), consists of several components, described below. Section 6 discusses how realistic estimates of these quantities are obtained.

(1)

\(E_D(i|D)\) and \(E_D(i|N)\) denote the computation energy consumption of D when the computation block represented by node \(i \in V_G\) is executed on device D and on device N, respectively. These quantities depend on the corresponding execution times \(\delta (i|D)\) and \(\delta (i|N)\). In particular, \(E_D(i|N)\) is equal to the energy consumed by D while it is idle and listening to its WiFi port for incoming data.

(2)

\(E_D(i|x,j|y)\) denotes the communication energy consumption due to the data transfer associated with edge \((i,j) \in E_G\), when node i is executed on device x and node j is executed on device y, where \(x,y \in \lbrace D,N\rbrace\). The meaning of each of the four possibilities is explained below.

(a)

\(E_D(i|D,j|N)\) is the communication energy consumption of D when node i is executed on D and node j is executed on N. As \((i,j) \in E_G\), the output of node i is the input to node j. Therefore, \(E_D(i|D,j|N)\) is the energy consumed by D when transferring the results of node i from D to N.

(b)

\(E_D(i|N, j|D)\) is the communication energy consumption of D when node i is executed on N and node j is executed on D. As \((i,j) \in E_G\), the output of i is the input to j. Therefore, \(E_D(i|N, j|D)\) is the energy consumed by D when receiving i’s results from N.

(c)

\(E_D(i|N, j|N)\) has two interpretations. In the case of our conservative data transfer scheme (where the results of i computed by N must be transferred back to D), \(E_D(i|N, j|N)\) is the same as \(E_D(i|N, j|D)\). However, in the case of our optimistic data transfer scheme (where D does not receive a copy of i’s results), \(E_D(i|N, j|N)\) is zero.

(d)

\(E_D(i|D,j|D) = 0\). As nodes i and j are both executed on D, there is no communication energy consumed by the data transfer.

The computation and communication energy consumption of D can now be expressed in terms of the above quantities and the decision variables as follows:

\begin{equation} E_{D}(i)= m_{i}E_D(i|N)+(1-m_{i})E_{D}(i|D), \end{equation}

(2)

\begin{equation} E_{D}(i, j) = m_{i}m_{j}E_D(i|N,j|N)+m_{i} (1 - m_j)E_D(i|N,j|D)+(1 - m_i)m_{j}E_D(i|D,j|N). \end{equation}

(3)

Our objective is to minimize the total energy consumption of D during the application execution, including the computation energy and communication energy. This is denoted by \(E_{tot}\).

\begin{equation} Minimize \quad \left(E_{tot}=\sum _{i\in V_{G}}^{} E_{D}(i) + \sum _{(i,j)\in E_{G}}^{} E_{D}(i,j)\right)\!. \end{equation}

(4)
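For a line graph, Equations (2)-(4) can be evaluated directly, and the minimum can be found with a two-state dynamic program over the chain. The Python sketch below is a generic illustration under stated assumptions, not the algorithm of Section 5: the function names and all energy values are hypothetical, and the cost of uploading the initial input when the first layer runs on N is assumed to be folded into `comp[0][1]`.

```python
from itertools import product

def total_energy(m, comp, comm):
    """Equation (4) for a line graph.

    m[i] is 0 (layer i runs on D) or 1 (layer i runs on N).
    comp[i][m_i] is the computation energy of D for layer i, as in Equation (2).
    comm[i][(m_i, m_j)] is the transfer energy on edge (i, i+1), as in Equation (3).
    """
    node_part = sum(comp[i][m[i]] for i in range(len(m)))
    edge_part = sum(comm[i][(m[i], m[i + 1])] for i in range(len(m) - 1))
    return node_part + edge_part

def best_assignment(comp, comm):
    """Two-state dynamic program over the chain; runs in O(L) time."""
    cost = [comp[0][0], comp[0][1]]  # best cost with layer 0 on D / on N
    back = []                        # back[i-1][cur] = best device for layer i-1
    for i in range(1, len(comp)):
        new_cost, choice = [], []
        for cur in (0, 1):
            prev = min((0, 1), key=lambda p: cost[p] + comm[i - 1][(p, cur)])
            choice.append(prev)
            new_cost.append(cost[prev] + comm[i - 1][(prev, cur)] + comp[i][cur])
        cost = new_cost
        back.append(choice)
    last = min((0, 1), key=lambda d: cost[d])
    m = [last]
    for choice in reversed(back):    # walk the back-pointers to layer 0
        m.append(choice[m[-1]])
    m.reverse()
    return m, cost[last]

# Hypothetical energies (arbitrary units) for a three-layer CNN.
comp = [(2.0, 6.0), (8.0, 1.5), (3.0, 0.5)]                # (on D, on N) per layer
edge = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}  # optimistic scheme
comm = [edge, edge]

assignment, energy = best_assignment(comp, comm)
# Exhaustive search over all 2^L assignments, used here only as a cross-check.
brute = min(product((0, 1), repeat=3), key=lambda m: total_energy(m, comp, comm))
```

With these values, the dynamic program keeps the first layer on D and delegates the remaining two layers to N, which matches the exhaustive search.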

3.3 Data Transfer Schemes

The IoT device D, which is the first recipient of the data, serves as the master, on which the layer assignment decisions are made. If a computation block \(i \in V_G\) is delegated to N, then D transfers all the necessary data to N to initiate the execution of i. Besides deciding on which device (D or N) a given block should be executed, another important factor to be considered is when the results of computations on N should be transferred back to D. This is referred to as the data transfer scheme. In the presence of an unstable communication medium, this can have a substantial impact on the energy consumption of D, because when its WiFi port becomes unresponsive, the currently delegated computations (performed by N) must be entirely redone by D.

To simplify the description of the data transfer schemes in question, consider the execution sequence \((D_1, N_2, N_3, N_4, D_5, D_6, N_7, N_8, D_9, D_{10})\), \(D_i\) (\(N_i\)) indicating that the computation block i is executed on D (N), and let \(o_i\) denote the data outputs generated by i.

•

Conservative Scheme: This requires that the results of every computation block executed on N be transmitted immediately back to D. For the above sequence, the set of data outputs transmitted back to D would be \((o_2, o_3, o_4, o_7, o_8)\). This ensures the robust execution of the application even when N suddenly becomes unavailable for whatever reason, in which case D can use the previous results obtained from N without having to redo all of N’s computations. For example, if D fails to receive \(o_4\) in time, then it can take over starting from computation block 4, without having to recompute blocks 2 and 3.

•

Optimistic Scheme: This allows for the results to be sent from N to D only when necessary. That is, N would transmit the results of a block it is executing only if the next block is to be executed by D. For our example sequence, only the results \((o_4, o_8)\) would be transmitted to D. From an energy perspective, this is riskier, because if D fails to receive \(o_4\) in time, then D would have to recompute blocks 2 and 3 in addition to block 4.

•

Selective Scheme: Unlike the conservative and optimistic schemes, this scheme aims to take into account the future state of the communication medium. Figure 3 depicts how the selective scheme works. Based on the past RTT measurements, it predicts whether or not the network will be unstable prior to processing the current frame. Based on the outcome of that prediction, it selects either the conservative or optimistic scheme, and then determines the corresponding optimal assignment of computation blocks to be executed on D or N.

Fig. 3. The overall view of the two-level controller. The network stability prediction unit determines the data transfer scheme, and the scheduler assigns the mapping of graph nodes to devices.
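The difference between the conservative and optimistic schemes can be made concrete with a short Python sketch applied to the example execution sequence. The function name is hypothetical, and the sketch assumes that a final block executed on N always returns its output to D, since D hosts the application.

```python
def outputs_sent_back(sequence, scheme):
    """Return the 1-based block indices i whose output o_i is sent from N to D.

    sequence is a list of 'D'/'N' labels, one per computation block.
    """
    if scheme not in ("conservative", "optimistic"):
        raise ValueError("unknown scheme: " + scheme)
    sent = []
    for idx, device in enumerate(sequence):
        if device != "N":
            continue
        # Assumption: the output of a final block on N is returned to D.
        next_on_d = idx + 1 == len(sequence) or sequence[idx + 1] == "D"
        if scheme == "conservative" or next_on_d:
            sent.append(idx + 1)
    return sent

# The example sequence (D1, N2, N3, N4, D5, D6, N7, N8, D9, D10).
example = list("DNNNDDNNDD")
conservative = outputs_sent_back(example, "conservative")
optimistic = outputs_sent_back(example, "optimistic")
```

For this sequence, the conservative scheme returns five outputs (blocks 2, 3, 4, 7, and 8), whereas the optimistic scheme returns only two (blocks 4 and 8).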

For the selective scheme, a simple, lightweight criterion is used to define the instability of the WiFi link, whose behavior is modeled using a Markov chain by discretizing the distribution of the RTT values into r intervals \((I_1, I_2, \ldots , I_r)\). The WiFi link is considered to be in state k if \(RTT \in I_k\), for \(1 \le k \le r\). Given a timeout value \(\theta\), the WiFi link is deemed unstable if \(RTT \gt \theta\). It is assumed that \(RTT_{min}\le \theta \le RTT_{max}\). Let \(\kappa (\theta)\) be the index such that \(\theta \in I_\kappa\), let \(p_{k,\ell }\) denote the one-step k-to-\(\ell\) state transition probability of the WiFi link, and let \(q_k\) denote the total probability that the next state of the WiFi link will be unstable given that its present state is k. That is,

\begin{equation} p_{k,\ell }=\text{Prob}(\text{Next Frame } RTT \in I_\ell \mid \text{Current Frame } RTT \in I_k), \end{equation}

(5)

\begin{equation} q_k = \sum _{\ell = \kappa (\theta)}^{r} p_{k,\ell }. \end{equation}

(6)

Now the selective scheme depicted in Figure 3 chooses the conservative scheme if \(q_k \ge 0.5\); otherwise, it selects the optimistic scheme.
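Equations (5) and (6) and the 0.5 threshold translate into a small decision routine. The following Python sketch assumes a hypothetical three-interval transition matrix; in practice the probabilities would come from RTT profiling.

```python
def unstable_probability(P, k, kappa):
    """Equation (6): q_k is the sum of p_{k,l} for l = kappa, ..., r.

    P is a row-stochastic r x r matrix (list of lists); k and kappa are 1-based.
    """
    return sum(P[k - 1][kappa - 1:])

def select_scheme(P, k, kappa):
    """Choose the conservative scheme when q_k >= 0.5, otherwise the optimistic one."""
    if unstable_probability(P, k, kappa) >= 0.5:
        return "conservative"
    return "optimistic"

# Hypothetical transition probabilities between r = 3 RTT intervals.
P = [
    [0.80, 0.15, 0.05],
    [0.30, 0.40, 0.30],
    [0.10, 0.20, 0.70],
]
kappa = 3  # assumed: the timeout theta falls into interval I_3

scheme_from_low_rtt = select_scheme(P, k=1, kappa=kappa)
scheme_from_high_rtt = select_scheme(P, k=3, kappa=kappa)
```

With this matrix, a link currently in the lowest RTT interval is unlikely to become unstable, so the optimistic scheme is chosen; a link already in the highest interval leads to the conservative scheme.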

3.4 Discussion of Model Assumptions

Application Model: Restricting the graph model of computation to a line graph is well suited for CNNs and, as will be demonstrated in the experimental results, is a valuable special case. The presented method can be applied to other workloads in the form of a line graph with different performance profiles. An extension of the proposed solution methods to general DAGs is currently under development.

Network Delay: Our wireless communication delay model expressed by Equation (1) is a well established lumped model [4] that can account for various sources of variability such as packet drop, packet processing delays and others, captured by the random RTT term. The empirical distribution of the RTT is characterized during the profiling step and used in constructing our fine-grained solution. It was also observed that the variation in the execution time on N was not significant compared to the variation in RTT.

Objective Function: The energy consumption of D is the product of the total delay (including computation and communication delays) and the power consumption of D. Thus delay is also accounted for when considering energy as the objective function. Furthermore, with line graphs, \(E_D\) can be used to represent delays (instead of energy consumption), which will translate our original problem into delay minimization without any changes to the modeling approach or algorithms. Also, note that our energy objective function focuses exclusively on D, assuming that the limited capacity of D’s energy source is of utmost concern. If accounting for the energy consumption of N (powered by its own source) is also important, then it can be incorporated by adding related quantities to \(E_D(i|N)\), \(E_D(i|N, j|D)\), and \(E_D(i|D,j|N)\).

Choice of Experimental Platform: Two COTS devices that are popular in IoT systems (RPI and Jetson) have been selected to serve as D and N. However, the methods described here are applicable to other platforms once the profiling step is done. In the scenario where N is serving several other IoT devices, its utilization affects the execution time of layers on N as well, which in turn changes the energy cost of executing a layer on N. The solution presented here can account for such a variation using the term \(E_D(i|N)\) in the formulation.

Offloading Benefit: The benefit of offloading in terms of energy consumption depends on the overhead of communication energy and the computation energy profiles of the devices. The stochastic variable that significantly affects the communication energy is the communication delay, which in turn changes the energy-optimal partitioning. Our proposed energy optimization method is orthogonal to any other existing method for reducing the power consumption of D. Given the power profiles of the CNN layers, our approach provides the energy-optimal partitioning across devices. In the evaluation of our approach, the power consumption values for the layers of a single CNN are the same throughout different comparisons on the same CNN (e.g., comparison with All-N and All-D).

4 Fine-grained Solution

This section describes a solution to our workload reduction problem accounting for the stochastic variations in the communication delays during the execution of a CNN application. The problem at hand can be viewed as a sequential decision process whereby D decides on which device, D or N, a block will be executed. Our approach is to model the time evolution of the application execution as an MDP [36]. The proposed solution has three valuable characteristics: (1) it has \(O(1)\) complexity; (2) it is data-driven with minimal assumptions about the data, i.e., it does not depend on complex parametric models; and (3) it produces an optimal assignment of computation blocks.

The outcome of an MDP is a policy that specifies the action to take (i.e., assign the block to D or N) for each possible state of the system. Therefore, the policy is simply a table representing a function from states to actions. Once such a policy is determined, the actual step that D performs is very simple: Before the execution of a given block (respecting the partial order specified by the application dataflow graph G), D determines the present state of the system and, based on that state, consults the policy in \(O(1)\) time (table lookup) and determines the action.

Our MDP model requires a restriction: The application dataflow graph G is topologically sorted, and all nodes with the same level are combined to form a single block. Thus G is transformed into a line graph representing a sequence of blocks to be executed, where all the outputs of any given block are relayed to the next block in the sequence as inputs. With this restriction, our MDP model assigns individual blocks to either D or N. Note that in the context of CNNs, the terms block and layer become synonymous.

4.1 Markov Decision Processes

There is an extensive body of literature on MDPs, as it is a rich and well-developed subject with many variations [36]. The essential components, and one of several methods of constructing an optimal policy (namely value iteration [7]) are described next.

An MDP is represented as a tuple \((S,A,P,COST,\gamma)\) where

•

A is a finite set of actions. In our case, there are only two actions: assigning a block to either D (local execution) or N (remote execution).

•

S is a finite set of states. A state \(s \in S\) is a tuple (\(\Delta\), i, d), where i denotes the current block or CNN layer, d denotes the device on which i is executed, and \(\Delta\) denotes the interval within which the present RTT value belongs. In our model of the communication delay, the RTT is a random variable. The range of RTT is partitioned into r equilength intervals \(I_1, I_2, \ldots , I_r\). All that is being done is approximating RTT to some value in a finite discrete set. Making a discrete approximation allows us to enumerate the state space when constructing the MDP solution. Likewise, the range of the bandwidth variation can be divided into intervals and added to the state definition. The value of \(\Delta\) is obtained by measuring the network delay just prior to taking an action. The total number of states in the MDP is \(n=2rL + r\), where r and L denote the number of RTT delay intervals and the number of layers, respectively. The value of n determines the number of entries in the lookup table that is used to select the corresponding action for the current state. The total number of states includes r dummy states that represent the initial delay. The dummy states only include \(\Delta\) in their definition.

•

\(P(s_{1},a,s_{2})=Pr(s_{t+1}=s_{2}|s_{t}=s_{1},a_{t}=a)\) is a state transition probability function, i.e., the probability of transitioning from a present state \(s_1\) at decision epoch t to the next state \(s_2\) at decision epoch \(t+1\), with action a. These state change probabilities are the probabilities associated with entering a specific network delay interval from a current interval. Details on how these probabilities are obtained (using network delay profiling) are discussed in Section 6 describing our experimental setup.

•

\(COST(s_{1},a,s_{2})\) is the cost incurred after transition from state \(s_{1}\) to state \(s_{2}\) due to action a. The cost values are the sum of computation and communication energy of D for the selected action.

•

\(\gamma \in [0,1]\) is the discount factor that accounts for the difference between present and future costs.
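As a quick check of the state count \(n=2rL+r\) given in the definition of S, the following minimal Python sketch enumerates the states for the layer counts stated in Section 6.1. The tuple encoding of the dummy states (layer index 0 and no device) is a hypothetical choice made here for illustration only.

```python
# Layer counts as stated in Section 6.1.
LAYERS = {"LeNet": 6, "AlexNet": 20, "MobileNet": 16, "ResNet18": 18}


def num_states(r: int, L: int) -> int:
    """2 devices x r delay intervals x L layers, plus r dummy initial states."""
    return 2 * r * L + r


def enumerate_states(r: int, L: int):
    """List states as (delta, i, d) tuples.

    Dummy initial states are encoded as (delta, 0, None); this encoding is
    an assumption of this sketch, not the format used by the authors.
    """
    states = [(delta, 0, None) for delta in range(1, r + 1)]
    states += [
        (delta, i, d)
        for i in range(1, L + 1)
        for d in ("D", "N")
        for delta in range(1, r + 1)
    ]
    return states


# Number of lookup-table entries for r = 3, 5, 7 delay intervals.
table = {name: [num_states(r, L) for r in (3, 5, 7)] for name, L in LAYERS.items()}
for name, counts in table.items():
    print(name, counts)
```

The resulting counts can be compared against the lookup-table sizes reported for each CNN in Section 7.1.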

The purpose of using the discount factor \(\gamma\) in the MDP formulation is to compare policies over the infinite time horizon and evaluate the convergence of the MDP policy. The infinite time horizon refers to the case where the stochastic system is considered from decision epoch 0 to \(\infty\). Optimization over the infinite time horizon essentially ensures the solution optimality regardless of when the process ends. Since the total cost incurred by performing actions over an infinite number of decision epochs would be infinite, the total cost must be made finite when comparing policies. Thus, the cost at decision epoch t is multiplied by \(\gamma ^t\), which makes the distant future costs close to zero.
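The role of discounting can be made explicit with a standard geometric-series bound. The sketch below assumes that every per-epoch cost is non-negative and bounded by some constant \(C_{\max }\) (a symbol introduced here for illustration) and that \(\gamma\) is strictly less than 1:

```latex
\begin{equation*}
\sum_{t=0}^{\infty} \gamma^{t}\, COST(s_{t},a_{t},s_{t+1})
\;\le\; C_{\max} \sum_{t=0}^{\infty} \gamma^{t}
\;=\; \frac{C_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1 .
\end{equation*}
```

For example, with \(\gamma = 0.9\) the discounted total can never exceed \(10\,C_{\max }\), so two policies can always be compared by their finite discounted costs.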

Figure 4 shows a graphical representation of an MDP for a set of computation blocks (layers) connected in the form of a chain, in which the network delay falls into one of three discrete intervals. As a result, there are three initial states. In each state, an action (local execution or remote execution) is selected. For each layer from 1 to L (number of layers), the top three states correspond to the action 0 (i.e., local execution on D) and the lower three states correspond to action 1 (i.e., remote execution on N).

Fig. 4.

4.2 Constructing an Optimal Policy

In the MDP framework, after selecting an action a in each state s, the system transitions to another state with a probability P and the action will lead to a cost value COST. The goal is to find the best policy \(\pi\) that assigns an action to every state to minimize the following cost function:

\begin{equation} \sum _{t=0}^{\infty }\gamma ^tCOST(s_{t},a_{t},s_{t+1}), \end{equation}

(7)

where \(a_{t}\) denotes the action in state \(s_{t}\) in the obtained optimal policy. The MDP solution should consider the probabilities of state change and the cost values obtained in different decision epochs after selecting a specific action. In fact, a greedy action in each state does not necessarily lead to the optimal solution in the long run, and the MDP should take into consideration all the state changes and the cost values obtained in the subsequent decision epochs. Hence dynamic programming is used to implicitly consider all the possibilities.

To find the MDP solution, which is a policy that assigns the optimal action to each state, one can use either policy iteration or value iteration [7]. Our MDP problem is solved using the latter, as shown in Algorithm 1, in which \(V(s)\) denotes the cost value in state s. For all states, the algorithm calculates the optimal value of \(V(s)\) iteratively, considering all possible actions and next states \(s^{\prime }\), until the difference \(\epsilon\) in the value of \(V(s)\) between successive iterations becomes negligible (less than a small number \(\rho\), which determines the stopping condition and therefore convergence). In our case, the actions are either 0 (execute on D) or 1 (execute on N), and the associated cost values are stored in \(D[s]\) and \(N[s]\). The cost values account for the probability of state change and the discount factor \(\gamma\). Finally, the method finds the optimal policy (action) in each state s using the arg min operator.
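The procedure described above can be sketched in Python as follows. This is a generic cost-minimizing value iteration written for illustration, not a transcription of Algorithm 1: the transition data structure, the in-place update order, and the default values of `gamma` and `rho` are assumptions of this sketch.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
# P[s][a] is a list of (next_state, probability, cost) triples (assumed layout).
Transitions = Dict[State, Dict[int, List[Tuple[State, float, float]]]]


def value_iteration(P: Transitions, gamma: float = 0.9, rho: float = 1e-9):
    """Return (V, policy) minimizing the expected discounted cost."""
    V = {s: 0.0 for s in P}
    # States that only appear as successors are treated as terminal (V = 0).
    for actions in P.values():
        for outcomes in actions.values():
            for s2, _, _ in outcomes:
                V.setdefault(s2, 0.0)

    def q(s, a):
        # Expected cost of taking action a in state s, then following V.
        return sum(p * (c + gamma * V[s2]) for s2, p, c in P[s][a])

    while True:
        eps = 0.0  # largest change of V(s) in this sweep
        for s in P:
            best = min(q(s, a) for a in P[s])
            eps = max(eps, abs(best - V[s]))
            V[s] = best
        if eps < rho:  # stopping condition
            break

    policy = {s: min(P[s], key=lambda a: q(s, a)) for s in P}
    return V, policy


# Toy example: action 0 (local) always costs 5; action 1 (remote) costs
# 1 or 6 with equal probability, depending on the delay that is realized.
toy = {"s": {0: [("end", 1.0, 5.0)], 1: [("end", 0.5, 1.0), ("end", 0.5, 6.0)]}}
V, policy = value_iteration(toy)
print(V["s"], policy["s"])
```

In the toy example the remote action has the lower expected cost even though its worst case is more expensive than local execution, which is the kind of trade-off a purely greedy rule based on a single delay sample can miss.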

To summarize, our fine-grained MDP approach involves the following steps:

•

Profiling: The behavior of the system is profiled using a sufficient number of repeated experiments. The probability of state change and the cost function for different actions are obtained. These values are then formatted as an MDP grammar file.

•

Solving the MDP problem: After constructing the MDP problem including the probabilities and cost function, the MDP is solved and the optimal policy for selecting the suitable action is stored in the lookup table.

•

Using the lookup table: During the runtime, the lookup table is consulted to select the suitable action depending on the current state. An MDP evaluator tests next states to find the incurred cost values on average.

Complexity of Decision: Once the optimal policy is determined offline using the cost values obtained by profiling, the optimal action in each state is simply determined by a table lookup, which is of \(O(1)\) time complexity.
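The constant-time decision step can be illustrated with a dictionary lookup keyed by the state tuple. In this sketch the function names, the dictionary-based table, and the default RTT range (6 to 129 ms, the extremes reported for the measurements in Section 7.1) are assumptions made for illustration.

```python
def rtt_interval(rtt_ms: float, rtt_min: float, rtt_max: float, r: int) -> int:
    """Map a measured RTT to the index (1..r) of its equal-length interval."""
    width = (rtt_max - rtt_min) / r
    idx = int((rtt_ms - rtt_min) // width) + 1
    return min(max(idx, 1), r)  # clamp measurements outside the profiled range


def next_action(policy: dict, rtt_ms: float, layer: int, device: str,
                rtt_min: float = 6.0, rtt_max: float = 129.0, r: int = 7) -> int:
    """Look up the action (0 = run on D, 1 = run on N) for the current state."""
    delta = rtt_interval(rtt_ms, rtt_min, rtt_max, r)
    return policy[(delta, layer, device)]  # single hash lookup, O(1) on average


# Hypothetical one-entry policy table, for demonstration only.
demo_policy = {(2, 1, "D"): 1}
action = next_action(demo_policy, rtt_ms=50.0, layer=1, device="D", r=3)
print(action)
```

A flat array indexed by an integer encoding of the state would serve equally well; the dictionary is used here only to keep the state tuple visible.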

Reduction to Coarse-grained Solution: In the special case of a fixed network delay (i.e., when the RTT does not vary across multiple delay intervals), the optimal MDP solution reduces to a fixed assignment of all individual layers, which can be computed in advance at the beginning of the frame (i.e., before starting the CNN application execution). In this case, the probability of state change is one for all the state transitions (\(P(s,a,s^{\prime })=1\)), with \(\gamma =1\). Additionally, the stopping condition (\(\delta \lt \rho\)) still holds in the fixed delay case. Thus, the recursive relationship between \(V(s)\) and \(V(s^{\prime })\) (which are the respective cost values in the current state s and the next state \(s^{\prime }\)) given by \(V(s)=\min (\sum _{s^{\prime }}P(s,0,s^{\prime })(COST(s,0,s^{\prime })+\gamma V(s^{\prime })),\sum _{s^{\prime }}P(s,1,s^{\prime })(COST(s,1,s^{\prime })+\gamma V(s^{\prime })))\) reduces to \(V(s) = \min (\sum _{s^{\prime }}(COST(s,0,s^{\prime })+V(s^{\prime })), \sum _{s^{\prime }}(COST(s,1,s^{\prime }) + V(s^{\prime })))\), where each sum now contains only the single next state reached under the chosen action. In other words, the value iteration procedure becomes much simpler, and the following key points are to be noted:

•

In the case of a fixed RTT, it is no longer necessary to perform the offline profiling of the MDP state transition probabilities. Instead, the network delay measured at the beginning of the frame is assumed to be fixed for the duration of that frame. Such a coarse-grained model still allows for the RTT changes from frame to frame (but not within a frame from layer to layer, which is handled by our fine-grained MDP model).

•

In the coarse-grained case, Algorithm 1 (executed offline to generate the lookup policy table in the fine-grained case) can be replaced by a low-complexity algorithm executed at the beginning of a frame at the runtime, given the measured RTT whose value affects the communication energy costs. The details of this algorithm are presented in the next section.

5 Coarse-grained Solution

The assumption that the RTT remains constant during the processing of a frame leads to the following simplification of the MDP formulation.

Under the assumption of a fixed RTT, wireless transmissions and receptions are modeled as graph nodes shown in Figure 5. The costs of computation and communication are associated with the edges of the graph. Before executing a block or layer, the input data of that block needs to be received. After receiving the data, there is a cost associated with running the block on D or N, which is depicted in the model as \(E_D(i|D)\) or \(E_D(i|N)\) (shown inside the box of each layer). The communication costs between different layers can be one of the four cases shown in the graph. For instance, one case of the previously mentioned possible cost values is \(E_D(i|D,i+1|N)\) when the layer i is executed on D and layer \(i+1\) on N.

Fig. 5.

Two special nodes, labeled S (source) and T (sink) are also introduced, such that S transmits the input data to be processed in the first layer, while T receives the result of the last layer. The cost of S- and T-connected edges is set to zero; however, \(E_D(1|N)\) and \(E_D(L|N)\) are adjusted accordingly, to account for the initial D-to-N and final N-to-D communication costs.

Given the graphical model depicted in Figure 5, the EWR problem with the deterministic communication delay can be solved using Dijkstra’s algorithm. The time complexity of finding a shortest path solution would be \(O(|V_{G}|^2)\) when a graph is represented by its adjacency matrix. However, the special structure of the underlying graph allows for a much simpler method described below.

First, block arrays \(\textsf {D}[i]\) and \(\textsf {N}[i]\) are introduced representing computation within a given block i. Each \(\textsf {D}[i]\) has a Cost attribute, equal to the sum of \(E_D(i|D)\) and the minimum accumulated cost of reaching the block i from S (source) when the block is scheduled to be executed on D. Similarly, each \(\textsf {N}[i]\) has a Cost attribute, equal to the sum of \(E_D(i|N)\) and the minimum accumulated cost of reaching the block i when the block is scheduled to be executed on N.

To update \(\textsf {D}[i+1].{\rm C}{\rm\small{OST}}\), only three quantities are required: \(E_D(i+1|D)\), \(E_D(i|D,i+1|D)\) summed with \(\textsf {D}[i].{\rm C}{\rm\small{OST}}\), and \(E_D(i|N,i+1|D)\) summed with \(\textsf {N}[i].{\rm C}{\rm\small{OST}}\). Then,

\begin{equation} \begin{split} \textsf {D}[i+1].{\rm C}{\rm\small{OST}} = E_D(i+1|D) +\min (\textsf {D}[i].{\rm C}{\rm\small{OST}} + E_D(i|D, i+1|D), \textsf {N}[i].{\rm C}{\rm\small{OST}} + E_D(i|N, i+1|D)). \end{split} \end{equation}

(8)

Similarly,

\begin{equation} \begin{split} \textsf {N}[i+1].{\rm C}{\rm\small{OST}} = E_D(i+1|N) + \min (\textsf {D}[i].{\rm C}{\rm\small{OST}} + E_D(i|D, i+1|N), \textsf {N}[i].{\rm C}{\rm\small{OST}} + E_D(i|N, i+1|N)). \end{split} \end{equation}

(9)

Algorithm 2 shows the resulting partitioning method. After initialization steps 1–5, it performs \(L-1\) iterations: steps 7–9 implement Equation (8), while steps 10–12 implement Equation (9). Upon completion of this loop, step 14 identifies the end point (\(i = L\)) of the minimum-cost path, and step 16 recovers the optimal mapping sequence by tracing the Pred pointers (recorded predecessors) back to the starting layer (\(i = 1\)). During this traceback, the 0-1 Type indicators yield the sought bit-values of the assignment vector \(\textsf {m}[i]\). The space/time complexity of our algorithm is \(\Theta (L)\).
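The recurrences in Equations (8) and (9), together with the traceback, can be sketched in Python as follows. This is an illustrative reimplementation under assumed input conventions (0-indexed blocks, device 0 for D and 1 for N, and transfer costs stored per block boundary), not the authors' code; as described above for the S and T nodes, the first and last entries of `comp_N` are assumed to already include the initial and final transfer costs.

```python
def coarse_grained_partition(comp_D, comp_N, comm):
    """Linear-time layer assignment for a fixed RTT.

    comp_D[i], comp_N[i]: energy of D when block i runs on D or on N.
    comm[i][(a, b)]: energy of D for the transfer between block i on
        device a and block i+1 on device b (0 = D, 1 = N).
    Returns (minimum total energy of D, assignment list m).
    """
    L = len(comp_D)
    cost = [[comp_D[0], comp_N[0]]]  # cost[i][d]: best accumulated cost
    pred = [[None, None]]            # pred[i][d]: device chosen for block i-1
    for i in range(L - 1):
        row_cost, row_pred = [], []
        for b, comp in ((0, comp_D[i + 1]), (1, comp_N[i + 1])):
            via_D = cost[i][0] + comm[i][(0, b)]
            via_N = cost[i][1] + comm[i][(1, b)]
            if via_D <= via_N:
                row_cost.append(comp + via_D)
                row_pred.append(0)
            else:
                row_cost.append(comp + via_N)
                row_pred.append(1)
        cost.append(row_cost)
        pred.append(row_pred)

    # Pick the cheaper end point, then follow predecessors back to block 0.
    d = 0 if cost[-1][0] <= cost[-1][1] else 1
    total = cost[-1][d]
    m = [d]
    for i in range(L - 1, 0, -1):
        d = pred[i][d]
        m.append(d)
    m.reverse()
    return total, m


# Toy costs (made up): the middle block is expensive on D and cheap on N.
hop = {(0, 0): 0, (0, 1): 2, (1, 0): 2, (1, 1): 0}
total, m = coarse_grained_partition([1, 10, 1], [4, 1, 5], [hop, hop])
print(total, m)
```

Each block is visited once and only two running costs are kept per block, which is where the linear space and time bound comes from.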

6 Setup of Experiments

This section discusses the CNN applications and experimental devices in use, as well as their computation and communication profiling details.

6.1 CNN Benchmarks

Four well-known CNNs were used as test cases: LeNet [27], AlexNet [25], MobileNet [20], and ResNet18 [19], having 6, 20, 16, and 18 layers, respectively. LeNet was tested on the MNIST dataset and the others on the ImageNet dataset using images of size \(224\times 224\times 3\) [11]. The layers in the CNNs are treated as the computational blocks to be offloaded, as they correspond to the most time-consuming code portions.

The amount of data to be transferred between D and N, in conjunction with the stochastic RTT, determines the energy costs associated with communication and affects the optimal assignment solution. Figure 6 shows the distribution of the number of output data bytes per layer for the four CNNs. The decrease in the amount of data in the final layers provides an incentive to offload those layers to the nearby accelerator. Note that LeNet is the least expensive in terms of layer-to-layer data movement, while MobileNet is the most expensive.

Fig. 6.

The following example illustrates the impact of uncertainty in RTT, given in Equation (1). Suppose layers 1 and 2 of MobileNet are executed on D and N, respectively, thus requiring a D-to-N data transfer of size \(S = (401408\times 8)/1024=3136\) Kb. Assuming the bandwidth \(B = 12\) Mbps, the communication delay of \(\delta = 255~\text{ms} + RTT\) is obtained. The range of RTT values from 7 to 130 ms (see Figure 2 for example) will result in the variation in \(\delta\) between 3% and 51%. Similarly, if layers 10 and 11 of ResNet18 are executed on D and N, respectively, then the corresponding communication delay may range from \(32 (S/B) + 7 (RTT)\) ms to \(32 (S/B) + 130 (RTT)\) ms. Clearly, the stochastic RTT contribution must be taken into account, which is the key motivation behind this work.
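The arithmetic of the MobileNet example can be reproduced with a few lines of Python. The sketch assumes that the delay has the form \(\delta = S/B + RTT\), as the example suggests, and that 1 Mb is taken as 1024 Kb, which is the convention that yields the 255 ms figure quoted above.

```python
def transfer_time_ms(size_bytes: int, bandwidth_mbps: float) -> float:
    """Time to push size_bytes through the link, ignoring the RTT."""
    size_kb = size_bytes * 8 / 1024              # kilobits
    return size_kb / (bandwidth_mbps * 1024) * 1000


def comm_delay_ms(size_bytes: int, bandwidth_mbps: float, rtt_ms: float) -> float:
    """Assumed delay model: delta = S / B + RTT."""
    return transfer_time_ms(size_bytes, bandwidth_mbps) + rtt_ms


transfer = transfer_time_ms(401408, 12)          # output of MobileNet layer 1
for rtt in (7, 130):
    share = 100 * rtt / transfer                 # RTT relative to S / B
    print(f"RTT={rtt} ms: delta={comm_delay_ms(401408, 12, rtt):.1f} ms, "
          f"RTT adds {share:.0f}% on top of the transfer time")
```

Relative to the fixed transfer time, the RTT contribution spans roughly 3% to 51%, matching the variation quoted in the example.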

6.2 Computation and Communication Profiling

To demonstrate the value of the proposed approach, the RPI [2] was selected as the IoT device D and the NVIDIA Jetson TX2 (Jetson) [1] as the nearby accelerator N. The devices were connected using a UniFi AP AC LITE wireless access point [3]. The following sections contain a description of how computation (performance and power consumption) and communication (network delay) profiling was done.

6.2.1 Performance Profiling.

The performance of each layer in different CNNs was individually profiled on both D and N. This involved executing each layer several times and measuring the execution time using built-in time functions in Python. Figure 7 shows the ratio of the execution times on D to those on N for the different CNNs.
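A per-layer timing harness of the kind described above might look like the following sketch. The use of `time.perf_counter`, the warm-up run, and the default repetition count are assumptions made here; the text only states that built-in Python time functions were used.

```python
import statistics
import time


def profile_layer(layer_fn, x, repeats: int = 10) -> float:
    """Return the mean wall-clock time (in seconds) of layer_fn(x)."""
    layer_fn(x)  # warm-up run, excluded from the measurements (assumption)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        layer_fn(x)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)


# Stand-in "layer" for demonstration: a plain Python reduction.
mean_s = profile_layer(lambda v: sum(i * i for i in v), list(range(10_000)), repeats=5)
print(f"{mean_s * 1e3:.3f} ms per call")
```

When a layer runs on a GPU, the device work generally has to be synchronized before the clock is read, since kernel launches are typically asynchronous; that step is omitted from this CPU-only sketch.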

Fig. 7.

The convolution layers were the most time-consuming on both devices. For LeNet, the Python numpy package was used; this package did not utilize the GPU available on N. However, PyTorch [34] was used for profiling AlexNet, MobileNet, and ResNet18, which utilized the GPU on N. This explains the differences in relative speedup of these CNN layers on N versus D. For instance, for LeNet, a speedup of 3× was obtained using the Jetson (N) over the RPI (D), whereas the speedup for the others ranged from two to five orders of magnitude due to the use of GPUs. Note that for fully connected layers (layers 15 of AlexNet, 16 of MobileNet, and 18 of ResNet18), the RPI was faster than the Jetson due to the lower overhead of loading the weights into main memory. Another important observation is that the execution times of MobileNet layers are much higher than for the other CNNs. Consequently, the network delay becomes less significant when offloading MobileNet layers in comparison to the other CNNs.

6.2.2 Energy Profiling.

The energy consumption of D (i.e., the product of power consumption and delay), including computation and communication energy, was measured during the profiling step. The average power consumption of running CNNs on D was measured for individual layers of each CNN benchmark. The measurement was done using a National Instruments PXI digital acquisition unit (Figure 8). The total current drawn by the RPI was measured by placing a 0.1\(\Omega\) resistor along the supply path. The current obtained was multiplied by the supply voltage to obtain the power consumption. The power consumption of running ResNet and MobileNet was higher than that of AlexNet and LeNet. This was verified by checking the CPU utilization of the RPI while running the CNNs: Figure 9 shows higher utilization for MobileNet and ResNet compared with AlexNet and LeNet. Figure 10 shows the energy consumption of all the layers in the CNNs. The highest total energy consumption was that of MobileNet. The communication energy is the product of the communication power and the data transfer time. The communication energy profile was used in the layer partitioning decisions.
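The conversion from shunt-resistor readings to energy can be sketched as below. The 0.1 Ω value comes from the text; the 5 V supply voltage, the sampling interval, and the sample values are hypothetical placeholders chosen only to make the example runnable.

```python
def energy_joules(v_shunt_samples, v_supply: float, r_shunt: float, dt: float) -> float:
    """Integrate P = V_supply * I over time, where I = V_shunt / R_shunt.

    v_shunt_samples: voltage drops across the shunt (volts), one per sample.
    dt: sampling interval in seconds.
    """
    return sum(v_supply * (v / r_shunt) * dt for v in v_shunt_samples)


# Hypothetical trace: a constant 50 mV drop (0.5 A) sampled every 1 ms for 1 s.
samples = [0.05] * 1000
e = energy_joules(samples, v_supply=5.0, r_shunt=0.1, dt=0.001)
print(f"{e:.3f} J")
```

Per-layer energy then follows by integrating only over the samples recorded while that layer was executing.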

Fig. 8.

Fig. 9.

Fig. 10.

6.2.3 Network Delay Profiling.

For the MDP solution, the state transition probabilities associated with the communication network were estimated by repeated measurements of the RTT and assigning the values to the appropriate network delay interval. This yields the discrete probability distribution of the network delay. Estimates of state transition probabilities were obtained by counting the frequency with which interval \(I_j\) was observed following interval \(I_i\), over a sequence of measurements.
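The frequency-counting estimate described above can be sketched as follows. Intervals are 0-indexed here, and the uniform fallback for an interval that never appears as a source is an assumption of this sketch rather than something stated in the text.

```python
def estimate_transitions(rtts_ms, rtt_min: float, rtt_max: float, r: int):
    """Estimate P[i][j] = Pr(next interval = j | current interval = i)."""
    width = (rtt_max - rtt_min) / r

    def interval(x: float) -> int:
        return min(max(int((x - rtt_min) // width), 0), r - 1)

    idx = [interval(x) for x in rtts_ms]
    counts = [[0] * r for _ in range(r)]
    for a, b in zip(idx, idx[1:]):  # consecutive measurement pairs
        counts[a][b] += 1

    P = []
    for row in counts:
        total = sum(row)
        # Unobserved source interval: fall back to a uniform row (assumption).
        P.append([c / total if total else 1.0 / r for c in row])
    return P


# Made-up RTT trace (ms) with r = 3 intervals over the range 6 to 129 ms.
P = estimate_transitions([10, 20, 100, 110, 15, 95], 6.0, 129.0, 3)
for row in P:
    print([round(p, 3) for p in row])
```

Each row of the resulting matrix sums to one, so it can be used directly as the delay-transition component of the MDP.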

7 Experimental Results

This section presents the computation offloading (i.e., IoT device workload reduction) results obtained using (1) our fine-grained MDP approach in the stochastic network delay case and (2) our coarse-grained dynamic programming procedure in the fixed network delay case. It should be noted that, in both cases, the corresponding assignments of individual CNN layers to either D or N are optimal solutions to the energy minimization problem stated in Section 3.

7.1 Fine-grained Solution

To build the MDP model, the system must be profiled first. This includes taking repeated direct measurements of the network delay and constructing the state transition probabilities mentioned in the MDP formulation (Section 4). The cost functions were also calculated based on the profiling of both devices. The states, probabilities, and cost functions were then converted into the MDP grammar format. To evaluate the convergence of the method to the optimal solution in an infinite time horizon, discount factor \(\gamma\) for the MDP was set to 0.9. The public domain framework called APPL [26, 33] was used to solve the MDP problem with the given cost functions and state change probabilities. Then, the policy was computed and expressed in the form of a lookup table specifying the suitable action in each state. The number of entries in the lookup table (as shown in Table 1) was at most 287 for AlexNet when the network delay distribution was partitioned into seven network delay intervals. In other words, the overhead of looking up the state and action pair in the lookup table is negligible.

Table 1.

	Number of network delay intervals (r)
CNN	r = 3	r = 5	r = 7
LeNet	39	65	91
AlexNet	123	205	287
MobileNet	99	165	231
ResNet18	111	185	259

Table 1. The Number of Entries in the MDP Lookup Table (i.e., Number of States)

The overhead of looking up the table is negligible.

The MDP lookup table needs to be generated only once if the communication model is fixed. Several lookup tables can be generated for different communication models and used in different scenarios. The overhead of generating the MDP lookup table for \(r=7\) on a Core-i7 3612 CPU with 8 GB RAM is reported in Table 2. The measured overheads demonstrate that the MDP can be re-trained at runtime when the profiled communication model deviates from the actual network behavior.

Table 2.

CNN	Overhead (ms)
AlexNet	713
LeNet	40
MobileNet	258
ResNet	343

Table 2. The Overhead of Computing the MDP Lookup Table for \(r=7\)

Table 3 shows the results of applying the MDP to different CNNs under the conservative data transfer scheme (see Section 3). These numbers were obtained after running the experiments 200K times. The table shows that using more network delay intervals in the MDP formulation leads to lower energy consumption of D. This clearly demonstrates the importance of accounting for the variations in the communication delays. The column labeled \(r=1\) corresponds to using the midpoint of the entire network delay range (\((RTT_{min}+RTT_{max})/2\)), which is the coarse-grained special case of the fixed \(RTT = 67.925\) ms. Using more than seven intervals did not result in any further decrease in the energy consumption. Comparing the results of \(r=1\) with \(r=7\), the energy consumption of D was reduced by approximately 41% for LeNet, 25% for MobileNet, 14% for AlexNet, and 22% for ResNet18. The average reduction in energy consumption was 25.5%. In Section 7.2, additional coarse-grained special cases of different fixed values of the RTT equal to 10, 20, 40, 80, and 120 ms are discussed.

Table 3.

	Number of network delay intervals (r)
CNN	\(r=1\)	\(r = 3\)	\(r = 5\)	\(r= 7\)
LeNet	0.708 J	0.468 J	0.432 J	0.416 J
MobileNet	8.68 J	6.876 J	6.628 J	6.515 J
AlexNet	2.077 J	1.855 J	1.798 J	1.778 J
ResNet18	6.587 J	5.574 J	5.247 J	5.099 J

Table 3. Fine-grained Solution: Average Energy Consumption of D for 200K Experiments Using Different Number of Network Delay Intervals (Conservative Scheme)
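The relative savings of \(r=7\) over \(r=1\) can be recomputed directly from the entries of Table 3, as in the short sketch below. The dictionary layout and variable names are illustrative choices.

```python
# (r = 1, r = 7) energy of D in joules, copied from Table 3.
table3 = {
    "LeNet": (0.708, 0.416),
    "MobileNet": (8.68, 6.515),
    "AlexNet": (2.077, 1.778),
    "ResNet18": (6.587, 5.099),
}

# Percentage reduction relative to the single-interval (r = 1) case.
reduction = {name: 100 * (e1 - e7) / e1 for name, (e1, e7) in table3.items()}
mean_reduction = sum(reduction.values()) / len(reduction)

for name, pct in reduction.items():
    print(f"{name}: {pct:.1f}%")
print(f"mean: {mean_reduction:.1f}%")
```

Computed from the unrounded table entries, the mean comes out at about 25.8%, within rounding of the average quoted in the text.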

Using \(r=7\) delay intervals resulted in the most energy savings for D. For this case, some CNN layers were always assigned to N (Jetson), while some other layers were never assigned to N (see Figure 11). This shows the potential for model order reduction in the context of our MDP framework, which may in turn give rise to efficient online algorithms incorporating both fine-grained (stochastic) and coarse-grained (deterministic) optimization approaches.

Fig. 11.

Table 4 shows the results of simulating all three data transfer schemes repeatedly for 1,000 trials. The number of the network delay intervals was \(r=7\). For the selective scheme, the value of the timeout was \(\theta = 60\) ms and the simulations of the selective scheme included the presence of failures, whereas the other two were assumed to be ideal. The value of \(\theta\) is different from \(RTT=67.925\) ms, which was used for comparing the MDP solution in the case of \(r=1\). The results clearly show that even when failures are present, the selective scheme results in much less energy consumption by D when compared to the conservative scheme and nearly the same as the (ideal) optimistic scheme. For the CNNs that have a larger amount of data per layer (MobileNet and ResNet18), the conservative scheme involves a significant amount of energy consumption for sending back the results after the execution of each layer. Thus, there is greater potential for energy improvement using the selective approach for these CNNs.

Table 4.

CNN	Optimistic (\(\mu ,\sigma\))	Conservative (\(\mu ,\sigma\))	Selective (\(\mu ,\sigma\))
LeNet	0.31, 0.02	0.41, 0.04	0.31, 0.05
MobileNet	0.50, 0.04	6.51, 0.24	0.65, 1.13
AlexNet	0.76, 0.02	1.77, 0.05	0.78, 0.13
ResNet18	0.53, 0.04	5.09, 0.23	0.60, 0.48

Table 4. Fine-grained Solution: Mean (\(\mu\)) and Standard Deviation (\(\sigma\)) of Energy Consumption of D (in Joules) for Different CNNs (1,000 Trials)

MDP Sensitivity Analysis: In this section, the effect of the MDP model noise on the energy values is evaluated.

•

Let P be the empirical probability function based on n observations of the round-trip delay \(RTT_1,\ldots ,RTT_n\). Let \(\pi _P\) be the optimal MDP policy based on the probability function P.

•

Let \(P^{\prime }\) be the probability function based on n other observations of the round-trip delay \(RTT^{\prime }_1,\ldots ,RTT^{\prime }_n\) that might be observed when the system is operating. \(P^{\prime }\) can be considered as the noisy model.

•

We conducted 200K experiments, in which we generated random round-trip delay values \(RTT^{\prime }_1,\ldots ,RTT^{\prime }_n\), and applied the policy \(\pi _P\) with these delay values. The net result was that the energy values when applying \(\pi _P\) using \(RTT^{\prime }_1,\ldots ,RTT^{\prime }_n\) differed at most by 5% compared to the case when \(\pi _P\) was applied using P, which was based on \(RTT_1,\ldots ,RTT_n\). Below are the details of the experiment.

To perform the sensitivity analysis, we added noise to the measured values of RTT. This noise was considered to have a normal distribution \(N(0,\sigma ^2)\). In our experiments, the minimum and maximum measured RTT values were 6 and 129 ms, respectively. We selected the value \(\sigma =10\) for the distribution of the noise. We calculated the probability function \(P^{\prime }\) based on the data with added noise. Table 5 shows the result of applying the same policy to the delay values from both P and \(P^{\prime }\). As can be seen, the difference in the average energy consumption over 200K trials is at most 5%.

Table 5.

CNN	\(\pi _P\) on probability function P	\(\pi _P\) on noisy probability function \(P^{\prime }\)	Difference
LeNet	0.416 J	0.436 J	5%
AlexNet	1.778 J	1.817 J	2%
ResNet18	5.099 J	5.345 J	5%
MobileNet	6.515 J	6.719 J	3%

Table 5. Average Energy Consumption for 200K Trials Using the Same Policy for the Delay Values from the Probability Functions P and \(P^{\prime }\)

The 95% confidence intervals are also reported for the different CNNs in Table 6. The confidence interval is \([\mu -1.96\times \sigma /\sqrt{n_{sample}},\mu +1.96\times \sigma /\sqrt{n_{sample}}]\), where \(\mu\) and \(\sigma\) are the sample mean and standard deviation and \(n_{sample}\) is the number of samples. These values indicate the range within which the mean discounted cost is expected to lie with 95% confidence. As can be seen, the obtained average discounted cost value also falls in this range.
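For reference, the sketch below computes a 95% interval using the standard normal-approximation form for a sample mean, in which the half-width is 1.96 times the standard error \(\sigma /\sqrt{n}\). The sample values are made up for illustration and are not the discounted costs from the experiments.

```python
import math
import statistics


def ci95(samples):
    """Normal-approximation 95% confidence interval for the mean."""
    n = len(samples)
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)      # sample standard deviation
    half = 1.96 * sigma / math.sqrt(n)     # 1.96 standard errors
    return mu - half, mu + half


lo, hi = ci95([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"[{lo:.3f}, {hi:.3f}]")
```

The interval narrows in proportion to the square root of the number of samples, so a large number of trials gives a tight interval around the mean cost.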

Table 6.

It is worth mentioning that these cost values are not comparable with the total energy consumption values, since they are multiplied by the discount factor at every decision epoch to evaluate the MDP solution over the infinite horizon.

Varying Bandwidth: Another set of experiments was also conducted treating the bandwidth as a random variable, sampling it at the beginning of the execution of every layer.

The bandwidth was modeled as a Gaussian random variable \(N(1,0.2)\). The mean \(\mu =1\) was the midpoint of the range of values (\([0.4,1.6]\) MBps), whereas the standard deviation was taken to be 0.2 to cover the range of variations. The MDP state is now described by \((\Delta ,BW,i,d)\); similarly to the RTT, the bandwidth range is divided into several intervals, and the parameter BW is the interval to which the observed bandwidth value belongs.
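A per-layer bandwidth sample and its discretization could be generated as in the following sketch. The mean, standard deviation, and range are taken from the text; clipping samples to the range, the number of intervals `k`, and the fixed random seed are assumptions made here for illustration.

```python
import random


def sample_bandwidth(rng: random.Random, mean: float = 1.0, std: float = 0.2,
                     lo: float = 0.4, hi: float = 1.6) -> float:
    """Draw a bandwidth (MBps) from N(mean, std), clipped to [lo, hi]."""
    return min(max(rng.gauss(mean, std), lo), hi)


def bw_interval(bw: float, lo: float = 0.4, hi: float = 1.6, k: int = 3) -> int:
    """Index (0..k-1) of the equal-length interval containing bw."""
    width = (hi - lo) / k
    return min(max(int((bw - lo) // width), 0), k - 1)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_bandwidth(rng) for _ in range(1000)]
hist = [0, 0, 0]
for bw in draws:
    hist[bw_interval(bw)] += 1
print(hist)
```

Since the range spans three standard deviations on either side of the mean, nearly all unclipped draws already fall inside it, and most of them land in the middle interval.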

Figure 12 shows the heatmap of the layer-by-layer execution of the CNNs, assuming varying bandwidth. The figure shows that the proposed solution accounts for these variations compared with Figure 11, where the bandwidth was fixed. Table 7 shows the corresponding average values of the energy consumption of each CNN. These data show that the energy values in the case of varying bandwidth are higher compared with the fixed bandwidth case, as expected. This difference is largest for MobileNet, since it has the largest amount of output data.

Fig. 12.

Table 7.

CNN	Energy (Varying bandwidth) (J)	Energy (Fixed bandwidth) (J)
AlexNet	1.86	1.778
MobileNet	8.727	6.515
ResNet	5.832	5.099

Table 7. The Average Energy Consumption for 200k Trials Considering the Bandwidth Variation

7.2 Coarse-grained Solution

The coarse-grained solutions for the case of fixed \(RTT = 40\) ms are listed in Tables 8 and 9. Column CGDP shows the results of executing Algorithm 2 (optimal layer partitioning between D and N), while columns All-RPI and All-Jetson refer to running the entire application either on D or on N, respectively.

Table 8.

	Conservative			Optimistic
CNN	CGDP	All-Jetson	Ratio	CGDP	All-Jetson	Ratio	All-RPI
LeNet	0.54 J	0.63 J	1.17	0.34 J	0.34 J	1.0	0.95 J
MobileNet	7.40 J	9.1 J	1.23	0.64 J	1.96 J	3.06	20.69 J
AlexNet	1.96 J	3.10 J	1.58	0.89 J	0.89 J	1.0	2.17 J
ResNet18	5.92 J	7.84 J	1.32	0.68 J	2.07 J	3.04	6.59 J

Table 8. Coarse-grained Solution: Energy Consumption of D (in Joules) for Different CNNs for the Fixed Network Delay (RTT) of 40 ms

The ratio of All-Jetson to CGDP is shown in the Ratio columns.

Table 9.

	Conservative			Optimistic
CNN	CGDP	All-Jetson	Ratio	CGDP	All-Jetson	Ratio	All-RPI
LeNet	357 ms	421 ms	1.18	225 ms	225 ms	1.0	528 ms
MobileNet	2,291 ms	2,783 ms	1.21	199 ms	600 ms	3.01	5,371 ms
AlexNet	1,228 ms	2,073 ms	1.69	556 ms	598 ms	1.07	1,319 ms
ResNet18	1,617 ms	2,397 ms	1.48	208 ms	632 ms	3.04	1,586 ms

Table 9. Coarse-grained Solution: Completion Time (in ms) for Different CNNs for the Fixed Network Delay (RTT) of 40 ms

The ratio of All-Jetson to CGDP is shown in the Ratio columns.

It is worthwhile to note that, for all CNNs, the CGDP solution consumes no more energy than the All-RPI and All-Jetson scenarios. It also completes no later than either alternative, with the exception of ResNet18 under the conservative scheme, where All-RPI finishes slightly earlier (1,586 ms versus 1,617 ms in Table 9). MobileNet has the highest energy consumption among the networks, while LeNet has the lowest. For LeNet and MobileNet, the All-Jetson scenario results in lower energy consumption than the All-RPI scenario under the conservative data transfer scheme. Due to the larger number of layers in AlexNet and ResNet18 and the higher overhead of sending the output data of each layer back to the IoT device, All-RPI outperforms All-Jetson for these two CNNs. All-RPI shows significantly higher energy consumption in the case of MobileNet because the execution time on the RPI is significantly higher. Under the conservative scheme, an average improvement of 31% and 23% in energy consumption was achieved compared with All-RPI and All-Jetson, respectively.
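The two average improvements can be recomputed from the energy values in Table 8. The sketch below uses the conservative-scheme columns, which is an inference from the numbers rather than something stated explicitly in the text; the variable names are illustrative.

```python
# Energy of D in joules from Table 8: (CGDP, All-Jetson, All-RPI),
# using the conservative-scheme columns for CGDP and All-Jetson.
table8 = {
    "LeNet": (0.54, 0.63, 0.95),
    "MobileNet": (7.40, 9.1, 20.69),
    "AlexNet": (1.96, 3.10, 2.17),
    "ResNet18": (5.92, 7.84, 6.59),
}


def mean_saving(baseline_index: int) -> float:
    """Average percentage saving of CGDP over the chosen baseline column."""
    savings = [100 * (1 - row[0] / row[baseline_index]) for row in table8.values()]
    return sum(savings) / len(savings)


vs_rpi = mean_saving(2)
vs_jetson = mean_saving(1)
print(f"vs All-RPI: {vs_rpi:.1f}%, vs All-Jetson: {vs_jetson:.1f}%")
```

The computed means are approximately 31.8% and 23.6%, which agree with the reported 31% and 23% once the fractional parts are dropped.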

When the intermediate results are not sent back from N to D after every layer (optimistic scheme), the energy consumption of D for LeNet, MobileNet, AlexNet, and ResNet was significantly lower than the optimal CGDP solution under the conservative scheme. However, the drawback of using the optimistic scheme is that any network connection failures will entail re-computation of lost data (consuming extra energy and time).

Figure 13 shows the energy-optimal offloading solutions for different fixed RTT values. For LeNet, with the RTT of 10 ms, the first four layers were executed on N and the last two layers were computed on D. This is due to the fact that the computation time of last two layers in LeNet was much shorter than the communication delay. For larger values of RTT, the tendency to perform the computations on D increases. The offloading solution in LeNet is the same for three cases of delay (i.e., 20, 40, and 80 ms). In the case of 120 ms, all layers are scheduled to be executed on D.

Fig. 13.

For AlexNet, the optimal partitioning solution is different for all five values of RTT. As expected, the tendency is to run more layers on D for larger values of RTT. With a network delay value of 120 ms, all the layers are executed on D. Also, it is worth mentioning that the first three layers are executed on D for all RTT values. This is because the overhead of sending data to N is greater than the cost of executing those layers on D itself.

In MobileNet, the offloading solution does not change significantly as the RTT changes. The first convolution layer in MobileNet takes less time to finish on D when compared with the intermediate convolution layers. The last two layers (pooling and fully connected layers) also take a small amount of time to finish on D (less than 10 ms). Thus, the three aforementioned layers always end up being executed on D. The solution is the same for the RTT values of 10, 20, 40, and 80 ms. For \(RTT=120\) ms, the solution changes, with the second and third layers also being computed on D.

The offloading strategy is more sensitive to network delay in ResNet18 than in MobileNet. This is because each individual block in ResNet18 is less compute-intensive than the layers in MobileNet. With the RTT above 80 ms, all the layers are assigned to D.

Table 10 shows the outcome of applying the selective scheme in the coarse-grained case. In comparison to Table 4 for the fine-grained case, one can observe the same qualitative relationships among the three data transfer schemes under consideration. Note that Table 4 also confirms the quantitative benefit of employing the fine-grained approach over the coarse-grained approach (the numbers in Table 10 indicate higher energy consumption of D).

Table 10.

CNN	Optimistic (\(\mu ,\sigma\))	Conservative (\(\mu ,\sigma\))	Selective (\(\mu ,\sigma\))
LeNet	0.31, 0.02	0.43, 0.77	0.32, 0.07
MobileNet	0.52, 0.08	6.59, 0.59	0.76, 1.51
AlexNet	0.80, 0.04	1.78, 0.10	0.83, 0.17
ResNet18	0.56, 0.08	5.21, 0.45	0.65, 0.56

Table 10. Coarse-grained Solution: Mean (\(\mu\)) and Standard Deviation (\(\sigma\)) of Energy Consumption of D (in Joules) for Different CNNs (1,000 Trials)

Similarly to Figure 13, which shows the energy-optimal layer assignments, Figure 14 shows the delay-optimal layer assignments. They are generated using delays as an alternative interpretation of our \(E_D(\,\cdot \,)\) quantities discussed in Section 3. The cases marked with a dot indicate that the energy-optimal and delay-optimal solutions differ. As expected, the value of completion time in a delay-optimal solution is lower than in the corresponding energy-optimal case. The difference between completion times of the energy-optimal and delay-optimal assignments is larger for ResNet18. This is due to the fact that the difference between computation and communication power is higher for this CNN. The results presented here demonstrate again that the network delay can have a significant impact on the optimality of different layer assignments.

Fig. 14.

Figure 15 shows the results of energy-optimal assignment when the communication bandwidth was decreased from 1.5 to 1 MBps for \(RTT = 20\) ms. When the bandwidth decreases, the tendency to run layers on D increases. For instance, layers \(l_2\) and \(l_3\) of MobileNet were executed on N for \(B=1.5\) MBps, whereas they were executed on D when the bandwidth was decreased. In all cases, both the energy consumption and the delay increased when the bandwidth was decreased.
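The shift toward D at lower bandwidth can be illustrated with a small sketch. It is not the delay or energy model of Section 3: it assumes, purely for illustration, that a transfer takes one RTT plus the payload size divided by the bandwidth, and that D draws constant power while computing, transferring, and idling. All function names, power values, and layer parameters below are hypothetical.

```python
def transfer_time_s(payload_bytes, rtt_s, bandwidth_bytes_per_s):
    """Assumed delay model: one round trip plus the time to push the payload."""
    return rtt_s + payload_bytes / bandwidth_bytes_per_s


def prefers_offload(payload_bytes, rtt_s, bandwidth_bytes_per_s,
                    t_local_s, t_remote_s,
                    p_compute_w=2.0, p_transfer_w=1.5, p_idle_w=0.5):
    """Compare D's energy for running one layer locally vs. sending its input to N.

    The power figures are hypothetical placeholders, not measured values.
    """
    e_local = p_compute_w * t_local_s
    e_offload = (
        p_transfer_w * transfer_time_s(payload_bytes, rtt_s, bandwidth_bytes_per_s)
        + p_idle_w * t_remote_s  # D idles while N computes the layer
    )
    return e_offload < e_local


# Hypothetical layer: 0.3 MB input, 200 ms on D, 20 ms on N, RTT = 20 ms.
layer = dict(payload_bytes=0.3e6, rtt_s=0.020, t_local_s=0.200, t_remote_s=0.020)
at_1_5_mbps = prefers_offload(bandwidth_bytes_per_s=1.5e6, **layer)
at_1_0_mbps = prefers_offload(bandwidth_bytes_per_s=1.0e6, **layer)
print(at_1_5_mbps, at_1_0_mbps)
```

With these made-up numbers, offloading is cheaper for D at 1.5 MBps but not at 1 MBps, which mirrors the qualitative trend described above.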

Fig. 15.

7.3 Comparison with Related Work

As mentioned in the prior work section, previous papers have not dealt with computation offloading on low-end IoT devices while considering the reliability of the communication channel and stochastic communication delays. Nevertheless, in this subsection, our partitioning solution is compared with two previous methods [13, 22] proposed for high-end mobile devices. It is shown that our approach leads to optimal results with significantly lower decision-making overhead.

The solution presented in Reference [22] finds a single partition point among the layers of a CNN. As a result of the partitioning algorithm, the layers before the partition point are mapped onto the IoT device, and the remaining layers are scheduled on the nearby accelerator. This style of partitioning might not lead to the optimal solution, since the optimal assignment may interleave layers between the two devices, as shown in Figure 13. Table 11 shows the energy consumption of Reference [22] compared with our solution for the four CNNs in the experiments for 1,000 frames using the conservative data transfer scheme. The energy saving depends on the complexity of the layers and the possible interleaving of layers. On average, a 17% improvement was obtained using our approach compared with Reference [22].
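The gap between a single partition point and an interleaved assignment can be sketched as follows. This is a minimal illustration under a simplified, assumed cost model, not the exact formulation of Section 5 or of Reference [22]: `e_local[i]` is D's energy to compute layer i, `e_remote[i]` is D's energy while waiting for N to compute layer i, and `e_xfer[i]` is D's energy to move the input of layer i across the link (with `e_xfer[n]` covering the final output). The input starts on D and the result must end on D. All names and numbers are hypothetical.

```python
from math import inf


def best_single_split(e_local, e_remote, e_xfer):
    """Layers [0, k) run on D, layers [k, n) run on N; try every split point k."""
    n = len(e_local)
    best_total, best_k = inf, None
    for k in range(n + 1):
        total = sum(e_local[:k])
        if k < n:  # ship the data to N once, then ship the result back
            total += e_xfer[k] + sum(e_remote[k:]) + e_xfer[n]
        if total < best_total:
            best_total, best_k = total, k
    return best_total, best_k


def interleaved_assignment(e_local, e_remote, e_xfer):
    """Linear-time dynamic program that lets every layer pick D or N freely."""
    n = len(e_local)
    best = {"D": 0.0, "N": inf}  # least energy so far with the data on each device
    came_from = []               # came_from[i][dev]: where the data was before layer i
    for i in range(n):
        step, prev = {}, {}
        for dev, run_cost in (("D", e_local[i]), ("N", e_remote[i])):
            other = "N" if dev == "D" else "D"
            stay, move = best[dev], best[other] + e_xfer[i]
            prev[dev] = dev if stay <= move else other
            step[dev] = min(stay, move) + run_cost
        best = step
        came_from.append(prev)
    # The final result has to be available on D.
    if best["D"] <= best["N"] + e_xfer[n]:
        total, dev = best["D"], "D"
    else:
        total, dev = best["N"] + e_xfer[n], "N"
    plan = []
    for i in reversed(range(n)):  # walk the back-pointers to recover the plan
        plan.append(dev)
        dev = came_from[i][dev]
    return total, plan[::-1]


# Hypothetical 5-layer network: cheap first layer, two heavy layers, two tiny last layers.
E_LOCAL = [0.10, 1.50, 1.50, 0.02, 0.02]
E_REMOTE = [0.05, 0.10, 0.10, 0.05, 0.05]
E_XFER = [0.60, 0.20, 0.20, 0.20, 0.20, 0.20]

split_total, split_k = best_single_split(E_LOCAL, E_REMOTE, E_XFER)
dp_total, dp_plan = interleaved_assignment(E_LOCAL, E_REMOTE, E_XFER)
print(split_k, round(split_total, 2), dp_plan, round(dp_total, 2))
```

In this made-up instance, the best single split keeps only the first layer on D, whereas the dynamic program also brings the last two tiny layers back to D and spends less energy. The pattern resembles the MobileNet assignments discussed in Section 7.2, but the numbers are illustrative only.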

Table 11.

CNN	EdgeWise (Proposed solution)	Ref. [22]	Improvement
LeNet	0.43 J	0.47 J	9%
AlexNet	1.78 J	1.92 J	8%
MobileNet	6.59 J	8.17 J	24%
ResNet18	5.12 J	6.53 J	27%

Table 11. Energy Consumption of Reference [22] Compared with Our Approach for 1,000 Frames

Our approach leads to an average of 17% improvement for four CNNs compared with Reference [22].

Figure 16 shows the empirical CDF of the energy consumption for EdgeWise (the proposed approach) compared with the method presented in Reference [22] for the MobileNet CNN. As can be seen, the method in Reference [22] leads to higher energy consumption than EdgeWise.

Fig. 16.

Our partitioning solution was also compared with the method presented in Reference [13], which is based on finding the shortest path. Table 12 shows the decision-making overhead of our proposed solution and of the shortest path method proposed in Reference [13]. As can be seen, our approach has significantly lower overhead, owing to its O(1) complexity, than the shortest path approach.

Table 12.

CNN	EdgeWise	Shortest Path	Speedup over Shortest Path
LeNet	21 \(\mu\)s	0.6 ms	28×
MobileNet	41 \(\mu\)s	1.2 ms	29×
AlexNet	50 \(\mu\)s	1.5 ms	30×
ResNet18	45 \(\mu\)s	1.3 ms	28×

Table 12. Comparison of Decision Making Overhead for Our Proposed Approach versus the Shortest Path Solution in Reference [13]

Our solution has significantly lower overhead of decision making.

The solution presented in Reference [13] does not consider the reliability of the communication channel. The focus of our work is to provide solutions for tackling unreliable communication and stochastic delays, where the assignment of layers to devices is determined based on the prediction of network status. The data transfer scheme is selected based on the prediction of the network behavior to ensure that the computation does not have to restart from the beginning if N becomes unavailable. The presented MDP solution considers the stochastic behavior of the network in the middle of processing a frame and uses a lookup table to select the optimal action in each state of the system.
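The lookup-table idea can be sketched with a small finite-horizon MDP solved offline by backward induction. This is a simplified illustration, not the exact state space or cost structure of Section 4: it assumes that a state is (layer index, device currently holding the data, discretized delay level), that the delay level evolves as a Markov chain with transition matrix `trans`, and that `e_xfer[r]` is D's transfer energy at delay level r. All names and numbers are hypothetical.

```python
def build_policy(e_local, e_remote, e_xfer, trans):
    """Offline step: backward induction over layers; returns the lookup table."""
    n, levels = len(e_local), len(e_xfer)
    devices = ("D", "N")
    # Once all layers are done, the result must be on D: free if already there.
    value = {"D": [0.0] * levels, "N": list(e_xfer)}
    policy = [None] * n
    for i in reversed(range(n)):
        new_value = {loc: [0.0] * levels for loc in devices}
        policy[i] = {loc: [None] * levels for loc in devices}
        for r in range(levels):
            # Expected cost-to-go if layer i runs on device a (delay then moves r -> r').
            future = {a: sum(p * v for p, v in zip(trans[r], value[a]))
                      for a in devices}
            for loc in devices:
                costs = {}
                for a in devices:
                    run = e_local[i] if a == "D" else e_remote[i]
                    move = e_xfer[r] if a != loc else 0.0
                    costs[a] = run + move + future[a]
                best = min(devices, key=costs.get)
                policy[i][loc][r] = best
                new_value[loc][r] = costs[best]
        value = new_value
    return policy, value["D"]  # the input frame always starts on D


def next_device(policy, layer, loc, delay_level):
    """Runtime step: a constant-time table lookup, no optimization needed."""
    return policy[layer][loc][delay_level]


# Hypothetical 3-layer network with two delay levels (0 = low, 1 = high).
E_LOCAL = [0.10, 1.50, 0.02]
E_REMOTE = [0.05, 0.10, 0.05]
E_XFER = [0.10, 1.00]
TRANS = [[0.9, 0.1], [0.5, 0.5]]

POLICY, EXPECTED = build_policy(E_LOCAL, E_REMOTE, E_XFER, TRANS)
print(next_device(POLICY, 0, "D", 0), next_device(POLICY, 0, "D", 1))
```

In this toy instance, the first layer is offloaded when the sampled delay is low and kept on D when it is high, so the decision adapts to the network state observed before each layer while the runtime cost stays at a single table access.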

8 Conclusions

This article addressed the problem of energy-efficient execution of CNNs on an energy-constrained embedded IoT device in collaboration with a nearby accelerator on the wireless network in the presence of stochastic communication delays. The problem of minimizing the energy consumption of an IoT device has been addressed using fine-grained and coarse-grained approaches. In the fine-grained approach, where the network delay is sampled before executing any given layer, the problem is formulated and solved as an MDP that assigns the optimal action (i.e., deciding where to execute that layer) in each state of the MDP based on a lookup table generated offline. During runtime, the lookup table is accessed and the action corresponding to the current state is determined in \(O(1)\) time. In the coarse-grained approach, where the network delay is sampled only once at the beginning (and then assumed fixed during the CNN application execution), the problem is formulated and solved via dynamic programming of linear time complexity. It should be noted that for both approaches, the generated solutions are not only lightweight, but also optimal. The proposed solutions can also be used for the case of multiple nearby accelerators, where the time complexity of both fine-grained and coarse-grained solutions remains the same.

The experimental results for four CNNs, executed on a Raspberry Pi (IoT device) in collaboration with an NVIDIA Jetson (accelerator device), show that (1) significant energy savings can be achieved using our proposed methods in comparison to the no-collaboration case and (2) the stochastic fine-grained approach yields larger energy savings in comparison to the deterministic coarse-grained approach.

References

[1] 2020. NVIDIA Jetson TX2. Retrieved from https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/.
