7.1 Fine-grained Solution
To build the MDP model, the system must be profiled first. This includes taking repeated direct measurements of the network delay and constructing the state transition probabilities mentioned in the MDP formulation (Section
4). The cost functions were also calculated based on the profiling of both devices. The states, probabilities, and cost functions were then converted into the MDP grammar format. To evaluate the convergence of the method to the optimal solution in an infinite time horizon, the discount factor
\(\gamma\) for the MDP was set to 0.9. The public domain framework called APPL [
26,
33] was used to solve the MDP problem with the given cost functions and state change probabilities. Then, the policy was computed and expressed in the form of a lookup table specifying the suitable action in each state. The number of entries in the lookup table (as shown in Table
1) was at most 287 for AlexNet when the network delay distribution was partitioned into seven network delay intervals. With so few entries, the overhead of looking up the action for a given state in the lookup table is negligible.
The MDP lookup table needs to be generated only once if the communication model is fixed. Several lookup tables can be generated for different communication models and used in different scenarios. The overhead of generating the MDP lookup table for
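The step from a solved MDP to a lookup table can be illustrated with a minimal sketch. It uses plain value iteration with the discount factor 0.9 from the text; it does not use APPL or its input format, and the two-state model, its probabilities, and its costs are hypothetical values chosen only for illustration.

```python
GAMMA = 0.9  # discount factor used in the text


def q_value(s, a, V, trans, cost, gamma):
    """Expected discounted cost of taking action a in state s, then following V."""
    return cost[s][a] + gamma * sum(p * V[s2] for p, s2 in trans[s][a])


def value_iteration(states, actions, trans, cost, gamma=GAMMA, tol=1e-9):
    """Return (V, policy) minimizing the expected discounted cost.

    trans[s][a] is a list of (probability, next_state) pairs and
    cost[s][a] is the immediate cost of action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = min(q_value(s, a, V, trans, cost, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The policy is a plain dictionary, i.e., a lookup table from state to action.
    policy = {
        s: min(actions, key=lambda a: q_value(s, a, V, trans, cost, gamma))
        for s in states
    }
    return V, policy


# Hypothetical toy model: two delay states, two actions (run on D or on N).
states = ["low", "high"]
actions = ["D", "N"]
trans = {
    "low": {a: [(0.8, "low"), (0.2, "high")] for a in actions},
    "high": {a: [(0.3, "low"), (0.7, "high")] for a in actions},
}
cost = {"low": {"D": 5.0, "N": 2.0}, "high": {"D": 5.0, "N": 9.0}}

V, policy = value_iteration(states, actions, trans, cost)
action = policy["low"]  # constant-time lookup at runtime
```

Once the table is built, selecting an action is a single dictionary access, which is why the runtime overhead stays small for a table of a few hundred entries.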
\(r=7\) on a Core-i7 3612 CPU with 8 GB RAM is reported in Table
2. The measured overheads demonstrate that the MDP can be re-solved at runtime when the assumed communication model deviates from the observed network behavior.
Table
3 shows the results of applying the MDP to different CNNs under the conservative data transfer scheme (see Section
3). These numbers were obtained after running the experiments 200K times. The table shows that using more network delay intervals for the MDP formulation leads to lower energy consumption of
D. This clearly demonstrates the importance of accounting for the variations in the communication delays. The column labeled
\(r=1\) corresponds to using the midpoint of the entire network delay range (
\((RTT_{min}+RTT_{max})/2\)), which is the coarse-grained special case of a fixed
\(RTT = 67.925\) ms. Using more than seven intervals did not result in any further decrease in the energy consumption. Comparing the results of
\(r=1\) with
\(r=7\), the energy consumption of
D was reduced by approximately
41% for LeNet,
25% for MobileNet,
14% for AlexNet, and
22% for ResNet. The average reduction in energy consumption was 25.5%. In Section
7.2, additional coarse-grained special cases of different fixed values of the
RTT equal to 10, 20, 40, 80, and 120 ms are discussed.
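The role of the parameter r can be made concrete with a short sketch that maps RTT samples to r intervals and estimates interval-to-interval transition probabilities from consecutive measurements. Equal-width intervals and a first-order dependence between consecutive samples are assumptions made here for illustration; the range of 6 to 129 ms is the measured range reported in this section, and the sample values are hypothetical.

```python
def interval_index(rtt, rtt_min, rtt_max, r):
    """Map an RTT sample (ms) to one of r equal-width intervals (0-based)."""
    if r == 1:
        return 0  # coarse-grained case: a single interval for the whole range
    width = (rtt_max - rtt_min) / r
    idx = int((rtt - rtt_min) / width)
    return min(max(idx, 0), r - 1)  # clamp so that rtt_max lands in the last interval


def transition_matrix(samples, rtt_min, rtt_max, r):
    """Estimate P[a][b], the probability of moving from interval a to interval b."""
    idxs = [interval_index(x, rtt_min, rtt_max, r) for x in samples]
    counts = [[0] * r for _ in range(r)]
    for a, b in zip(idxs, idxs[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for intervals that were never observed are left as all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical sequence of consecutive RTT measurements (ms).
samples = [10.0, 10.0, 120.0, 10.0]
P = transition_matrix(samples, 6.0, 129.0, 7)
```

With r = 1 every sample falls into the same interval, so the model cannot distinguish a short delay from a long one, which matches the coarse-grained special case described above.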
Using
\(r=7\) delay intervals resulted in the most energy savings for
D. For this case, some CNN layers were always assigned to
N (Jetson), while some other layers were never assigned to
N (see Figure
11). This shows the potential for model order reduction in the context of our MDP framework, which may in turn give rise to efficient online algorithms incorporating both fine-grained (stochastic) and coarse-grained (deterministic) optimization approaches.
Table
4 shows the results of simulating all three data transfer schemes repeatedly for 1,000 trials. The number of the network delay intervals was
\(r=7\). For the selective scheme, the value of the timeout was
\(\theta = 60\) ms and the simulations of the selective scheme included the presence of failures, whereas the other two were assumed to be ideal. The value of
\(\theta\) differs from
\(RTT=67.925\) ms, which was used for comparing the MDP solution in the case of
\(r=1\). The results clearly show that even when failures are present, the selective scheme results in much less energy consumption by
D when compared to the conservative scheme and nearly the same as the (ideal) optimistic scheme. For the CNNs that produce a larger amount of data in each layer (MobileNet and ResNet18), the conservative scheme involves a significant amount of energy consumption for sending back the results after the execution of each layer. Thus, there is a better potential for energy improvement using the selective approach for these CNNs.
MDP Sensitivity Analysis: In this section, the effect of noise in the MDP model on the energy values is evaluated.
•
Let P be the empirical probability function based on n observations of the round-trip delay \(RTT_1,\ldots ,RTT_n\). Let \(\pi _P\) be the optimal MDP policy based on the probability function P.
•
Let \(P^{\prime }\) be the probability function based on n other observations of the round-trip delay \(RTT^{\prime }_1,\ldots ,RTT^{\prime }_n\) that might be observed when the system is operating. \(P^{\prime }\) can be considered as the noisy model.
•
We conducted 200K experiments, in which we generated random round-trip delay values \(RTT^{\prime }_1,\ldots ,RTT^{\prime }_n\), and applied the policy \(\pi _P\) with these delay values. The net result was that the energy values when applying \(\pi _P\) using \(RTT^{\prime }_1,\ldots ,RTT^{\prime }_n\) differed by at most 5% compared to the case when \(\pi _P\) was applied using P, which was based on \(RTT_1,\ldots ,RTT_n\). Below are the details of the experiment.
To perform the sensitivity analysis, we added noise to the measured values of
RTT. The noise was assumed to follow a normal distribution
\(N(0,\sigma ^2)\). In our experiments, the minimum and maximum measured RTT values were 6 and 129 ms, respectively. We selected the value
\(\sigma =10\) for the distribution of the noise. We calculated the probability function
\(P^{\prime }\) based on the data with added noise. Table
5 shows the result of applying the same policy on the delay values from both
P and
\(P^{\prime }\). As can be seen, the difference in the average energy consumption over the 200K experiments is at most 5%.
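The construction of the noisy model can be sketched as follows. Zero-mean Gaussian noise with a standard deviation of 10 ms is added to each measured RTT value, and the empirical probability function is then recomputed from the perturbed samples. Clamping the noisy values to the measured range, the equal-width binning, and the sample values are assumptions introduced for this illustration and are not taken from the experiments.

```python
import random

RTT_MIN, RTT_MAX = 6.0, 129.0  # measured RTT range reported in the text (ms)


def add_noise(rtts, sigma=10.0, seed=None):
    """Add N(0, sigma^2) noise to each sample.

    Clamping to the measured range is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [min(max(x + rng.gauss(0.0, sigma), RTT_MIN), RTT_MAX) for x in rtts]


def empirical_pmf(rtts, r=7):
    """Fraction of samples in each of r equal-width intervals (assumed binning)."""
    width = (RTT_MAX - RTT_MIN) / r
    counts = [0] * r
    for x in rtts:
        counts[min(int((x - RTT_MIN) / width), r - 1)] += 1
    return [c / len(rtts) for c in counts]


def relative_difference(reference, other):
    """Relative deviation of `other` from `reference`, e.g., 0.05 for 5%."""
    return abs(other - reference) / reference


measured = [20.0, 35.0, 35.0, 60.0, 110.0]  # hypothetical RTT samples (ms)
noisy = add_noise(measured, sigma=10.0, seed=0)
P = empirical_pmf(measured)      # model used to compute the policy
P_prime = empirical_pmf(noisy)   # noisy model
```

The helper `relative_difference` is the kind of measure behind a statement such as "at most 5%": it would be applied to the average energy obtained under P and under the noisy model.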
The values of 95% confidence interval are also reported for different CNNs as shown in Table
6. The confidence interval is
\([\mu -1.96\times \sigma /\sqrt {n_{sample}},\mu +1.96\times \sigma /\sqrt {n_{sample}}]\) where
\(\mu\) and
\(\sigma\) are the mean and standard deviation of the discounted cost, respectively. These values mean that the discounted cost falls in the mentioned interval in 95% of the experiments. As can be seen, the obtained average discounted cost value also falls in this range.
It is worth mentioning that these cost values are not comparable with the total energy consumption values, since they are multiplied by the discount factor in every decision epoch to evaluate the MDP solution in the infinite horizon.
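The two quantities discussed here can be sketched in a few lines. The first function accumulates a discounted cost with the discount factor 0.9, which shows why these values differ from a plain energy sum. The second computes a 95% interval in the standard normal-approximation form for a sample mean, with the standard error taken as the sample standard deviation divided by the square root of the number of samples; that this is exactly how the reported intervals were computed is an assumption. The input values are hypothetical.

```python
import math
import statistics


def discounted_cost(costs, gamma=0.9):
    """Sum of per-epoch costs, each weighted by gamma**t."""
    return sum((gamma ** t) * c for t, c in enumerate(costs))


def confidence_interval_95(values):
    """Normal-approximation 95% interval for the mean of `values`."""
    n = len(values)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    half_width = 1.96 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width


# Hypothetical per-epoch costs and per-experiment discounted costs.
total = discounted_cost([1.0, 1.0, 1.0])  # 1 + 0.9 + 0.81
low, high = confidence_interval_95([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because later epochs are weighted by increasing powers of the discount factor, the discounted cost of a run is always smaller than the undiscounted sum of the same positive costs.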
Varying Bandwidth: Another set of experiments was also conducted treating the bandwidth as a random variable, sampling it at the beginning of the execution of every layer.
The bandwidth was modeled as a Gaussian random variable \(N(1,0.2^2)\). The mean \(\mu =1\) MBps was the midpoint of the range of values \([0.4,1.6]\) MBps, whereas the standard deviation was taken to be 0.2 MBps so that this range spans three standard deviations on either side of the mean. The MDP state is now described by \((\Delta ,BW,i,d)\). Similarly to the RTT, the bandwidth range is divided into several intervals, and the parameter BW is the interval to which the observed bandwidth value belongs.
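A minimal sketch of this sampling step is given below. The bandwidth is drawn from a Gaussian with mean 1 MBps and standard deviation 0.2 MBps at the start of every layer and mapped to an interval index. The number of bandwidth intervals (four here), the clamping of samples to the stated range, and the equal-width binning are assumptions made for illustration. Reading the state components as the delay interval, the bandwidth interval, the layer index, and the current device is likewise an assumption based on the notation, not a definition taken from the formulation.

```python
import random

BW_MEAN, BW_STD = 1.0, 0.2   # MBps, as stated in the text
BW_MIN, BW_MAX = 0.4, 1.6    # MBps, mean plus or minus three standard deviations


def sample_bandwidth(rng):
    """Draw one bandwidth value; clamping to the range is an assumption."""
    bw = rng.gauss(BW_MEAN, BW_STD)
    return min(max(bw, BW_MIN), BW_MAX)


def bandwidth_interval(bw, k):
    """Map a bandwidth value to one of k equal-width intervals (0-based)."""
    width = (BW_MAX - BW_MIN) / k
    return min(int((bw - BW_MIN) / width), k - 1)


def make_state(delay_interval, bw_interval, layer, device):
    """Hypothetical state tuple mirroring (Delta, BW, i, d)."""
    return (delay_interval, bw_interval, layer, device)


rng = random.Random(0)
K = 4  # hypothetical number of bandwidth intervals
states = []
for layer in range(5):  # the bandwidth is re-sampled before every layer
    bw = sample_bandwidth(rng)
    states.append(make_state(3, bandwidth_interval(bw, K), layer, "D"))
```

Adding the bandwidth interval multiplies the number of states by the number of intervals, so the lookup table grows accordingly.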
Figure
12 shows the heatmap of the layer-by-layer execution of the CNN, assuming varying bandwidth. The figure shows that the proposed solution adapts to these variations, in contrast to Figure
11 where the bandwidth was fixed. Table
7 shows the corresponding average values of the energy consumption of each CNN. These data show that the energy values in the case of varying bandwidth are higher than in the fixed bandwidth case, as expected. This difference is higher for MobileNet, since it has a higher number of data outputs.
7.2 Coarse-grained Solution
The coarse-grained solutions for the case of fixed
\(RTT = 40\) ms are listed in Tables
8 and
9. Column
CGDP shows the results of executing Algorithm
2 (optimal layer partitioning between
D and
N), while columns
All-RPI and
All-Jetson refer to running the entire application either on
D or on
N, respectively.
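The idea of an optimal layer partitioning between two devices can be sketched with a small dynamic program. This is not a reproduction of Algorithm 2; it is a generic two-device formulation under an assumed cost model, in which each layer has a cost when run on D or on N, a transfer cost is paid whenever consecutive layers run on different devices, and a return cost is paid if the last layer runs on N. All names and numbers below are hypothetical.

```python
def partition_layers(cost_d, cost_n, xfer, ret):
    """Cheapest assignment of layers to D or N under the assumed cost model.

    cost_d[i], cost_n[i]: cost charged to D when layer i runs on D or on N.
    xfer[i]: cost of moving the input of layer i across the link, paid when
             layer i runs on a different device than layer i-1 (input starts on D).
    ret: cost of returning the final output to D if the last layer ran on N.
    Returns (total_cost, assignment), e.g., (8.0, "DNND").
    """
    # best[dev] = (cost, assignment) of the cheapest plan whose last layer is on dev
    best = {"D": (float(cost_d[0]), "D"), "N": (float(xfer[0] + cost_n[0]), "N")}
    for i in range(1, len(cost_d)):
        nxt = {}
        for dev, run in (("D", cost_d[i]), ("N", cost_n[i])):
            options = []
            for prev, (c, path) in best.items():
                move = 0.0 if prev == dev else xfer[i]
                options.append((c + move + run, path + dev))
            nxt[dev] = min(options)
        best = nxt
    return min(best["D"], (best["N"][0] + ret, best["N"][1]))


# Hypothetical four-layer example.
cost_d = [1, 8, 8, 1]
cost_n = [2, 1, 1, 2]
xfer = [3, 2, 2, 2]
ret = 2

best_cost, best_plan = partition_layers(cost_d, cost_n, xfer, ret)
all_d = sum(cost_d)                  # counterpart of the All-RPI column
all_n = xfer[0] + sum(cost_n) + ret  # counterpart of the All-Jetson column
```

In this toy instance the partitioned plan costs 8, against 18 for running everything on D and 11 for running everything on N, which mirrors the qualitative relationship among the three columns.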
It is worth noting that in all CNNs, the CGDP solution outperforms the All-RPI and All-Jetson scenarios in terms of both delay and energy. MobileNet has the highest energy consumption among the networks, while LeNet has the lowest. In LeNet and MobileNet, the All-Jetson scenario results in lower energy consumption when compared to the All-RPI scenario when using the conservative data transfer scheme. Due to the larger number of layers in AlexNet and ResNet18 and the higher overhead of sending the output data in each layer back to the IoT device, All-RPI outperforms All-Jetson for these two CNNs. All-RPI shows significantly higher energy consumption in the case of MobileNet because the execution time on the RPI is significantly higher. An average improvement of 31% and 23% in energy consumption was achieved compared with All-RPI and All-Jetson, respectively.
When the intermediate results are not sent back from N to D after every layer (optimistic scheme), the energy consumption of D for LeNet, MobileNet, AlexNet, and ResNet was significantly lower than the optimal CGDP solution under the conservative scheme. However, the drawback of using the optimistic scheme is that any network connection failures will entail re-computation of lost data (consuming extra energy and time).
Figure
13 shows the energy-optimal offloading solutions for different fixed
RTT values. For LeNet, with the
RTT of 10 ms, the first four layers were executed on
N and the last two layers were computed on
D. This is due to the fact that the computation time of the last two layers in LeNet was much shorter than the communication delay. For larger values of
RTT, the tendency to perform the computations on
D increases. The offloading solution in LeNet is the same for three cases of delay (i.e., 20, 40, and 80 ms). In the case of 120 ms, all layers are scheduled to be executed on
D.
For AlexNet, the optimal partitioning solution is different for all five values of RTT. As expected, the tendency is to run more layers on D for larger values of RTT. With a network delay value of 120 ms, all the layers are executed on D. Also, it is worth mentioning that the first three layers are executed on D for all RTT values. This is because the overhead of sending data to N is greater than the cost of executing those layers on D itself.
In MobileNet, the offloading solution does not change significantly as the RTT changes. The first convolution layer in MobileNet takes less time to finish on D when compared with the intermediate convolution layers. The last two layers (pooling and fully connected layers) also take a small amount of time to finish on D (less than 10 ms). Thus, the three aforementioned layers always end up being executed on D. The solution is the same for the RTT values of 10, 20, 40, and 80 ms. For \(RTT=120\) ms, the solution changes, with the second and third layers also being computed on D.
The offloading strategy is more sensitive to network delay in ResNet18 than in MobileNet. This is because each individual block in ResNet18 is less compute-intensive than the layers in MobileNet. With the RTT above 80 ms, all the layers are assigned to D.
Table
10 shows the outcome of applying the selective scheme in the coarse-grained case. In comparison to Table
4 for the fine-grained case, one can observe the same qualitative relationships among the three data transfer schemes under consideration. Note that Table
4 also confirms the quantitative benefit of employing the fine-grained approach over the coarse-grained approach (the numbers in Table
10 indicate higher energy consumption of
D).
Similarly to Figure
13, which shows the energy-optimal layer assignments, Figure
14 shows the delay-optimal layer assignments. They are generated using delays as an alternative interpretation of our
\(E_D(\,\cdot \,)\) quantities discussed in Section
3. The cases marked with a dot indicate that the energy-optimal and delay-optimal solutions differ. As expected, the value of completion time in a delay-optimal solution is lower than in the corresponding energy-optimal case. The difference between completion times of the energy-optimal and delay-optimal assignments is larger for ResNet18. This is due to the fact that the difference between computation and communication power is higher for this CNN. The results presented here demonstrate again that the network delay can have a significant impact on the optimality of different layer assignments.
Figure
15 shows the results of energy-optimal assignment where the communication bandwidth was changed from 1.5 to 1 MBps for
\(RTT = 20\) ms. When the bandwidth is decreased, the tendency to run on
D increases. For instance, layers
\(l_2\) and
\(l_3\) of MobileNet were executed on
N in the case of
\(B=1.5\) MBps, whereas these layers were executed on
D when the bandwidth was decreased. In all cases, the energy consumption and delay were increased when the bandwidth was decreased.
7.3 Comparison with Related Work
As mentioned in the prior work section, previous papers have not dealt with computation offloading on low-end IoT devices while considering the reliability of the communication channel and stochastic communication delays. Nevertheless, in this subsection, our partitioning solution is compared with two previous approaches [
13,
22] proposed for high-end mobile devices. It is shown that our approach leads to optimal results with significantly lower decision-making overhead.
The solution presented in Reference [
22] finds the partition point in the layers of a CNN. As a result of the partitioning algorithm, the layers before the partition point are mapped onto the IoT device and the rest of the layers are scheduled on the nearby accelerator. This style of partitioning might not lead to the optimal solution, since the optimal assignment may interleave layers between the two devices, as shown in Figure
13. Table
11 shows the energy consumption of Reference [
22] compared with our solution for four CNNs in the experiments for 1,000 frames using the conservative data transfer scheme. The energy saving depends on the complexity of the layers and the possible interleaving of layers. On average, a 17% improvement was obtained using our approach compared with the work [
22].
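The difference between a single partition point and an interleaved assignment can be illustrated on a toy instance. The sketch below enumerates all assignments of a four-layer network by brute force and compares the best one with the best plan of the form "first k layers on D, remaining layers on N". The cost model and all numbers are hypothetical and are not taken from the experiments or from the method of the compared work.

```python
from itertools import product


def plan_cost(plan, cost_d, cost_n, xfer, ret):
    """Cost of an assignment string such as "DNND" under the assumed model."""
    total, prev = 0.0, "D"  # the input frame starts on D
    for i, dev in enumerate(plan):
        if dev != prev:
            total += xfer[i]  # data crosses the link before layer i
        total += cost_d[i] if dev == "D" else cost_n[i]
        prev = dev
    if prev == "N":
        total += ret  # final result is returned to D
    return total


def best_single_split(cost_d, cost_n, xfer, ret):
    """Best plan with one partition point: k layers on D, the rest on N."""
    n = len(cost_d)
    plans = ["D" * k + "N" * (n - k) for k in range(n + 1)]
    return min((plan_cost(p, cost_d, cost_n, xfer, ret), p) for p in plans)


def best_any(cost_d, cost_n, xfer, ret):
    """Best plan over all 2**n assignments (brute force, toy sizes only)."""
    plans = ("".join(p) for p in product("DN", repeat=len(cost_d)))
    return min((plan_cost(p, cost_d, cost_n, xfer, ret), p) for p in plans)


# Hypothetical costs: cheap first and last layers, expensive middle layers on D.
cost_d = [1, 8, 8, 1]
cost_n = [2, 1, 1, 2]
xfer = [3, 2, 2, 2]
ret = 2

split = best_single_split(cost_d, cost_n, xfer, ret)
interleaved = best_any(cost_d, cost_n, xfer, ret)
```

Here the best single split is "DNNN" with cost 9, while the unrestricted optimum "DNND" costs 8 because the cheap last layer is brought back to D. The single-split search space is a subset of the unrestricted one, so it can never do better.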
Figure
16 shows the empirical CDF of the energy consumption for EdgeWise (the proposed approach) compared with the presented method in Reference [
22] for the MobileNet CNN. As can be seen, the method in Reference [
22] will lead to higher energy consumption compared with EdgeWise.
Our partitioning solution was also compared with the method presented in Reference [
13]. A solution based on finding the shortest path is presented in Reference [
13]. Table
12 shows the overhead of decision making for our proposed solution and the shortest path method proposed in Reference [
13]. As can be seen, our approach has significantly lower overhead, since looking up an action in the table has O(1) complexity, whereas the shortest path approach must solve a graph problem for each decision.
The solution presented in Reference [
13] does not consider the reliability of the communication channel. The focus of our work is to provide solutions for tackling unreliable communication delays where the assignment of layers to devices is determined based on the prediction of network status. The data transfer scheme is selected based on the prediction of the network behavior to ensure that the computation does not have to restart from the beginning in the case of unavailability of
N. The presented MDP solution considers the stochastic behavior of the network in the middle of processing a frame and uses a lookup table to select the optimal action in each state of the system.