1 Introduction and Objectives

The advent of low-cost IR sensors with real-time RGBD (depth) data is stimulating new applications that use 3D knowledge to create new user experiences. Such 3D sensing and computation is demanding for mobile devices. We perform a rigorous study of a 3D tracking service to support mobile 3D applications, characterizing feasible battery life with ensembles of today's wearables, smartphones, tablets, and laptops. First, using a combined metric (lifetime-speed), we compare ensemble capability. Second, using a lifetime metric, we bound realistic application times on a variety of ensembles. Most are quite short: even at low frame rates, lifetimes are a few hours, and at 30 fps, a few minutes. Third, we explore cloud support, showing that WiFi-based support is possible but LTE-based support is not, because communication consumes too much energy. Finally, we assess the opportunity to improve lifetime by adapting resolution and frame rate, showing a potential 6-fold improvement.

2 Application and Performance Model, Data Comparison

2.1 SLAMBench: A 3D Perception Service

We describe an analytical model for SLAM [5, 10] computation and compare it to real-time system measurements. SLAM includes three major steps: de-noising the sensor depth data, aligning each frame with the scene model, and updating the model. In SLAMBench [13], used in our experiments, de-noising uses a bilateral filter with Gaussian weights; alignment uses an ICP [1] algorithm with point-to-plane matching [3, 14] and projective mapping [2]; and frames are integrated into the model with a Truncated Signed Distance Function [4], followed by raycasting to generate the updated model point cloud. Our analytical model estimates for the SLAMBench stages are summarized in Table 1 (Fig. 1).
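
To make the stage structure concrete, the sketch below computes per-frame FLOP estimates for each stage. The per-pixel and per-voxel coefficients and the default in-view voxel count are illustrative placeholders, not the calibrated values from Table 1.

```python
# Minimal sketch of a per-frame FLOP model for the SLAMBench stages.
# All coefficients are illustrative placeholders, not the values in Table 1.

def slam_flops_per_frame(width, height,
                         bilateral_radius=2,
                         icp_iterations=10,
                         voxels_in_view=2_000_000):
    pixels = width * height
    window = (2 * bilateral_radius + 1) ** 2

    # De-noising: bilateral filter, a few FLOPs per pixel per window element.
    denoise = pixels * window * 6
    # Alignment: point-to-plane ICP with projective association,
    # a few tens of FLOPs per pixel per iteration.
    align = pixels * icp_iterations * 30
    # Integration: TSDF update touches every model voxel in the field of view.
    integrate = voxels_in_view * 10
    # Raycasting: one ray per pixel, marching through the in-view volume.
    raycast = pixels * 50

    return {"denoise": denoise, "align": align,
            "integrate": integrate, "raycast": raycast,
            "total": denoise + align + integrate + raycast}

print(slam_flops_per_frame(512, 424))
```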

Fig. 1. Acquisition + SLAM + Rendering Pipeline

Table 1. A model for simultaneous localization and mapping (floating-point operations/frame)

Many scene-dependent factors affect the precise computation count for 3D modelling. For example, the number of model voxels in the field of view determines how many voxels must be updated during model integration and raycasting, and thus affects the integration and raycasting computation counts. While a full analysis of such scene dependence is beyond the scope of this paper, we provide a simple approximation that applies scaling factors to the major elements (see subpoint 2 in Table 1). These factors are based on offline analysis of the specific experiments used in this paper; an online method is a good area for future work.
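
A minimal sketch of this approximation, assuming hypothetical in-view voxel fractions rather than our offline-derived factors, might look as follows:

```python
# Sketch of the scene-dependence approximation: scale the volume-dependent
# stages (integration, raycasting) by an offline-estimated fraction of model
# voxels in the field of view. The fractions are hypothetical placeholders.

SCENE_FACTORS = {"Near-Slow": 0.35, "Far-Fast": 0.55}  # fraction of voxels in view

def scene_adjusted_flops(stage_flops, scene):
    """stage_flops: dict of per-stage FLOPs/frame assuming the full model volume."""
    f = SCENE_FACTORS[scene]
    adjusted = dict(stage_flops)
    for stage in ("integrate", "raycast"):
        if stage in adjusted:
            adjusted[stage] *= f
    return adjusted

print(scene_adjusted_flops({"denoise": 5e8, "align": 2e9,
                            "integrate": 1e9, "raycast": 8e8}, "Far-Fast"))
```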

Fig. 2. Model and measured instruction counts (Far-Fast)

Fig. 3. Model and measured instruction counts (Near-Slow)

2.2 Basic Characterization and Data Comparison

In this section, we present measurements from a range of experiments, comparing them to our analytical model. We collected \(512 \times 424\) depth images at 30 fps using a Microsoft Kinect V2 sensor moving along a 2-meter track. To explore a key dimension of computational challenge, we vary the distance (sensor to scene) from 1.5 m (Near) to 2 m (Far) as well as the rate of camera movement from 0.06 m/s (Slow) to 0.2 m/s (Fast).

Timing and instruction counts are collected on an Intel i5-3350P CPU. We use downrezing (reducing the depth resolution) and frame subsetting so that we can compare a range of frame rates and sensor resolutions with the same experimental data. This data is presented in Figs. 2 and 3. To compare measured results with the model, we convert the model's floating-point operation counts to instructions using an average ratio of 2:1 based on overall observed averages. The model captures the key features of the required computation and matches measurements well; for example, linear growth with increasing pixels/frame and frame rate is captured clearly (see Figs. 2 and 3).
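
The sketch below illustrates the downrezing, frame subsetting, and FLOP-to-instruction conversion used for this comparison; the block-averaging downrez and the direction of the 2:1 ratio are assumptions of the sketch, not a description of our exact tooling.

```python
import numpy as np

def downrez(depth, factor):
    """Reduce depth resolution by an integer factor via block averaging (assumed scheme)."""
    h, w = depth.shape
    h, w = h - h % factor, w - w % factor
    blocks = depth[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def subset_frames(frames, capture_fps=30, target_fps=10):
    """Keep every k-th frame to emulate a lower sensor frame rate."""
    step = capture_fps // target_fps
    return frames[::step]

def flops_to_instructions(flops, ratio=2.0):
    # Assumption: roughly two floating-point operations per retired instruction.
    return flops / ratio
```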

3 Feasibility of Deploying 3D Perception: Ensembles and Lifetimes

3.1 Lifetime-Speed Metric

We explore the execution of a 3D service, SLAMBench, that performs simultaneous localization and mapping (SLAM), essential for 3D-aware navigation, rendering, or simply localization in future 3D applications. We take advantage of SLAMBench's partitioned structure, mapping its six stages to a variety of device ensembles. Our goal is to understand the capabilities and lifetimes of all interesting combinations of wearables, smartphones, tablets, and laptops (specifications in Table 2). To assess the usability of the service over a period of time, we define two metrics:

$$\begin{aligned} LS = \text {Lifetime-Speed product} = \# \text {frames computable} \times \text {maximum tracking rate} \end{aligned}$$
$$\begin{aligned} L = \text {Lifetime} = (\# \text {frames computable} / \text {maximum tracking rate}) / 3600 \end{aligned}$$
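
These two metrics can be transcribed directly as below; the battery capacity and per-frame energy and time figures in the example are hypothetical inputs, not measured device data.

```python
# Direct transcription of the LS and L metrics; input numbers are illustrative only.

def ls_and_lifetime(battery_joules, energy_per_frame_j, seconds_per_frame):
    frames_computable = battery_joules / energy_per_frame_j
    max_tracking_rate = 1.0 / seconds_per_frame            # frames per second
    ls = frames_computable * max_tracking_rate              # lifetime-speed product
    lifetime_hours = (frames_computable / max_tracking_rate) / 3600.0
    return ls, lifetime_hours

# Example (hypothetical): a 40 kJ battery, 5 J and 0.1 s per frame.
print(ls_and_lifetime(40_000, 5.0, 0.1))
```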
Table 2. Compute, battery and weight specs for device classes

3.2 Single Devices and Ensembles

We compare devices and ensembles using the LS metric. First, Fig. 4 compares single devices at two sensing resolutions. As expected, lower resolution significantly increases lifetime, and the largest devices have the longest lifetimes.

Fig. 4. LS metrics for single-device ensembles (wearable, smartphone, tablet, and laptop); several blue bars are too small to see.

Next we consider two-device ensembles in Fig. 5. With two devices, the decomposition of SLAMBench across the devices matters. Our results show that lifetime is greatest for deployments that place the computationally intense stages on the larger device, and for lower sensing resolution. Three-device ensemble data is presented in Fig. 6; the decomposition is even more complex, but the best configurations share the same property: the computationally intense stages run on the largest devices.

Overall, Figs. 4, 5, and 6 show that with appropriate computation mapping, LS is determined mostly by the largest device. For example, at resolution \(512 \times 424\) in Fig. 5, the maximum LS values achieved are 10,000 for the smartphone, 70,000 for the tablet, and 200,000 for the laptop. The best LS is achieved when the computation maps to the largest (heaviest) device.
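
The mapping search behind these figures can be sketched as follows. The stage costs and device specifications are hypothetical, and the simplifying assumptions (the slowest device sets the tracking rate, the first exhausted battery ends the run) belong to this sketch rather than to our measurement methodology.

```python
from itertools import product

# Hypothetical per-frame stage costs (FLOPs) and device specs.
STAGE_FLOPS = {"acquire": 1e6, "denoise": 5e8, "track": 2e9,
               "integrate": 1e9, "raycast": 8e8, "render": 3e8}
DEVICES = {
    "smartphone": {"gflops": 10, "battery_j": 20_000,  "flops_per_j": 2e9},
    "laptop":     {"gflops": 80, "battery_j": 200_000, "flops_per_j": 4e9},
}

def evaluate(mapping):
    """LS for one stage-to-device mapping."""
    load = {d: 0.0 for d in DEVICES}
    for stage, dev in mapping.items():
        load[dev] += STAGE_FLOPS[stage]
    used = [d for d in load if load[d] > 0]
    rate = min(DEVICES[d]["gflops"] * 1e9 / load[d] for d in used)      # fps limit
    frames = min(DEVICES[d]["battery_j"] * DEVICES[d]["flops_per_j"] / load[d]
                 for d in used)                                          # frames until a battery drains
    return frames * rate

best = max((dict(zip(STAGE_FLOPS, assign))
            for assign in product(DEVICES, repeat=len(STAGE_FLOPS))),
           key=evaluate)
print(best)
```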

Fig. 5.
figure 5

LS metrics for two-device ensembles (Configs with \({<}2\) stages on largest device omitted)

Fig. 6.
figure 6

LS metrics for three-device ensembles (Configs with \({>}2\) stages on smallest and \({<}2\) stages on largest device omitted)

Fig. 7. Lifetime at best frame rate. Achieved best frame rates labeled at top

3.3 Lifetimes for Weight-Comparable Ensembles

For mobile users, the dominant factor in whether to take a device along may be its weight. Our results show that larger devices have greater capability, but at a cost in portability. To see whether the larger devices are better only because of their greater size (bang for the gram), Figs. 7 and 8 consider the best configuration for lifetime in each weight class. Interestingly, while laptops and tablets are the most capable, the smartphone is the most efficient, providing the greatest lifetime for its weight. However, this holds only at low performance (1 fps); its lifetime falls to a few minutes at 30 fps.

Table 3. Energy/bit for various network technologies [6, 9]
Fig. 8. Lifetime normalized by weight at best frame rate. Achieved best frame rates labeled at top

3.4 Communication Limits

For ensembles, distributing the 3D service computation means that data must be transmitted between devices. Here we examine the energy cost of that communication within an ensemble, comparing it to the energy expended on computation (Fig. 9). In nearly all cases, computation energy dominates; communication energy is manageable for Bluetooth and WiFi. However, LTE is too expensive for an ensemble 3D service, with communication consuming over \(90\,\%\) of the total energy. This suggests that even with advances in LTE efficiency, cloud-based SLAM, or even partially cloud-based SLAM, is unlikely to be viable (Table 3).
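
The comparison can be sketched with per-bit energies in the spirit of Table 3; the numbers below are rough illustrative values, not the cited measurements [6, 9].

```python
# Sketch: communication energy share per frame for different networks.
# Per-bit energies and the per-frame compute energy are illustrative placeholders.

ENERGY_PER_BIT_J = {"bluetooth": 1e-7, "wifi": 5e-8, "lte": 1e-6}

def comm_energy_share(bits_per_frame, compute_j_per_frame, network):
    comm_j = bits_per_frame * ENERGY_PER_BIT_J[network]
    return comm_j / (comm_j + compute_j_per_frame)

# A raw 512x424 16-bit depth frame is ~3.5 Mbit before any compression.
for net in ENERGY_PER_BIT_J:
    print(net, round(comm_energy_share(512 * 424 * 16, 2.0, net), 2))
```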

Fig. 9. Energy distribution of communication and computation vs. network. Resolution: \(512 \times 424\); other configurations show a similar but slightly lower communication share.

4 Adaptive Control to Reduce Computation

4.1 Single Frame Rate and Resolution

Others have explored novel and customized data structures to reduce the computation cost of 3D model building and tracking [7, 8, 11, 12, 15, 16]. Point-cloud based reconstruction typically has high levels of redundancy, enabling robust reconstruction; the ICP algorithm can therefore support SLAM at lower resolutions and frame rates. As a baseline for both computation and accuracy, we use the highest-resolution frames (\(512 \times 424\)) and the maximum achievable frame rate. We require that the mean absolute trajectory error (ATE) stay within \(10\,\%\) of the best possible, and show the resulting lifetime improvement in Fig. 10. By picking the best rate and resolution, the lifetime of a range of ensembles can be increased by up to 6-fold.
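
A sketch of this single-choice oracle is given below; the candidate table is hypothetical and stands in for offline SLAMBench runs that would supply measured ATE and lifetime values.

```python
# Sketch of the oracle: among all (resolution, frame-rate) pairs, keep those whose
# mean ATE stays within 10% of the baseline and pick the longest lifetime.
# The rows below are illustrative, not measured results.

CANDIDATES = [
    # (width, height, fps, mean_ate_m, lifetime_hours)
    (512, 424, 30, 0.020, 0.3),
    (256, 212, 30, 0.021, 1.1),
    (128, 106, 10, 0.022, 1.8),
    (64,  53,  10, 0.035, 2.4),
]

def best_single_choice(candidates, baseline_ate, slack=0.10):
    ok = [c for c in candidates if c[3] <= baseline_ate * (1 + slack)]
    return max(ok, key=lambda c: c[4]) if ok else None

print(best_single_choice(CANDIDATES, baseline_ate=0.020))
```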

Fig. 10. Potential benefit of a best single frame rate and resolution (Oracle), Near-Slow (left) and Far-Fast (right). Resolution is reduced 16x and 64x and frame rate slightly, keeping mean tracking error within \(10\,\%\). Ensembles are the best for that number of devices (\(L\), \(T:0-1, L:2-5\), and \(S:0-0, T:1-1, L:2-5\) for Near-Slow; \(L\), \(T:0-2, L:3-5\), and \(S:0-0, T:1-2, L:3-5\) for Far-Fast)

Interestingly, these results appear to be consistent over a range of movement speeds and scene distances. Even in Far-Fast, 6-fold improvements are possible by using low resolution. In fact, the results are nearly as good as for Near-Slow.

4.2 Best Collection of Frames – Rate and Resolution

While the previous comparison assumed a single fixed, optimal choice, there is much opportunity to adapt at a finer temporal scale. To understand the potential of per-frame adaptive control of resolution and frame rate, we search exhaustively for the best combinations of resolution and frame rate in a 3-segment movement pattern. We first collect depth images with a Microsoft Kinect V2 camera moving at 0.5 m/s (Superfast) for a 4-second movement experiment. Second, we split the collected depth images into three segments, dividing equally by distance along the movement track. Finally, we consider all possible resolution and frame rate combinations for each segment and compute the mean ATE. Each set of choices produces both a total data volume and an ATE, and becomes a point in the 2D scatterplots shown in Figs. 11 and 12. Our results show a remarkable dynamic range of over 300-fold at close to the same ATE.
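
The exhaustive search can be sketched as follows; run_slam_segments is a hypothetical placeholder for re-running SLAMBench on the re-sampled depth stream, and the resolution and frame-rate menus are assumptions of the sketch.

```python
from itertools import product

# Sketch of the per-segment exhaustive sweep: each of the three segments of the
# 4 s track independently chooses a (resolution, fps) pair, yielding a data
# volume and (via the placeholder evaluator) a mean ATE.

RESOLUTIONS = [(512, 424), (256, 212), (128, 106), (64, 53)]
FRAME_RATES = [30, 15, 10, 5, 2, 1]
BITS_PER_PIXEL = 16
SEGMENT_SECONDS = 4.0 / 3

def data_volume_bits(choices):
    return sum(w * h * BITS_PER_PIXEL * fps * SEGMENT_SECONDS
               for (w, h), fps in choices)

def sweep(run_slam_segments):
    points = []
    for choices in product(product(RESOLUTIONS, FRAME_RATES), repeat=3):
        mean_ate = run_slam_segments(choices)   # placeholder SLAMBench evaluation
        points.append((data_volume_bits(choices), mean_ate))
    return points  # scatterplot points as in Figs. 11 and 12
```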

Fig. 11.
figure 11

Minimum mean ATE achieved at given data rate for Far-Superfast with adaptive control. Optimal adaptive control can be a big win (400 Mbits/1.2 Mbits = 333-fold).

Fig. 12.
figure 12

Minimum mean ATE achieved at given total frames for Far-Superfast with adaptive control

Our results show that the choice of adaptation matters a great deal: low-data-size adaptive control can achieve either very high or very low mean ATE. But the results are encouraging for adaptation, because there are low-data-size control choices that match the best mean ATE (see Figs. 11 and 12). For example, 1,200 kbits over the 4 s experiment is only about 40 KB/s, yet delivers close to the best mean ATE. Likewise, a small fraction of the frames (40 out of 120, or 10 fps) achieves close to the best mean ATE even for this high-speed motion. In short, if an intelligent adaptive controller can choose close to the optimum, a remarkably small amount of data suffices while producing near-lowest mean ATE.

The smaller data – resolution and frame rate – also dramatically reduces the computation required. To assess the potential benefit, we compute the computation cost savings of two simple adaptive control algorithms: (1) fixed frame rate, adapting resolution based on the mean depth of the point cloud, and (2) fixed resolution, adapting frame rate based on the tracked sensor velocity. These results are shown in Fig. 13.
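
Minimal sketches of the two controllers are given below; the thresholds and the direction of the depth-to-resolution mapping are assumptions for illustration, not tuned values from our experiments.

```python
# Controller (1): fixed frame rate, resolution keyed to mean scene depth.
# The mapping direction (closer scene -> heavier downrez) is an assumption of this sketch.
def resolution_from_depth(mean_depth_m):
    if mean_depth_m < 1.0:
        return (64, 53)     # 64x downrezed
    if mean_depth_m < 2.0:
        return (128, 106)   # 16x downrezed
    return (256, 212)

# Controller (2): fixed resolution, frame rate keyed to tracked sensor velocity.
# Faster motion needs more frequent alignment to keep ICP converging.
def frame_rate_from_velocity(speed_m_s):
    if speed_m_s > 0.3:
        return 30
    if speed_m_s > 0.1:
        return 10
    return 2
```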

Fig. 13. Computation cost saving ratio (tracking at \(512 \times 424\) resolution and 30 fps as baseline) for different adaptive control algorithms; configurations that do not achieve competitive ATE are omitted.

Our results show that choosing the better of these two adaptive control algorithms enables a 160-fold computation cost saving while achieving near-lowest ATE (see Fig. 13). This suggests that adaptive control is a promising way to save energy (communication and computation) while maintaining high tracking accuracy.

5 Summary and Future Work

Our study of a 3D perception service sheds insight into viable ensembles. At low frame rates, smartphones can support simple applications today. For fast motion, larger devices are required for peak compute speed and lifetime. Cloud support is not feasible. Adaptive frame rate and resolution is a promising approach to save energy. Future efforts should consider a broader range of devices and movements.