
1 Introduction

Event cameras, such as the sensor of Lichtsteiner et al. [3], are a neuromorphically inspired, asynchronous sensing modality that detects changes in log light intensity. The changes are encoded as events, \(e=\{x,y,t,p\}\), consisting of the pixel position, \((x, y)\), the timestamp, t, accurate to microseconds, and the polarity, p. These cameras provide numerous benefits, such as extremely low latency for tracking very fast motions, high dynamic range, and significantly lower power consumption.

Recently, several methods have shown that flow and other motion information can be estimated by ‘deblurring’ the event image [1, 5, 7]. For frame data, unsupervised optical flow methods such as [2, 4] have shown that neural networks can learn to predict optical flow from geometric constraints, without any ground truth labels.

In this work, we propose a novel input representation that captures the full spatiotemporal distribution of the events, and a novel unsupervised loss function that allows for efficient learning of motion information from only the event stream. Our input representation, a discretized event volume, discretizes the time domain, and then accumulates events in a linearly weighted fashion similar to interpolation. We train a neural network to predict a per-pixel optical flow from this input, which we use to attempt to deblur the events through motion compensation. During training, we then apply a loss that measures the motion blur in the motion compensated image, which the network is trained to minimize.

2 Method

We propose a novel input representation generated by discretizing the time domain. In order to improve the resolution along the temporal domain beyond the number of bins, we insert events into this volume using a linearly weighted accumulation similar to bilinear interpolation.

Given a set of N input events \(\{(x_i, y_i, t_i, p_i)\}_{i=0,\dots ,N-1}\), we divide the range of the timestamps, \(t_{N-1} - t_0\), which varies depending on the input events, into B bins. We then scale the timestamps to the range \([0, B-1]\), and generate the event volume as follows:

$$\begin{aligned} t^*_i =&(B-1)(t_i - t_0)/(t_{N-1} - t_0)\end{aligned}$$
(1)
$$\begin{aligned} V(x,y,t)=&\sum _{i} p_i\max (0, 1-|x-x_i|)\max (0, 1-|y-y_i|)\max (0, 1-|t-t^*_i|) \end{aligned}$$
(2)

We treat the time domain as channels in a traditional 2D image, and perform 2D convolution across the x and y spatial dimensions.
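As a concrete illustration, a minimal NumPy sketch of Eqs. (1) and (2) is given below. The function name, the array-based interface, and the assumption that event coordinates are integer pixel locations (so only the temporal kernel actually interpolates) are ours, not taken from the authors' released code.

```python
import numpy as np

def build_event_volume(x, y, t, p, H, W, B=9):
    """Accumulate events into a (B, H, W) volume with linearly weighted bins."""
    # Scale timestamps to the range [0, B-1] (Eq. 1).
    t_star = (B - 1) * (t - t[0]) / (t[-1] - t[0])
    volume = np.zeros((B, H, W), dtype=np.float32)
    # Events arrive on integer pixel coordinates, so the spatial kernels in
    # Eq. 2 reduce to a delta at (x_i, y_i); only time is interpolated here.
    for b in range(B):
        w_t = np.maximum(0.0, 1.0 - np.abs(b - t_star))   # temporal kernel
        np.add.at(volume[b], (y.astype(int), x.astype(int)), p * w_t)
    return volume
```

The B temporal slices returned here are what the network consumes as input channels.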

Given optical flow for each pixel, \(u(x, y)\), \(v(x, y)\), we propagate the events, with scaled timestamps, \(\{(x_i,y_i,t^*_i,p_i)\}_{i=0,\dots ,N-1}\), to a single time \(t'\):

$$\begin{aligned} \begin{pmatrix}x_i'\\ y_i'\end{pmatrix}=&\begin{pmatrix}x_i\\ y_i\end{pmatrix}+(t'-t^*_i)\begin{pmatrix}u(x_i,y_i)\\ v(x_i,y_i)\end{pmatrix} \end{aligned}$$
(3)

We then separate these propagated events by polarity and generate a pair of images, \(T_{+}, T_{-}\), consisting of the average timestamp at each pixel, similar to Mitrokhin et al. [5]. However, because we generate these images using bilinear interpolation on the pixel coordinates rather than rounding them, this operation is fully differentiable.

$$\begin{aligned} T_{\{+,-\}}(x, y, t') =&\frac{\sum _i \max (0, 1-|x-x_i'|)\max (0, 1-|y-y_i'|)t_i}{N(x, y)} \end{aligned}$$
(4)

where N(x, y) is the number of events contributing to each pixel. The loss is then the sum of the two squared images, as in Mitrokhin et al. [5].

$$\begin{aligned} \mathcal {L}_{\text {time}}(t')=&\sum _x \sum _y T_{+}(x, y, t')^2 + T_{-}(x, y, t')^2 \end{aligned}$$
(5)

As we scale the flow by \((t'-t^*_i)\) in (3), the gradient through events with timestamps closer to \(t'\) will be weighted lower. To resolve this unequal weighting, we compute the loss both backwards and forwards:

$$\begin{aligned} \mathcal {L}_{\text {time}} =&\mathcal {L}_{\text {time}}(t_0) + \mathcal {L}_{\text {time}}(t_{N-1}) \end{aligned}$$
(6)
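The following PyTorch sketch illustrates Eqs. (3)–(6): events are warped to a reference time with the predicted flow, splatted bilinearly into per-polarity average-timestamp images, and the squared images are summed at both \(t_0\) and \(t_{N-1}\). The tensor shapes, the treatment of N(x, y) as a sum of bilinear weights, and all function names are our assumptions rather than the authors' implementation.

```python
import torch

def timestamp_loss_at(t_ref, x, y, t_star, p, flow, H, W, eps=1e-5):
    """Warp events to t_ref with the predicted flow (Eq. 3), splat them into
    per-polarity average-timestamp images (Eq. 4), and return the sum of the
    squared images (Eq. 5). flow has shape (2, H, W); x, y, t_star, p are 1-D."""
    # Sample the per-pixel flow at each event's original (integer) location.
    u = flow[0, y.long(), x.long()]
    v = flow[1, y.long(), x.long()]
    xp = x + (t_ref - t_star) * u
    yp = y + (t_ref - t_star) * v

    loss = x.new_zeros(())
    for pol in (p > 0, p <= 0):                  # positive / negative events
        img = x.new_zeros(H * W)                 # weighted timestamp sum
        cnt = x.new_zeros(H * W)                 # N(x, y) as a sum of weights
        x0, y0 = xp[pol].floor(), yp[pol].floor()
        for dx in (0.0, 1.0):                    # bilinear splatting onto the
            for dy in (0.0, 1.0):                # four neighbouring pixels
                xi, yi = x0 + dx, y0 + dy
                w = ((1 - (xi - xp[pol]).abs()).clamp(min=0)
                     * (1 - (yi - yp[pol]).abs()).clamp(min=0))
                valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
                idx = yi[valid].long() * W + xi[valid].long()
                img.index_add_(0, idx, w[valid] * t_star[pol][valid])
                cnt.index_add_(0, idx, w[valid])
        T = (img / (cnt + eps)).view(H, W)       # average timestamp image
        loss = loss + (T ** 2).sum()
    return loss

def timestamp_loss(x, y, t_star, p, flow, H, W):
    """Eq. 6: evaluate the loss backwards (t' = 0 after scaling) and
    forwards (t' = B - 1 after scaling)."""
    return (timestamp_loss_at(0.0, x, y, t_star, p, flow, H, W)
            + timestamp_loss_at(float(t_star.max()), x, y, t_star, p, flow, H, W))
```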

We combine this loss with a spatial smoothness loss, \(\mathcal {L}_{\text {smoothness}}\), applied to the output flow, with our final loss being a weighted sum of the timestamp loss and the smoothness loss:

$$\begin{aligned} \mathcal {L}_{\text {total}}=&\mathcal {L}_{\text {time}} + \lambda \mathcal {L}_{\text {smoothness}} \end{aligned}$$
(7)
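The exact form of \(\mathcal {L}_{\text {smoothness}}\) and the weight \(\lambda \) are not specified in this abstract; the sketch below assumes a Charbonnier penalty on neighbouring flow differences and reuses `timestamp_loss` from the previous sketch.

```python
import torch

def smoothness_loss(flow, eps=1e-3):
    """Assumed form: Charbonnier penalty on horizontal and vertical flow differences."""
    dx = flow[:, :, 1:] - flow[:, :, :-1]
    dy = flow[:, 1:, :] - flow[:, :-1, :]
    return (torch.sqrt(dx ** 2 + eps ** 2).sum()
            + torch.sqrt(dy ** 2 + eps ** 2).sum())

def total_loss(x, y, t_star, p, flow, H, W, lam=1.0):
    """Eq. 7: weighted sum of the timestamp loss and the smoothness loss."""
    return timestamp_loss(x, y, t_star, p, flow, H, W) + lam * smoothness_loss(flow)
```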

Our network consists of an encoder-decoder architecture, as defined in Zhu et al. [6].

Table 1. Quantitative evaluation of our optical flow network against EV-FlowNet and UnFlow. Average Endpoint Error (AEE) is computed in pixels; % Outlier is computed as the percentage of points with AEE > 3 pix.
Fig. 1. Top: result from MVSEC, left to right: blurred event image, deblurred image, predicted flow, ground truth flow. Bottom: challenging scenes; top images show sparse flow vectors on the grayscale image, bottom images show the dense flow output, colored by direction. Left to right: fidget spinner spinning at 40 rad/s in the dark; ball thrown quickly (the grayscale image does not pick up the ball); water flowing outdoors. (Color figure online)

3 Experiments

For all experiments, we train our network on the outdoor_day2 sequence from MVSEC [8], consisting of 11 min of stereo event driving data. Each input to the network consists of 30,000 events, accumulated into volumes with spatial resolution 256 \(\times \) 256 and \(B=9\) bins. The model is trained for 300,000 iterations, and takes around 15 hours to train on an NVIDIA Tesla V100.

For evaluation, we tested on the same sequences as in EV-FlowNet [6], and present a comparison against their results as well as UnFlow [4]. We convert the output of our network, \((u, v)\), into units of pixel displacement by the following scale factor: \((\hat{u}, \hat{v})=(u,v) \times (B-1) \times dt/ (t_{N-1}-t_0)\), where dt is the test time window. From the quantitative results in Table 1, we can see that our method outperforms EV-FlowNet in almost all experiments, and nears the performance of UnFlow on the 1-frame sequences. As our event volume maintains the distribution of all of the events, we do not lose information in the presence of large motions, as EV-FlowNet does. Our network also generalizes to a number of challenging scenes, as can be seen in Fig. 1.
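As an illustration of this evaluation step, a hypothetical helper (not from the paper's code) that applies the scaling above and computes AEE and the outlier percentage from Table 1 might look as follows; masking to pixels that contain events, as in the EV-FlowNet protocol, is omitted for brevity.

```python
import numpy as np

def evaluate_flow(u, v, gt_du, gt_dv, dt, t0, tN1, B=9, thresh=3.0):
    """Convert predicted flow to pixel displacement over the test window dt,
    then compute AEE and the percentage of outliers (endpoint error > thresh)."""
    scale = (B - 1) * dt / (tN1 - t0)
    du, dv = u * scale, v * scale                     # pixel displacements
    ee = np.sqrt((du - gt_du) ** 2 + (dv - gt_dv) ** 2)
    return ee.mean(), 100.0 * (ee > thresh).mean()
```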

4 Conclusions

In this work, we demonstrate a novel input representation for event cameras, which, when combined with our motion compensation based loss function, allows a deep neural network to learn to predict optical flow from the event stream only.