
1 Introduction

Event cameras, such as the sensor of Lichtsteiner et al. [3], are a neuromorphically inspired, asynchronous sensing modality that detects changes in log light intensity. The changes are encoded as events, \(e=\{x,y,t,p\}\), consisting of the pixel position, \((x, y)\), the timestamp, t, accurate to microseconds, and the polarity, p. These cameras provide numerous benefits, such as extremely low latency for tracking very fast motions, high dynamic range, and significantly lower power consumption.

Recently, several methods have shown that flow and other motion information can be estimated by ‘deblurring’ the event image [1, 5, 7]. For frame data, unsupervised optical flow methods such as [2, 4] have shown that neural networks can learn to predict optical flow from geometric constraints, without any ground truth labels.

In this work, we propose a novel input representation that captures the full spatiotemporal distribution of the events, and a novel unsupervised loss function that allows for efficient learning of motion information from only the event stream. Our input representation, a discretized event volume, discretizes the time domain, and then accumulates events in a linearly weighted fashion similar to interpolation. We train a neural network to predict a per-pixel optical flow from this input, which we use to attempt to deblur the events through motion compensation. During training, we then apply a loss that measures the motion blur in the motion compensated image, which the network is trained to minimize.

2 Method

We propose a novel input representation generated by discretizing the time domain. In order to improve the resolution along the temporal domain beyond the number of bins, we insert events into this volume using a linearly weighted accumulation similar to bilinear interpolation.

Given a set of N input events \(\{(x_i, y_i, t_i, p_i)\}_{i=0,\dots ,N-1}\), we divide the range of the timestamps, \(t_{N-1} - t_0\), which varies depending on the input events, into B bins. We then scale the timestamps to the range \([0, B-1]\), and generate the event volume as follows:

$$\begin{aligned} t^*_i =&(B-1)(t_i - t_0)/(t_{N-1} - t_0)\end{aligned}$$
(1)
$$\begin{aligned} V(x,y,t)=&\sum _{i} p_i\max (0, 1-|x-x_i|)\max (0, 1-|y-y_i|)\max (0, 1-|t-t^*_i|) \end{aligned}$$
(2)

We treat the time domain as channels in a traditional 2D image, and perform 2D convolution across the x and y spatial dimensions.
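As a concrete illustration, a minimal NumPy sketch of Eqs. (1) and (2) is given below. The function name, the array-based interface, and the assumption that event coordinates are integer pixel locations (so only the temporal kernel actually interpolates) are ours, not taken from the authors' released code.

```python
import numpy as np

def build_event_volume(x, y, t, p, H, W, B=9):
    """Accumulate events into a (B, H, W) volume with linearly weighted bins."""
    # Scale timestamps to the range [0, B-1] (Eq. 1).
    t_star = (B - 1) * (t - t[0]) / (t[-1] - t[0])
    volume = np.zeros((B, H, W), dtype=np.float32)
    # Events arrive on integer pixel coordinates, so the spatial kernels in
    # Eq. 2 reduce to a delta at (x_i, y_i); only time is interpolated here.
    for b in range(B):
        w_t = np.maximum(0.0, 1.0 - np.abs(b - t_star))   # temporal kernel
        np.add.at(volume[b], (y.astype(int), x.astype(int)), p * w_t)
    return volume
```

The B temporal slices returned here are what the network consumes as input channels.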

Given optical flow for each pixel, \(u(x, y)\), \(v(x, y)\), we propagate the events, with scaled timestamps, \(\{(x_i,y_i,t^*_i,p_i)\}_{i=0,\dots ,N-1}\), to a single time \(t'\):

$$\begin{aligned} \begin{pmatrix}x_i'\\ y_i'\end{pmatrix}=&\begin{pmatrix}x_i\\ y_i\end{pmatrix}+(t'-t^*_i)\begin{pmatrix}u(x_i,y_i)\\ v(x_i,y_i)\end{pmatrix} \end{aligned}$$
(3)

We then separate these propagated events by polarity and generate a pair of images, \(T_{+}, T_{-}\), consisting of the average timestamp at each pixel, similar to Mitrokhin et al. [5]. However, because we generate these images using bilinear interpolation on the pixel coordinates rather than rounding them, this operation is fully differentiable.

$$\begin{aligned} T_{\{+,-\}}(x, y, t') =&\frac{\sum _i \max (0, 1-|x-x_i'|)\max (0, 1-|y-y_i'|)t_i}{N(x, y)} \end{aligned}$$
(4)

where N(x, y) is the number of events contributing to each pixel. The loss is then the sum of the two squared images, as in Mitrokhin et al. [5].

$$\begin{aligned} \mathcal {L}_{\text {time}}(t')=&\sum _x \sum _y T_{+}(x, y, t')^2 + T_{-}(x, y, t')^2 \end{aligned}$$
(5)

As we scale the flow by \((t'-t^*_i)\) in (3), the gradient through events with timestamps closer to \(t'\) will be weighted lower. To resolve this unequal weighting, we compute the loss both backwards and forwards:

$$\begin{aligned} \mathcal {L}_{\text {time}} =&\mathcal {L}_{\text {time}}(t_0) + \mathcal {L}_{\text {time}}(t_{N-1}) \end{aligned}$$
(6)
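The following PyTorch sketch illustrates Eqs. (3)–(6): events are warped to a reference time with the predicted flow, splatted bilinearly into per-polarity average-timestamp images, and the squared images are summed at both \(t_0\) and \(t_{N-1}\). The tensor shapes, the treatment of N(x, y) as a sum of bilinear weights, and all function names are our assumptions rather than the authors' implementation.

```python
import torch

def timestamp_loss_at(t_ref, x, y, t_star, p, flow, H, W, eps=1e-5):
    """Warp events to t_ref with the predicted flow (Eq. 3), splat them into
    per-polarity average-timestamp images (Eq. 4), and return the sum of the
    squared images (Eq. 5). flow has shape (2, H, W); x, y, t_star, p are 1-D."""
    # Sample the per-pixel flow at each event's original (integer) location.
    u = flow[0, y.long(), x.long()]
    v = flow[1, y.long(), x.long()]
    xp = x + (t_ref - t_star) * u
    yp = y + (t_ref - t_star) * v

    loss = x.new_zeros(())
    for pol in (p > 0, p <= 0):                  # positive / negative events
        img = x.new_zeros(H * W)                 # weighted timestamp sum
        cnt = x.new_zeros(H * W)                 # N(x, y) as a sum of weights
        x0, y0 = xp[pol].floor(), yp[pol].floor()
        for dx in (0.0, 1.0):                    # bilinear splatting onto the
            for dy in (0.0, 1.0):                # four neighbouring pixels
                xi, yi = x0 + dx, y0 + dy
                w = ((1 - (xi - xp[pol]).abs()).clamp(min=0)
                     * (1 - (yi - yp[pol]).abs()).clamp(min=0))
                valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
                idx = yi[valid].long() * W + xi[valid].long()
                img.index_add_(0, idx, w[valid] * t_star[pol][valid])
                cnt.index_add_(0, idx, w[valid])
        T = (img / (cnt + eps)).view(H, W)       # average timestamp image
        loss = loss + (T ** 2).sum()
    return loss

def timestamp_loss(x, y, t_star, p, flow, H, W):
    """Eq. 6: evaluate the loss backwards (t' = 0 after scaling) and
    forwards (t' = B - 1 after scaling)."""
    return (timestamp_loss_at(0.0, x, y, t_star, p, flow, H, W)
            + timestamp_loss_at(float(t_star.max()), x, y, t_star, p, flow, H, W))
```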

We combine this loss with a spatial smoothness loss, \(\mathcal {L}_{\text {smoothness}}\), applied to the output flow, with our final loss being a weighted sum of the timestamp loss and the smoothness loss:

$$\begin{aligned} \mathcal {L}_{\text {total}}=&\mathcal {L}_{\text {time}} + \lambda \mathcal {L}_{\text {smoothness}} \end{aligned}$$
(7)
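The exact form of \(\mathcal {L}_{\text {smoothness}}\) and the weight \(\lambda \) are not specified in this abstract; the sketch below assumes a Charbonnier penalty on neighbouring flow differences and reuses `timestamp_loss` from the previous sketch.

```python
import torch

def smoothness_loss(flow, eps=1e-3):
    """Assumed form: Charbonnier penalty on horizontal and vertical flow differences."""
    dx = flow[:, :, 1:] - flow[:, :, :-1]
    dy = flow[:, 1:, :] - flow[:, :-1, :]
    return (torch.sqrt(dx ** 2 + eps ** 2).sum()
            + torch.sqrt(dy ** 2 + eps ** 2).sum())

def total_loss(x, y, t_star, p, flow, H, W, lam=1.0):
    """Eq. 7: weighted sum of the timestamp loss and the smoothness loss."""
    return timestamp_loss(x, y, t_star, p, flow, H, W) + lam * smoothness_loss(flow)
```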

Our network consists of an encoder-decoder architecture, as defined in Zhu et al. [6].

Table 1. Quantitative evaluation of our optical flow network against EV-FlowNet and UnFlow. Average Endpoint Error (AEE) is computed in pixels; % Outlier is computed as the percentage of points with AEE > 3 pix.
Fig. 1. Top: result from MVSEC, left to right: blurred event image, deblurred image, predicted flow, ground truth flow. Bottom: challenging scenes; top images show sparse flow vectors on the grayscale image, bottom images show the dense flow output, colored by direction. Left to right: fidget spinner spinning at 40 rad/s in the dark; ball thrown quickly (the grayscale image does not pick up the ball); water flowing outdoors. (Color figure online)

3 Experiments

For all experiments, we train our network on the outdoor_day2 sequence from MVSEC [8], consisting of 11 min of stereo event driving data. Each input to the network consists of 30,000 events, accumulated into volumes with spatial resolution 256 \(\times \) 256 and \(B=9\) bins. The model is trained for 300,000 iterations, and takes around 15 hours to train on an NVIDIA Tesla V100.

For evaluation, we tested on the same sequences as in EV-FlowNet [6], and present a comparison against their results as well as UnFlow [4]. We convert the output of our network, \((u, v)\), into units of pixel displacement by the following scale factor: \((\hat{u}, \hat{v})=(u,v) \times (B-1) \times dt/ (t_{N-1}-t_0)\), where dt is the test time window. From the quantitative results in Table 1, we can see that our method outperforms EV-FlowNet in almost all experiments, and nears the performance of UnFlow on the 1-frame sequences. As our event volume maintains the distribution of all of the events, we do not lose information in the presence of large motions, as EV-FlowNet does. Our network also generalizes to a number of challenging scenes, as can be seen in Fig. 1.
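As an illustration of this evaluation step, a hypothetical helper (not from the paper's code) that applies the scaling above and computes AEE and the outlier percentage from Table 1 might look as follows; masking to pixels that contain events, as in the EV-FlowNet protocol, is omitted for brevity.

```python
import numpy as np

def evaluate_flow(u, v, gt_du, gt_dv, dt, t0, tN1, B=9, thresh=3.0):
    """Convert predicted flow to pixel displacement over the test window dt,
    then compute AEE and the percentage of outliers (endpoint error > thresh)."""
    scale = (B - 1) * dt / (tN1 - t0)
    du, dv = u * scale, v * scale                     # pixel displacements
    ee = np.sqrt((du - gt_du) ** 2 + (dv - gt_dv) ** 2)
    return ee.mean(), 100.0 * (ee > thresh).mean()
```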

4 Conclusions

In this work, we demonstrate a novel input representation for event cameras, which, when combined with our motion compensation based loss function, allows a deep neural network to learn to predict optical flow from the event stream only.