Abstract
Scene flow is the 3D motion field of the points in a scene, whose projections correspond to image pixels. Most existing algorithms require a calibrated stereo rig before flow can be estimated, which places strong restrictions on camera placement. This paper proposes a scene flow estimation algorithm for a monocular camera. First, an energy functional is constructed in which three assumptions are turned into data terms: brightness constancy, gradient constancy, and constancy of object velocity over short time intervals. Two smoothness operators serve as regularization terms. Next, an occlusion-map computation algorithm ensures that scene flow is estimated only at unoccluded points. The energy functional is then minimized with a coarse-to-fine variational scheme on a Gaussian pyramid, which prevents the iteration from converging to a poor local minimum. Experimental results show that the algorithm recovers scene flow in world coordinates from as few as three sequential frames, without requiring optical flow or disparity as input.
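As a reading aid, an energy functional built from these assumptions typically takes a form like the following. This is a hedged sketch, not the paper's exact formulation: $\mathbf{w}$ denotes the scene flow, $\Psi$ a robust penalty, and the weights $\gamma$, $\mu$, $\alpha$ are illustrative placeholders.

$$
E(\mathbf{w}) = \int_\Omega \Psi\!\big(|I(\mathbf{x}+\mathbf{w}) - I(\mathbf{x})|^2\big)\,d\mathbf{x}
+ \gamma \int_\Omega \Psi\!\big(|\nabla I(\mathbf{x}+\mathbf{w}) - \nabla I(\mathbf{x})|^2\big)\,d\mathbf{x}
+ \mu \int_\Omega \Psi\!\big(|\mathbf{w}_{t} - \mathbf{w}_{t-1}|^2\big)\,d\mathbf{x}
+ \alpha \int_\Omega \Psi\!\big(|\nabla \mathbf{w}|^2\big)\,d\mathbf{x}
$$

The first three integrals correspond to the brightness constancy, gradient constancy, and short-time velocity constancy data terms; the last stands in for the smoothness regularization.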
Acknowledgments
The authors would like to thank the anonymous reviewers for their insightful comments and suggestions. This work is supported in part by the National Natural Science Foundation of China (Grant Nos. 61272062 and 61300036) and by the Projects in the National Science & Technology Pillar Program (Grant No. 2013BAH38F01).
Appendix
1.1 Iterative process
Our iterative process is divided into two layers. The outer layer traverses the Gaussian pyramid from coarse to fine; iterations at this level update the unknowns themselves. At each pyramid level, the inner iterations obtain the increments of the unknowns by SOR iteration. As Fig. 12 shows, the Gaussian pyramid is built according to the given number of outer iterations, and during construction we compute a scaling factor for each level. To preserve the correct correspondence between image space and world space, the scaling factor must be applied not only to the image resolution but also to the focal length and optical center of the camera. In the inner iteration, the initial values of the scene flow are set to zero. Starting from the lowest-resolution level of the Gaussian pyramid, SOR iteration computes the increments of the unknowns until convergence or until the iteration limit is reached. The final value of each inner iteration is added to the current outer-layer estimate and used as the initial value for the next outer layer. Algorithm 2 shows the whole iteration process.
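To make the two-layer structure concrete, the sketch below outlines it in Python (NumPy + OpenCV). It is illustrative only, not the paper's implementation: build_pyramid, sor_step, and estimate_scene_flow are hypothetical names, and sor_step is a stub standing in for the actual SOR relaxation of the linearized energy.

```python
import cv2
import numpy as np

def build_pyramid(img, K, levels, scale=0.5):
    """Gaussian pyramid that rescales the camera intrinsics (focal
    length and optical center) together with the image resolution."""
    imgs, Ks = [img], [K.astype(np.float64)]
    for _ in range(1, levels):
        imgs.append(cv2.pyrDown(imgs[-1]))
        Kl = Ks[-1].copy()
        Kl[:2, :] *= scale            # fx, fy, cx, cy shrink with the image
        Ks.append(Kl)
    return imgs[::-1], Ks[::-1]       # coarsest level first

def sor_step(flow, imgs, K, omega=1.9):
    """Placeholder for one SOR sweep on the linearized Euler-Lagrange
    equations; a real implementation assembles the data and smoothness
    terms of the energy functional and relaxes the resulting system."""
    return np.zeros_like(flow)        # hypothetical: no actual solve here

def estimate_scene_flow(frames, K, outer_iters=10, inner_iters=10):
    pyramids = [build_pyramid(f, K, outer_iters) for f in frames]
    flow = None
    for level in range(outer_iters):             # outer layer: coarse to fine
        imgs = [p[0][level] for p in pyramids]
        K_level = pyramids[0][1][level]
        h, w = imgs[0].shape[:2]
        if flow is None:
            flow = np.zeros((h, w, 3))           # initial scene flow is zero
        else:
            # Scene flow lives in world coordinates, so prolongation to the
            # finer level only resamples the grid, without rescaling values.
            flow = cv2.resize(flow, (w, h))
        for _ in range(inner_iters):             # inner layer: SOR increments
            d = sor_step(flow, imgs, K_level)
            flow += d                            # add increment to estimate
            if np.abs(d).max() < 1e-4:           # stop early on convergence
                break
    return flow
```

Note how the intrinsic matrix is rescaled alongside the image at every level; this is what keeps image space and world space consistent, as the text above requires.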
The numbers of inner and outer iterations must be chosen before the iterative process runs. The number of outer iterations determines the number of pyramid levels; our experiments set it to 10 because of memory limitations. Figure 13 shows the relationship between the erroneous percentage and the number of inner iterations, measured at the last outer layer. We set the number of inner iterations to 10, since the polyline shows that more than 10 iterations cause over-smoothing.

Cite this article
Xiao, D., Yang, Q., Yang, B. et al. Monocular scene flow estimation via variational method. Multimed Tools Appl 76, 10575–10597 (2017). https://doi.org/10.1007/s11042-015-3091-6