DOI: 10.1145/3503161.3547979

Scale-flow: Estimating 3D Motion from Video

Published: 10 October 2022

Abstract

This paper addresses the problem of normalized scene flow (NSF): given a pair of RGB video frames, estimate the 3D motion, which consists of optical flow and motion-in-depth. NSF is a powerful tool for action prediction and autonomous robot navigation, with the advantage of needing only a monocular, uncalibrated camera. However, most existing methods regress motion-in-depth directly from two RGB frames or from optical flow, which yields inaccurate and non-robust results. Our key insight is a scale matching scheme: establishing correlations between two frames that contain objects at different scales in order to estimate dense and continuous motion-in-depth. Based on scale matching, we propose a unified framework, Scale-flow, which combines scale matching and optical flow estimation. This combination allows optical flow estimation to use dense and continuous scale information for the first time, so that moving foreground objects can be estimated more accurately. On KITTI, our monocular approach achieves the lowest error in the foreground scene flow task, even compared with multi-camera methods. Moreover, on the motion-in-depth estimation task, Scale-flow reduces the error by 34% compared with the best published method. Code will be available.
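
To make the link between apparent scale and motion-in-depth concrete, the following minimal sketch relates the two quantities for a single object. It is not the Scale-flow network or its matching scheme. It assumes a pinhole camera, a non-rotating and roughly fronto-parallel object (so image size is inversely proportional to depth), and constant velocity along the optical axis. Motion-in-depth is taken here as the depth ratio tau = Z2 / Z1, which is an assumed convention; the function names `motion_in_depth_from_scale` and `time_to_contact` are hypothetical.

```python
import math


def motion_in_depth_from_scale(scale_ratio: float) -> float:
    """Return tau = Z2 / Z1 from the apparent size ratio s = size2 / size1.

    Assumption: image size is proportional to 1 / depth, so tau = 1 / s.
    """
    if scale_ratio <= 0:
        raise ValueError("scale_ratio must be positive")
    return 1.0 / scale_ratio


def time_to_contact(tau: float, frame_dt: float) -> float:
    """Return the time until depth reaches zero, measured from the first frame.

    Assumption: constant velocity in depth, so Z1 / (Z1 - Z2) * dt = dt / (1 - tau).
    An object that is static or receding (tau >= 1) never makes contact.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if tau >= 1.0:
        return math.inf
    return frame_dt / (1.0 - tau)


if __name__ == "__main__":
    # An object that appears 25% larger in the second frame, at 10 frames per second.
    tau = motion_in_depth_from_scale(1.25)
    print("tau:", tau, "time to contact (s):", time_to_contact(tau, 0.1))
```

Under these assumptions neither quantity requires camera intrinsics or absolute depth, which is consistent with the abstract's point that a monocular, uncalibrated camera suffices.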

Supplementary Material

MP4 File (MM22-fp0946.mp4)
In this video, we introduce a new normalized scene flow method, namely Scale-flow, which can robustly extract 3D motion from video.

Cited By

  • (2025) FP-TTC: Fast Prediction of Time-to-Collision Using Monocular Images. IEEE Transactions on Circuits and Systems for Video Technology 35:2 (1028-1040). DOI: 10.1109/TCSVT.2024.3468625. Online publication date: Feb-2025
  • (2024) Beimin: Serverless-based Adaptive Real-Time Video Processing. 2024 IEEE International Conference on Multimedia and Expo (ICME) (1-6). DOI: 10.1109/ICME57554.2024.10687715. Online publication date: 15-Jul-2024
  • (2024) ADFactory: An Effective Framework for Generalizing Optical Flow With NeRF. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (20591-20600). DOI: 10.1109/CVPR52733.2024.01946. Online publication date: 16-Jun-2024

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. action prediction
  2. motion-in-depth
  3. optical flow
  4. scale matching

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • the Open Project Program of the State Key Lab of CAD and CG
  • the State Key Lab. Foundation for Novel Software Technology of Nanjing University
  • the Sichuan Science and Technology Program

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 123
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 01 Mar 2025
