research-article

Scale-flow: Estimating 3D Motion from Video

Authors:

Zichen Wang

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Pages 6530 - 6538

https://doi.org/10.1145/3503161.3547979

Published: 10 October 2022

Abstract

This paper addresses the problem of normalized scene flow (NSF): given a pair of RGB video frames, estimate the 3D motion, which consists of optical flow and motion-in-depth. NSF is a powerful tool for action prediction and autonomous robot navigation, and it has the advantage of needing only a monocular, uncalibrated camera. However, most existing methods regress motion-in-depth directly from two RGB frames or from optical flow, which yields inaccurate and non-robust results. Our key insight is the scale matching scheme: establishing correlations between two frames that contain objects at different scales in order to estimate dense and continuous motion-in-depth. Based on scale matching, we propose a unified framework, Scale-flow, which combines scale matching and optical flow estimation. This combination allows optical flow estimation to use dense and continuous scale information for the first time, so that moving foreground objects can be estimated more accurately. On KITTI, our monocular approach achieves the lowest error in the foreground scene flow task, even compared with multi-camera methods. Moreover, on the motion-in-depth estimation task, Scale-flow reduces the error by 34% compared with the best published method. Code will be available.
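
The link between scale and motion-in-depth can be illustrated with a small worked example. The sketch below is not the Scale-flow network; it only assumes a pinhole camera and an object that neither rotates nor deforms, so that apparent size is inversely proportional to depth. Under that assumption, if an object appears `scale_ratio` times larger in the second frame, the depth ratio tau = Z2 / Z1 is roughly 1 / scale_ratio. The function names and the constant-velocity time-to-contact formula are illustrative assumptions, not part of the paper.

```python
import math


def motion_in_depth_from_scale(scale_ratio: float) -> float:
    """Return tau = Z2 / Z1 given scale_ratio = size_in_frame2 / size_in_frame1.

    Assumes a pinhole camera and a rigid, non-rotating object (hypothetical helper).
    """
    if scale_ratio <= 0:
        raise ValueError("scale_ratio must be positive")
    return 1.0 / scale_ratio


def time_to_contact(tau: float, frame_interval: float) -> float:
    """Time from the first frame until depth reaches zero, assuming constant velocity.

    Depth evolves as Z(t) = Z1 * (1 - (1 - tau) * t / frame_interval),
    so Z(t) = 0 at t = frame_interval / (1 - tau).
    Returns infinity when the object is static or receding (tau >= 1).
    """
    closing = 1.0 - tau
    if closing <= 0:
        return math.inf
    return frame_interval / closing


# An object that looks 25% larger in the next frame, with frames 0.1 s apart.
tau = motion_in_depth_from_scale(1.25)
ttc = time_to_contact(tau, frame_interval=0.1)
print(f"tau = {tau:.3f}, time to contact = {ttc:.3f} s")
```

A tau below 1 means the point moved toward the camera and a tau above 1 means it moved away. No camera intrinsics are needed for this ratio, which is consistent with the abstract's point that an uncalibrated monocular camera suffices.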

Supplementary Material

MP4 File (MM22-fp0946.mp4, 98.85 MB)

In this video, we introduce a new normalized scene flow method, namely Scale-flow, which can robustly extract 3D motion from video.

Cited By

Li C, Qian Y, Zhang S, Wang C, Yang M (2025) FP-TTC: Fast Prediction of Time-to-Collision Using Monocular Images. IEEE Transactions on Circuits and Systems for Video Technology 35:2 (1028-1040). Online publication date: Feb-2025
https://doi.org/10.1109/TCSVT.2024.3468625

Zhang J, Meng Z, Xu M (2024) Beimin: Serverless-based Adaptive Real-Time Video Processing. 2024 IEEE International Conference on Multimedia and Expo (ICME) (1-6). Online publication date: 15-Jul-2024
https://doi.org/10.1109/ICME57554.2024.10687715

Ling H, Sun Q, Sun Y, Xu X, Li X (2024) ADFactory: An Effective Framework for Generalizing Optical Flow With NeRF. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (20591-20600). Online publication date: 16-Jun-2024
https://doi.org/10.1109/CVPR52733.2024.01946

Index Terms

Scale-flow: Estimating 3D Motion from Video
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Scene understanding
        Vision for robotics
      2. Image and video acquisition
        Motion capture

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

October 2022

7537 pages

ISBN:9781450392037

DOI:10.1145/3503161

General Chairs: João Magalhães (NOVA University of Lisbon, Portugal), Alberto del Bimbo (University of Florence, Italy), Shin'ichi Satoh (National Institute of Informatics, Japan), Nicu Sebe (University of Trento, Italy)

Program Chairs: Xavier Alameda-Pineda (Inria, Grenoble, France), Qin Jin (Renmin University of China, China), Vincent Oria (New Jersey Institute of Technology, USA), Laura Toni (University College London, UK)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
the Open Project Program of the State Key Lab of CAD and CG
the State Key Lab. Foundation for Novel Software Technology of Nanjing University
the Sichuan Science and Technology Program

Conference

MM '22

Sponsor:

SIGMM

MM '22: The 30th ACM International Conference on Multimedia

October 10 - 14, 2022

Lisboa, Portugal

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics

Total Citations: 3
Total Downloads: 277
Downloads (Last 12 months): 123
Downloads (Last 6 weeks): 8

Reflects downloads up to 01 Mar 2025
