Learning to See by Moving: Self-supervising 3D Scene Representations for Perception, Control, and Visual Reasoning
We propose learning frameworks for artificial agents to acquire several aspects of visual common sense (instantiating and retrieving object concepts, reasoning about space and 3D geometry, manipulating diverse objects) while moving and interacting with 3D environments. Current state-of-the-art visual systems can achieve human-level object recognition performance on Internet photos, but their performance degrades drastically when applied to videos captured by a moving camera. The performance gap stems from a large difference in image statistics: in Internet photos, objects are centered, unoccluded, and in canonical scales and poses; in photos captured by mobile agents, objects come in a wide variety of scales, poses, locations, and occlusion configurations. How can machines learn to see without relying on humans to detect and center the interesting content in images and videos? We explore neural architectures and training schemes for learning visual scene
representations that can work under a moving camera, and that can exploit the moving camera's changing viewpoints to self-improve without human annotations. This thesis revisits
the paradigm of vision as inference of a 3D scene representation, also known as "vision as inverse graphics". However, instead of inferring explicit 3D representations
such as meshes or point clouds, we infer learnable 3D feature representations from RGB or RGBD inputs. The feature representations can be optimized by training end-to-end with many task objectives, including object detection, view prediction, object dynamics prediction, and object manipulation. The proposed models integrate recent advances in Simultaneous Localization And Mapping (SLAM) and deep learning. Similar to SLAM, our model generates stable 3D scene representations that retain information about the size, shape, and spatial arrangement of objects, which permits object permanence to emerge across camera viewpoints, despite changes in the field of view. Different from SLAM, which constructs a 3D
point cloud map of a scene by piecing together multi-view images, our model learns to infer a complete 3D scene feature map even from a single view. The feature map
encodes task-relevant semantic information, much more than just object occupancy or 3D surfaces. We demonstrate the effectiveness of the proposed differentiable 2D-to-3D feature mapping in multiple tasks, including detecting objects in 3D, predicting 3D object interactions, manipulating diverse objects, recognizing visual concepts, grounding language expressions, and generating 3D scenes that comply with a language utterance. We show that the proposed models can supervise themselves using unlabelled data and outperform supervised models on the tasks above.
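The differentiable 2D-to-3D feature mapping described above can be illustrated with a minimal sketch: each cell of a 3D feature grid is filled by projecting its center through the camera and copying the 2D feature at the pixel it lands on. The function name, grid dimensions, frustum bounds, and nearest-neighbor sampling below are illustrative assumptions, not the thesis's actual architecture (which uses differentiable sampling and learned features).

```python
import numpy as np

def unproject_features(feat2d, K, depth_range=(0.5, 3.5), grid=(16, 16, 16)):
    """Lift a 2D feature map (C, H, W) into a 3D voxel feature grid
    (C, X, Y, Z) using pinhole intrinsics K.

    Each voxel center is projected into the image; the feature at that
    pixel is copied into the voxel (nearest-neighbor, for brevity).
    Hypothetical sketch: grid extents and sampling are assumptions.
    """
    C, H, W = feat2d.shape
    X, Y, Z = grid
    vox = np.zeros((C, X, Y, Z), dtype=feat2d.dtype)
    # Voxel centers: normalized x/y span the frustum, z spans depth_range.
    xs = np.linspace(-1.0, 1.0, X)
    ys = np.linspace(-1.0, 1.0, Y)
    zs = np.linspace(depth_range[0], depth_range[1], Z)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            for k, z in enumerate(zs):
                # Project the 3D point (x*z, y*z, z) to pixel coordinates.
                p = K @ np.array([x * z, y * z, z])
                u, v = p[0] / p[2], p[1] / p[2]
                ui, vi = int(round(u)), int(round(v))
                if 0 <= ui < W and 0 <= vi < H:
                    vox[:, i, j, k] = feat2d[:, vi, ui]
    return vox

# Toy usage: lift an 8x8 map of 4-channel features into a 16^3 grid.
feat2d = np.ones((4, 8, 8), dtype=np.float32)
K = np.array([[4.0, 0.0, 4.0],
              [0.0, 4.0, 4.0],
              [0.0, 0.0, 1.0]])
vox = unproject_features(feat2d, K)
```

In the actual models, the nearest-neighbor lookup would be replaced by bilinear sampling so that gradients from downstream task losses (detection, view prediction, manipulation) can flow back through the unprojection into the 2D feature extractor.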
History
Date
- 2021-03-22
Degree Type
- Dissertation
Department
- Machine Learning
Degree Name
- Doctor of Philosophy (PhD)