Estimation of 3D Category-Specific Object Structure: Symmetry, Manhattan and/or Multiple Images

International Journal of Computer Vision

Abstract

Many man-made objects have intrinsic symmetries and often Manhattan structure. Assuming an orthographic or a weak-perspective projection model, this paper addresses the estimation of 3D structure and camera projection using symmetry and/or Manhattan structure cues, for the two cases when the input is a single image or multiple images of objects from the same category, e.g. different cars seen from various viewpoints. More specifically, analysis of the single-image case shows that the Manhattan structure alone is sufficient to recover the camera projection, after which the 3D structure can be reconstructed uniquely by exploiting symmetry. But Manhattan structure can be hard to observe in a single image due to occlusion. Hence, we extend to the multiple-image case, which can also exploit symmetry but does not require Manhattan structure. We propose novel structure-from-motion methods for both rigid and non-rigid object deformations, which exploit symmetry and use multiple images from the same object category as input. We perform experiments on the Pascal3D+ dataset with either human-labeled 2D keypoints or 2D keypoints localized by a convolutional neural network. The results show that our methods, which exploit symmetry, significantly outperform the baseline methods.
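
As a rough illustration of the projection model and symmetry cue described above, the following sketch (ours, not the authors' code; all names and values are illustrative) projects a bilaterally symmetric set of 3D keypoints with an orthographic camera. The mirrored half of the structure is obtained from the other half by a reflection, so only half of the 3D points are independent unknowns.

```python
import numpy as np

def orthographic_project(S, R, t):
    """Project 3D points S (3 x P) using the top two rows of rotation R and a 2D translation t."""
    return R[:2, :] @ S + t[:, None]

# Toy symmetric structure: columns are keypoints; the second half mirrors the
# first half across the x = 0 plane, so only S_half is independent.
S_half = np.array([[1.0, 1.0, 2.0],
                   [0.5, 1.5, 0.0],
                   [0.0, 1.0, 1.0]])      # 3 x 3 "left-side" keypoints
A = np.diag([-1.0, 1.0, 1.0])             # reflection about the symmetry plane
S = np.hstack([S_half, A @ S_half])       # full 3 x 6 symmetric structure

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal camera rotation
t = np.array([0.3, -0.2])                          # 2D image-plane translation
W = orthographic_project(S, R, t)                  # 2 x 6 observed keypoints
print(W.round(3))
```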

Notes

  1. However, the general framework in Hong and Fitzgibbon (2015) cannot be applied to SfM directly, because it does not constrain all the keypoints within the same frame to share the same translation. Instead, Hong and Fitzgibbon (2015) focused on better optimization of rank-r matrix factorization and on runtime.

  2. Note that we impose hard constraints on \(\mathbb {{\bar{S}}}\) and \(\mathbb {{\bar{S}}}^{\dag }\), i.e. we replace \(\mathbb {{\bar{S}}}^{\dag }\) by \({\mathcal {A}}_P \mathbb {{\bar{S}}}\) in Eq. (57), because this property is guaranteed by our Sym-RSfM initialization in Sect. 6. In contrast, the PCA initialization of \({\mathbf {V}}\) and \({\mathbf {V}}^{\dag }\) cannot guarantee such a property, so a Lagrange multiplier term is used for the constraint on \({\mathbf {V}}\) and \({\mathbf {V}}^{\dag }\) in the following Eq. (61).

  3. For the subtypes of more categories, please refer to the Pascal3D+ official website at http://cvgl.stanford.edu/projects/pascal3d.html.

  4. For the rigid case, since we use images from the same subtype as input (so that we can reasonably assume a rigid deformation among them), we also report the rotation error per subtype for the rigid experiments.

  5. As there is no baseline method for comparison, we also report the average rotation error measured by the mean geodesic distance \(\frac{1}{N} \sum _{n=1}^{N} ||\log ({R_n^{\text {aligned}}}^\top R_n^*) ||_\text {F} / \sqrt{2}\), which is the angle between the two rotation matrices (see the sketch after these notes). The resulting rotation error is 4.1766 degrees on average.

  6. As analyzed in Remark 10 and Eq. (38), the number of allowed deformation bases K and the number of keypoint pairs P satisfy \(K \le P/3\).

  7. This is because self-occluded features can be recovered from training images taken from different viewpoints, whereas the training data cannot exhaustively cover the various occlusions introduced by other objects or the various truncation patterns.

  8. They are not directly comparable because (i) Tables 1 and 2 use 2D annotations from Bourdev et al. (2010) [the same as those used in Kar et al. (2015)], while the keypoint localization network for Tables 4 and 5 is trained on 2D annotations from Pascal3D+ (Xiang et al. 2014); and (ii) we exclude objects that are occluded by others or truncated in Tables 4 and 5 [as in Pavlakos et al. (2017)] because the stacked hourglass network (Newell et al. 2016) does not produce satisfactory results on those images.
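
The rotation-error metric in Note 5 can be made concrete with a short sketch (ours, not the authors' code): for 3x3 rotation matrices, the geodesic distance \(||\log (R_1^\top R_2)||_\text {F} / \sqrt{2}\) equals the rotation angle between them, which can be computed from the trace without an explicit matrix logarithm.

```python
import numpy as np

def geodesic_rotation_error_deg(R_aligned, R_star):
    """Angle in degrees between two 3x3 rotation matrices,
    i.e. ||log(R_aligned^T R_star)||_F / sqrt(2)."""
    R = R_aligned.T @ R_star
    # For a rotation by angle theta: trace(R) = 1 + 2*cos(theta) and
    # ||log(R)||_F = sqrt(2) * theta, so the metric reduces to theta itself.
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

# Averaging over N aligned/ground-truth pairs, as in the footnote:
# mean_error = np.mean([geodesic_rotation_error_deg(Ra, Rg)
#                       for Ra, Rg in zip(R_aligned_list, R_gt_list)])
```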

References

  • Agudo, A., Agapito, L., Calvo, B., & Montiel, J. (2014). Good vibrations: A modal analysis approach for sequential non-rigid structure from motion. In CVPR (pp. 1558–1565).

  • Akhter, I., Sheikh, Y., & Khan, S. (2009). In defense of orthonormality constraints for nonrigid structure from motion. In CVPR.

  • Akhter, I., Sheikh, Y., Khan, S., & Kanade, T. (2008). Nonrigid structure from motion in trajectory space. In NIPS.

  • Akhter, I., Sheikh, Y., Khan, S., & Kanade, T. (2011). Trajectory space: A dual representation for nonrigid structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7), 1442–1456.

  • Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

  • Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In ECCV.

  • Bregler, C., Hertzmann, A., & Biermann, H. (2000). Recovering non-rigid 3D shape from image streams. In CVPR.

  • Ceylan, D., Mitra, N. J., Zheng, Y., & Pauly, M. (2014). Coupled structure-from-motion and 3D symmetry detection for urban facades. ACM Transactions on Graphics, 33, 2. https://doi.org/10.1145/2517348.

  • Chen, X., & Yuille, A. L. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS (pp. 1736–1744).

  • Coughlan, J. M., & Yuille, A. L. (1999). Manhattan world: Compass direction from a single image by Bayesian inference. In ICCV.

  • Coughlan, J. M., & Yuille, A. L. (2003). Manhattan world: Orientation and outlier detection by Bayesian inference. Neural Computation, 15(5), 1063–1088.

  • Dai, Y., Li, H., & He, M. (2012). A simple prior-free method for non-rigid structure-from-motion factorization. In CVPR.

  • Dai, Y., Li, H., & He, M. (2014). A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision, 107, 101–122.

  • Furukawa, Y., Curless, B., Seitz, S. M., & Szeliski, R. (2009). Manhattan-world stereo. In CVPR.

  • Gao, Y., Ma, J., Zhao, M., Liu, W., & Yuille, A. L. (2019). NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction. In CVPR.

  • Gao, Y., & Yuille, A. L. (2016). Symmetry non-rigid structure from motion for category-specific object structure estimation. In ECCV.

  • Gao, Y., & Yuille, A. L. (2017). Exploiting symmetry and/or Manhattan properties for 3D object structure estimation from single and multiple images. In CVPR.

  • Gordon, G. G. (1990). Shape from symmetry. In Proceedings of SPIE.

  • Gotardo, P., & Martinez, A. (2011). Computing smooth time-trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 2051–2065.

  • Grossmann, E., Ortin, D., & Santos-Victor, J. (2002). Single and multi-view reconstruction of structured scenes. In ACCV.

  • Grossmann, E., & Santos-Victor, J. (2002). Maximum likelihood 3D reconstruction from one or more images under geometric constraints. In BMVC.

  • Grossmann, E., & Santos-Victor, J. (2005). Least-squares 3D reconstruction from one or more views and geometric clues. Computer Vision and Image Understanding, 99(2), 151–174.

  • Hamsici, O. C., Gotardo, P. F., & Martinez, A. M. (2012). Learning spatially-smooth mappings in non-rigid structure from motion. In ECCV (pp. 260–273).

  • Hartley, R. I., & Zisserman, A. (2004). Multiple view geometry in computer vision (2nd ed.). Cambridge: Cambridge University Press.

  • Hong, J. H., & Fitzgibbon, A. (2015). Secrets of matrix factorization: Approximations, numerics, manifold optimization and random restarts. In ICCV.

  • Hong, W., Yang, A. Y., Huang, K., & Ma, Y. (2004). On symmetry and multiple-view geometry: Structure, pose, and calibration from a single image. International Journal of Computer Vision, 60, 241–265.

  • Kar, A., Tulsiani, S., Carreira, J., & Malik, J. (2015). Category-specific object reconstruction from a single image. In CVPR.

  • Kontsevich, L. L. (1993). Pairwise comparison technique: A simple solution for depth reconstruction. JOSA A, 10(6), 1129–1135.

  • Kontsevich, L. L., Kontsevich, M. L., & Shen, A. K. (1987). Two algorithms for reconstructing shapes. Optoelectronics, Instrumentation and Data Processing, 5, 76–81.

  • Li, Y., & Pizlo, Z. (2007). Reconstruction of shapes of 3D symmetric objects by using planarity and compactness constraints. In Proceedings of SPIE-IS&T electronic imaging.

  • Ma, J., Zhao, J., Tian, J., Tu, Z., & Yuille, A. L. (2013). Robust estimation of nonrigid transformation for point set registration. In CVPR (pp. 2147–2154).

  • Marques, M., & Costeira, J. (2009). Estimating 3D shape from degenerate sequences with missing data. Computer Vision and Image Understanding, 113(2), 261–272.

  • Ma, J., Zhao, J., Ma, Y., & Tian, J. (2015). Non-rigid visible and infrared face registration via regularized Gaussian fields criterion. Pattern Recognition, 48(3), 772–784.

  • Ma, J., Zhao, J., Tian, J., Bai, X., & Tu, Z. (2013). Regularized vector field learning with sparse approximation for mismatch removal. Pattern Recognition, 46(12), 3519–3532.

  • Morris, D. D., Kanatani, K., & Kanade, T. (2001). Gauge fixing for accurate 3D estimation. In CVPR.

  • Mukherjee, D. P., Zisserman, A., & Brady, M. (1995). Shape from symmetry: Detecting and exploiting symmetry in affine images. Philosophical Transactions: Physical Sciences and Engineering, 351, 77–106.

  • Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European conference on computer vision (pp. 483–499). Springer.

  • Olsen, S. I., & Bartoli, A. (2008). Implicit non-rigid structure-from-motion with priors. Journal of Mathematical Imaging and Vision, 31(2–3), 233–244.

  • Pavlakos, G., Zhou, X., Chan, A., Derpanis, K. G., & Daniilidis, K. (2017). 6-DoF object pose from semantic keypoints. In 2017 IEEE international conference on robotics and automation (ICRA) (pp. 2011–2018). IEEE.

  • Rosen, J. (2011). Symmetry discovered: Concepts and applications in nature and science. Mineola: Dover Publications.

  • Schönemann, P. H. (1966). A generalized solution of the orthogonal procrustes problem. Psychometrika, 31, 1–10.

  • Thrun, S., & Wegbreit, B. (2005). Shape from symmetry. In ICCV.

  • Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2), 137–154.

  • Torresani, L., Hertzmann, A., & Bregler, C. (2003). Learning non-rigid 3D shape from 2D motion. In NIPS.

  • Torresani, L., Hertzmann, A., & Bregler, C. (2008). Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 878–892.

  • Vetter, T., & Poggio, T. (1994). Symmetric 3D objects are an easy case for 2D object recognition. Spatial Vision, 8, 443–453.

  • Vicente, S., Carreira, J., Agapito, L., & Batista, J. (2014). Reconstructing PASCAL VOC. In CVPR.

  • Xiang, Y., Mottaghi, R., & Savarese, S. (2014). Beyond pascal: A benchmark for 3D object detection in the wild. In WACV.

  • Xiao, J., Chai, J., & Kanade, T. (2004). A closed-form solution to nonrigid shape and motion recovery. In ECCV.

Acknowledgements

We would like to thank Ehsan Jahangiri, Cihang Xie, Weichao Qiu, Xuan Dong, and Siyuan Qiao for their feedback on the manuscript. This work was partially supported by ARO 62250-CS, ONR N00014-15-1-2356, and the NSF award CCF-1317376.

Author information

Corresponding author

Correspondence to Yuan Gao.

Additional information

Communicated by Cordelia Schmid.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 268 KB)

Cite this article

Gao, Y., Yuille, A.L. Estimation of 3D Category-Specific Object Structure: Symmetry, Manhattan and/or Multiple Images. Int J Comput Vis 127, 1501–1526 (2019). https://doi.org/10.1007/s11263-019-01195-z
