Abstract
We propose a new coherent framework for joint object detection, 3D layout estimation, and object supporting region segmentation from a single image. Our approach is based on the mutual interactions among three novel modules: (i) object detector; (ii) scene 3D layout estimator; (iii) object supporting region segmenter. The interactions between such modules capture the contextual geometrical relationship between objects, the physical space including these objects, and the observer. An important property of our algorithm is that the object detector module is capable of adaptively changing its confidence in establishing whether a certain region of interest contains an object (or not) as new evidence is gathered about the scene layout. This enables an iterative estimation procedure where the detector becomes more and more accurate as additional evidence about a specific scene becomes available. Extensive quantitative and qualitative experiments are conducted on the table-top dataset (Sun et al. in ECCV, 2010b) and two publicly available datasets (Hoiem et al. in CVPR, 2006; Sudderth et al. in IJCV, 2008), and demonstrate competitive object detection, 3D layout estimation, and segmentation results.
Notes
Here we omit the superscript o for notational conciseness.
When the area of the intersection between the foreground region (fg) and the object bounding box, divided by the area of the object bounding box, is greater than 0.5, the object is considered to have sufficient overlap with the foreground region.
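This overlap criterion (intersection area over the object-box area, not intersection-over-union) can be sketched as follows. For illustration the foreground region is approximated by a rectangle; the function names are our own, not from the paper:

```python
def intersection_area(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns the area of their overlap.
    w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(0, w) * max(0, h)

def sufficient_overlap(fg_box, obj_box, threshold=0.5):
    # Intersection area normalized by the *object box* area, as in the footnote.
    obj_area = (obj_box[2] - obj_box[0]) * (obj_box[3] - obj_box[1])
    return intersection_area(fg_box, obj_box) / obj_area > threshold
```

Note the asymmetry: a small object fully inside a large foreground region still passes, whereas a large object barely touching it does not.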
The training instances and testing instances are separated.
\(e_{H}=\frac{1}{N}\sum_{i}\left|\frac{\widehat{H}_{i}-H_{i}}{H_{i}}\right|\), where \(\widehat{H}_{i}\) and \(H_{i}\) are the best estimated and ground-truth vanishing lines for image i, respectively.
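The error metric above is a mean absolute relative error over the N test images; a minimal sketch (function name is ours):

```python
def horizon_error(estimated, ground_truth):
    # e_H = (1/N) * sum_i |(H_hat_i - H_i) / H_i|,
    # i.e. the mean absolute relative error of the estimated vanishing lines.
    assert len(estimated) == len(ground_truth)
    return sum(abs((e - g) / g)
               for e, g in zip(estimated, ground_truth)) / len(estimated)
```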
References
Bao, S. Y., Sun, M., & Savarese, S. (2010). Toward coherent object detection and scene layout understanding. In CVPR.
Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In ECCV.
Cornelis, N., Leibe, B., Cornelis, K., & Van Gool, L. (2006). 3D city modeling using cognitive loops. In 3DPVT.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
Dance, C., Willamowski, J., Fan, L., Bray, C., & Csurka, G. (2004). Visual categorization with bags of keypoints. In ECCV workshop on statistical learning in computer vision.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results.
Fei-Fei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. In IJCV.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. In IJCV.
Fergus, R., Perona, P., & Zisserman, A. (2005). A sparse object category model for efficient learning and exhaustive recognition. In CVPR.
Gonfaus, J. M., Boix, X., van de Weijer, J., Bagdanov, A. D., Serrat, J., & Gonzàlez, J. (2010). Harmony potentials for joint classification and segmentation. In CVPR.
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV.
Grauman, K., & Darrell, T. (2005). The pyramid match kernel: discriminative classification with sets of image features. In ICCV.
Gupta, A., & Davis, L. S. (2008). Beyond nouns: exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV.
Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV.
Heitz, G., Gould, S., Saxena, A., & Koller, D. (2008). Cascaded classification models: combining models for holistic scene understanding. In NIPS.
Hoiem, D., Efros, A. A., & Hebert, M. (2005). Geometric context from a single image. In ICCV.
Hoiem, D., Efros, A. A., & Hebert, M. (2006). Putting objects in perspective. In CVPR.
Hoiem, D., Efros, A., & Hebert, M. (2007). Recovering surface layout from an image. In IJCV.
Hoiem, D., Efros, A. A., & Hebert, M. (2008). Closing the loop on scene interpretation. In CVPR.
Ladicky, L., Russell, C., Kohli, P., & Torr, P. (2010). Graph cut based inference with co-occurrence statistics. In ECCV.
Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In ECCV workshop on statistical learning in computer vision.
Li, C., Kowdle, A., Saxena, A., & Chen, T. (2010). Towards holistic scene understanding: feedback enabled cascaded classification models. In NIPS.
Li, L. J., & Fei-Fei, L. (2007). What, where and who? classifying event by scene and object recognition. In ICCV.
Li, L. J., Socher, R., & Fei-Fei, L. (2009). Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In CVPR.
Liebelt, J., & Schmid, C. (2010). Multi-view object class detection with a 3D geometric model. In CVPR.
Payet, N., & Todorovic, S. (2011). Scene shape from textures of objects. In CVPR.
Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, S. (2007). Objects in context. In ICCV.
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). Labelme: A database and web-based tool for image annotation. In IJCV.
Savarese, S., & Fei-Fei, L. (2007). 3D generic object categorization, localization and pose estimation. In CVPR.
Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3D: learning 3D scene structure from a single still image. In PAMI.
Su, H., Sun, M., Fei-Fei, L., & Savarese, S. (2009). Learning a dense multi-view representation for detection, viewpoint classification, and synthesis of object categories. In ICCV.
Sudderth, E. B., Torralba, A., Freeman, W. T., & Willsky, A. S. (2008). Describing visual scenes using transformed objects and parts. In IJCV.
Sun, M., Su, H., Savarese, S., & Fei-Fei, L. (2009). A multi-view probabilistic model for 3D object classes. In CVPR.
Sun, M., Bao, S. Y., & Savarese, S. (2010a). Object detection with geometrical context feedback loop. In BMVC.
Sun, M., Bradski, G., Xu, B. X., & Savarese, S. (2010b). Depth-encoded hough voting for coherent object detection, pose estimation, and shape recovery. In ECCV.
Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., & Van Gool, L. (2006). Towards multi-view object class detection. In CVPR.
Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. In ICCV.
Viola, P., & Jones, M. (2002). Robust real-time object detection. In IJCV.
Acknowledgements
We acknowledge the support of NSF (Grant CNS 0931474) and the Gigascale Systems Research Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation Entity. We thank Gary Bradski for supporting the data collection of the table-top object dataset (Sun et al. 2010b).
Additional information
M. Sun and S.Y. Bao contributed equally to this paper.
Appendices
Appendix A: Derivation of the Object Detection Score
In detail, we define V(O,x|D) as the sum of individual probabilities over all observed image patches at locations \(l_{j}\) and over all possible depths \(d_{j}^{p}\in D\), i.e.,
where the summation over j aggregates the evidence from the individual patch locations, and the summation over the depths \(d_{j}^{p}\) marginalizes out the depth uncertainty at each image patch location. Since \(C_{j}\) is computed deterministically from \(l_{j}\) and \(d_{j}^{p}\), and assuming that O depends only on \(C_{j}\), we obtain:
We further assign image patches with different depths to distinct indices j. As a result, we can keep only the summation over the patch index j and obtain (1).
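The three steps above can be sketched as follows. Since the displayed equations are not reproduced here, this is only a plausible reconstruction; the probability notation \(p(\cdot)\) and the exact conditioning are assumptions consistent with the surrounding text:

```latex
V(O, x \mid D)
  = \sum_{j} \sum_{d_{j}^{p} \in D} p\bigl(O, x \mid l_{j}, d_{j}^{p}\bigr)
  = \sum_{j} \sum_{d_{j}^{p} \in D} p\bigl(O, x \mid C_{j}(l_{j}, d_{j}^{p})\bigr)
  = \sum_{j'} p\bigl(O, x \mid C_{j'}\bigr),
```

where the last equality re-indexes each (location, depth) pair by a single patch index \(j'\), yielding the single summation of (1).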
Appendix B: Proof of Three Objects Requirement
Equation (14) admits one or at most two non-trivial solutions for \(\{f, n_{1}, n_{2}, n_{3}\}\) if at least three non-aligned observations \((u_{i}, v_{i})\) (i.e., non-collinear in the image) are available. If the observations are collinear, then (14) has an infinite number of solutions.
Proof
Suppose at least three objects are non-collinear in the image; then the rank of the matrix on the left-hand side of (14) is 3, so (14) provides three independent constraints. Recall that the unknowns in (14) are \(n_{1}, n_{2}, n_{3}, f\). With these constraints, each of \(n_{1}, n_{2}, n_{3}\) can be expressed as a function of f, i.e., \(n_{i}=n_{i}(f)\). Because ∥n∥=1, we obtain an equation in f:
In the above equation, f appears only through \(f^{2}\) and \(f^{4}\). Therefore, there are at most two real positive solutions for f. Given f, \(\{n_{1}, n_{2}, n_{3}\}\) can be computed as \(n_{i}=n_{i}(f)\).
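To make the counting argument explicit: because only even powers of f appear, the constraint \(\|n(f)\|^{2}=1\) can be rearranged into a quadratic in \(f^{2}\). The coefficients a, b, c below stand for expressions in the observations \((u_{i}, v_{i})\) that we do not reproduce here:

```latex
a f^{4} + b f^{2} + c = 0
\quad\Longrightarrow\quad
f^{2} = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a},
```

which yields at most two positive values of \(f^{2}\), and hence at most two real positive focal lengths f.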
On the other hand, if all objects are collinear in the image, the rank of the matrix on the left-hand side of (14) is 2, and an infinite number of solutions of (14) exist. Without loss of generality, assume \((u_{1}, v_{1})\neq 0\). In this case, after Gaussian elimination, (14) takes the following form:
If \(\widehat{f}, \widehat{n}_{1}, \widehat{n}_{2}, \widehat{n}_{3}\) is a solution, then \(\widehat{f}, \widehat{n}_{1}+km_{1}, \widehat{n}_{2}+km_{2}, \widehat{n}_{3}+km_{3}\) is also a solution of (15), where \((m_{1}, m_{2}, m_{3})\) is a non-trivial solution of the following equation:
Hence, (14) admits an infinite number of solutions. □
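The rank argument at the heart of this proof can be checked numerically. For illustration we assume each observation contributes a row of the form \((u_{i}, v_{i}, 1)\) (the exact rows of (14) are not reproduced here, but any rows built linearly from the image coordinates behave the same way under collinearity):

```python
import numpy as np

def observation_rank(points):
    # Stack one row [u_i, v_i, 1] per observed object (illustrative row form).
    A = np.array([[u, v, 1.0] for u, v in points])
    return np.linalg.matrix_rank(A)

# Three non-collinear image points -> rank 3: the unknowns can be eliminated
# uniquely as functions of f.
print(observation_rank([(0, 0), (1, 0), (0, 1)]))   # 3
# Three collinear points -> rank 2: a one-parameter family of solutions
# remains, matching the infinite-solution case of the proof.
print(observation_rank([(0, 0), (1, 1), (2, 2)]))   # 2
```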
Cite this article
Sun, M., Bao, S.Y. & Savarese, S. Object Detection using Geometrical Context Feedback. Int J Comput Vis 100, 154–169 (2012). https://doi.org/10.1007/s11263-012-0547-2