Object Detection using Geometrical Context Feedback

  • Published in: International Journal of Computer Vision

Abstract

We propose a new coherent framework for joint object detection, 3D layout estimation, and object supporting region segmentation from a single image. Our approach is based on the mutual interactions among three novel modules: (i) object detector; (ii) scene 3D layout estimator; (iii) object supporting region segmenter. The interactions between such modules capture the contextual geometrical relationship between objects, the physical space including these objects, and the observer. An important property of our algorithm is that the object detector module is capable of adaptively changing its confidence in establishing whether a certain region of interest contains an object (or not) as new evidence is gathered about the scene layout. This enables an iterative estimation procedure where the detector becomes more and more accurate as additional evidence about a specific scene becomes available. Extensive quantitative and qualitative experiments are conducted on the table-top dataset (Sun et al. in ECCV, 2010b) and two publicly available datasets (Hoiem et al. in CVPR, 2006; Sudderth et al. in IJCV, 2008), and demonstrate competitive object detection, 3D layout estimation, and segmentation results.



Notes

  1. Here we omit the superscript o to keep the notation concise.

  2. When the area of the intersection between the foreground region (fg) and the object bounding box, divided by the area of the object bounding box, exceeds 0.5, the object is considered to sufficiently overlap the foreground region.

  3. The training instances and testing instances are separated.

  4. As explained in Bao et al. (2010) and in Sect. 2.2.2, at least 3 objects are necessary for estimating the layout.

  5. \(e_{H}=\frac{1}{N}\sum_{i}\left|\frac{\widehat{H}_{i}-H_{i}}{H_{i}}\right|\), where \(\widehat{H}_{i}\) and \(H_{i}\) are the best estimated and the ground-truth vanishing lines, respectively.
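
The overlap criterion of note 2 and the layout error \(e_{H}\) of note 5 are simple enough to state directly in code. The sketch below is our illustration, not the authors' evaluation script; the function names and argument conventions are assumptions.

```python
def sufficient_overlap(intersection_area, bbox_area, thresh=0.5):
    """Note 2: an object sufficiently overlaps the foreground region (fg)
    when area(fg ∩ bbox) / area(bbox) exceeds the threshold (0.5)."""
    return intersection_area / bbox_area > thresh


def layout_error(estimated, ground_truth):
    """Note 5: e_H = (1/N) * sum_i |(H_hat_i - H_i) / H_i|, the mean
    relative error between the best estimated and ground-truth
    vanishing lines."""
    assert len(estimated) == len(ground_truth) and len(estimated) > 0
    return sum(abs((h_hat - h) / h)
               for h_hat, h in zip(estimated, ground_truth)) / len(estimated)
```

For example, two estimates of 110 and 90 against ground-truth values of 100 each give a relative error of 0.1 for each instance, so \(e_{H}=0.1\).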

References

  • Bao, S. Y., Sun, M., & Savarese, S. (2010). Toward coherent object detection and scene layout understanding. In CVPR.
  • Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In ECCV.
  • Cornelis, N., Leibe, B., Cornelis, K., & Van Gool, L. (2006). 3D city modeling using cognitive loops. In 3DPVT.
  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
  • Dance, C., Willamowski, J., Fan, L., Bray, C., & Csurka, G. (2004). Visual categorization with bags of keypoints. In ECCV workshop on statistical learning in computer vision.
  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results.
  • Fei-Fei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV.
  • Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. IJCV.
  • Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. IJCV.
  • Fergus, R., Perona, P., & Zisserman, A. (2005). A sparse object category model for efficient learning and exhaustive recognition. In CVPR.
  • Gonfaus, J. M., Boix, X., van de Weijer, J., Bagdanov, A. D., Serrat, J., & Gonzàlez, J. (2010). Harmony potentials for joint classification and segmentation. In CVPR.
  • Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV.
  • Grauman, K., & Darrell, T. (2005). The pyramid match kernel: discriminative classification with sets of image features. In ICCV.
  • Gupta, A., & Davis, L. S. (2008). Beyond nouns: exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV.
  • Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV.
  • Heitz, G., Gould, S., Saxena, A., & Koller, D. (2008). Cascaded classification models: combining models for holistic scene understanding. In NIPS.
  • Hoiem, D., Efros, A. A., & Hebert, M. (2005). Geometric context from a single image. In ICCV.
  • Hoiem, D., Efros, A. A., & Hebert, M. (2006). Putting objects in perspective. In CVPR.
  • Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. IJCV.
  • Hoiem, D., Efros, A. A., & Hebert, M. (2008). Closing the loop on scene interpretation. In CVPR.
  • Ladicky, L., Russell, C., Kohli, P., & Torr, P. (2010). Graph cut based inference with co-occurrence statistics. In ECCV.
  • Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In ECCV workshop on statistical learning in computer vision.
  • Li, C., Kowdle, A., Saxena, A., & Chen, T. (2010). Towards holistic scene understanding: feedback enabled cascaded classification models. In NIPS.
  • Li, L. J., & Fei-Fei, L. (2007). What, where and who? Classifying events by scene and object recognition. In ICCV.
  • Li, L. J., Socher, R., & Fei-Fei, L. (2009). Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In CVPR.
  • Liebelt, J., & Schmid, C. (2010). Multi-view object class detection with a 3D geometric model. In CVPR.
  • Payet, N., & Todorovic, S. (2011). Scene shape from textures of objects. In CVPR.
  • Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, S. (2007). Objects in context. In ICCV.
  • Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: a database and web-based tool for image annotation. IJCV.
  • Savarese, S., & Fei-Fei, L. (2007). 3D generic object categorization, localization and pose estimation. In CVPR.
  • Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3D: learning 3D scene structure from a single still image. PAMI.
  • Su, H., Sun, M., Fei-Fei, L., & Savarese, S. (2009). Learning a dense multi-view representation for detection, viewpoint classification, and synthesis of object categories. In ICCV.
  • Sudderth, E. B., Torralba, A., Freeman, W. T., & Willsky, A. S. (2008). Describing visual scenes using transformed objects and parts. IJCV.
  • Sun, M., Su, H., Savarese, S., & Fei-Fei, L. (2009). A multi-view probabilistic model for 3D object classes. In CVPR.
  • Sun, M., Bao, S. Y., & Savarese, S. (2010a). Object detection with geometrical context feedback loop. In BMVC.
  • Sun, M., Bradski, G., Xu, B. X., & Savarese, S. (2010b). Depth-encoded Hough voting for coherent object detection, pose estimation, and shape recovery. In ECCV.
  • Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., & Van Gool, L. (2006). Towards multi-view object class detection. In CVPR.
  • Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. In ICCV.
  • Viola, P., & Jones, M. (2002). Robust real-time object detection. IJCV.


Acknowledgements

We acknowledge the support of NSF (Grant CNS 0931474) and the Gigascale Systems Research Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation Entity. We thank Gary Bradski for supporting the data collection of the table-top object dataset (Sun et al. 2010b).

Author information


Corresponding author

Correspondence to Min Sun.

Additional information

M. Sun and S.Y. Bao contributed equally to this paper.

Appendices

Appendix A: Derivation of Object Detection Score

In detail, we define V(O,x|D) as the sum of individual probabilities over all observed image patches at locations \(l_{j}\) and over all possible depths \(d_{j}^{p}\in D\), i.e.,

where the summation over j aggregates the evidence from the individual patch locations, and the summation over depths \(d_{j}^{p}\) marginalizes out the depth uncertainty at each image patch location. Since \(C_{j}\) is calculated deterministically from \(l_{j}\) and \(d_{j}^{p}\), and assuming O depends only on \(C_{j}\), we obtain:

We further assign image patches with different depths to different indices j. As a result, we can keep only the summation over the patch index j and obtain (1).
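
The intermediate equations above were rendered as images in the original and are omitted here, but the prose fully specifies the two summations. They can be sketched as follows; this is our illustration, not the authors' code, and the `detection_score` name and score-array layout are assumptions.

```python
import numpy as np

def detection_score(patch_scores):
    """Sketch of the double summation in Appendix A.

    patch_scores[j][p] holds the evidence contributed by image patch
    location l_j under depth hypothesis d_j^p.  Summing over p
    marginalizes out the depth uncertainty at each location; summing
    over j then aggregates the per-location evidence into one score.
    """
    patch_scores = np.asarray(patch_scores, dtype=float)
    per_location = patch_scores.sum(axis=1)  # marginalize depth hypotheses
    return float(per_location.sum())         # aggregate over patch locations
```

Reindexing each (location, depth) pair with its own index j, as the text describes, collapses the double sum into the single sum `patch_scores.ravel().sum()`, which yields the same value.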

Appendix B: Proof of Three Objects Requirement

Equation (14) admits one or at most two non-trivial solutions for \(\{f,n_{1},n_{2},n_{3}\}\) if at least three non-collinear observations \((u_{i},v_{i})\) are available in the image. If the observations are collinear, then (14) has an infinite number of solutions.

$$ \left[ \begin{array}{c@{\quad}c@{\quad}c} u_1 & v_1 & f\\ u_2 & v_2 & f\\ u_3 & v_3 & f\\ &\vdots&\\ u_N & v_N & f\\ \end{array} \right] \left( \begin{array}{c} n_1\\ n_2\\ n_3 \end{array} \right)= \left( \begin{array}{c} -\cos\phi_1 \sqrt{u_1^2+v_1^2+f^2}\\ -\cos\phi_2 \sqrt{u_2^2+v_2^2+f^2}\\ -\cos\phi_3 \sqrt{u_3^2+v_3^2+f^2}\\ \vdots\\ -\cos\phi_N \sqrt{u_N^2+v_N^2+f^2}\\ \end{array} \right) $$
(14)

Proof

Suppose at least three objects are not collinear in an image; then the rank of the matrix on the left-hand side of (14) is 3, so (14) provides 3 independent constraints. Recall that the unknowns in (14) are \(n_{1},n_{2},n_{3},f\). With these constraints, each of \(n_{1},n_{2},n_{3}\) can be expressed as a function of f, i.e. \(n_{i}=n_{i}(f)\). Because ∥n∥=1, we obtain an equation in f:

$$\sum_{i=1,\ldots,3}{n_i^2(f)}=1 $$

In the above equation, f appears only in the even powers \(f^{2}\) and \(f^{4}\). Therefore, there are at most two real positive solutions for f. Given f, \(\{n_{1},n_{2},n_{3}\}\) can be computed as \(n_{i}=n_{i}(f)\).

On the other hand, if all objects are collinear in the image, then (14) admits an infinite number of solutions. In this case, the rank of the matrix on the left-hand side of (14) is 2. Without loss of generality, assume \((u_{1},v_{1})\neq 0\). After Gaussian elimination, (14) takes the following form:

$$ \left[ \begin{array}{c@{\quad}c@{\quad}c} \alpha& \beta& f \\ \gamma& \epsilon& 0 \\ 0&0&0\\ &\vdots& \end{array} \right] \left( \begin{array}{c} n_1\\ n_2\\ n_3 \end{array} \right)= \left( \begin{array}{c} \zeta\\ \eta\\ 0\\ \vdots \end{array} \right) $$
(15)

If \(\widehat{f}, \widehat{n}_{1}, \widehat{n}_{2}, \widehat{n}_{3}\) is a solution, then \(\widehat{f}, \widehat{n}_{1}+km_{1}, \widehat{n}_{2}+km_{2}, \widehat{n}_{3}+km_{3}\) is also a solution of (15), where \((m_{1},m_{2},m_{3})\) is a non-trivial solution of the following equation:

$$\left[ \begin{array}{c@{\quad}c@{\quad}c} \alpha& \beta& f \\ \gamma& \epsilon& 0 \\ \end{array} \right] \left( \begin{array}{c} m_1\\ m_2\\ m_3 \end{array} \right)= 0 $$

Hence, (14) admits an infinite number of solutions. □
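
The counting argument can be made concrete numerically: for a fixed f, (14) is linear in n, so n(f) follows from least squares, and admissible focal lengths are the roots of \(g(f)=\Vert n(f)\Vert^{2}-1\). The sketch below is our illustration of this constraint structure, not the authors' implementation; the function names and the search bracket are assumptions.

```python
import numpy as np

def normal_given_f(uv, phi, f):
    """Solve Eq. (14) for n = (n1, n2, n3) in the least-squares sense,
    with the focal length f held fixed (the system is linear in n)."""
    A = np.column_stack([uv[:, 0], uv[:, 1], np.full(len(uv), float(f))])
    b = -np.cos(phi) * np.sqrt(uv[:, 0] ** 2 + uv[:, 1] ** 2 + f ** 2)
    n, *_ = np.linalg.lstsq(A, b, rcond=None)
    return n

def solve_f(uv, phi, f_lo=100.0, f_hi=2000.0, steps=400):
    """Find focal lengths with ||n(f)||^2 = 1 by scanning a grid for sign
    changes of g(f) = ||n(f)||^2 - 1 and bisecting each bracket.
    Consistent with the proof, at most two positive roots can appear."""
    def g(f):
        n = normal_given_f(uv, phi, f)
        return float(n @ n) - 1.0

    grid = np.linspace(f_lo, f_hi, steps)
    roots = []
    for a, b in zip(grid[:-1], grid[1:]):
        if g(a) * g(b) <= 0.0:           # a root is bracketed in [a, b]
            lo, hi = a, b
            for _ in range(60):          # plain bisection
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
    return roots
```

With three non-collinear synthetic observations generated from a known f and unit normal, the scan recovers that focal length (and the corresponding n) among its roots, matching the at-most-two-solutions claim of the proposition.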



Cite this article

Sun, M., Bao, S.Y. & Savarese, S. Object Detection using Geometrical Context Feedback. Int J Comput Vis 100, 154–169 (2012). https://doi.org/10.1007/s11263-012-0547-2

