Abstract
We propose a new coherent framework for joint object detection, 3D layout estimation, and object supporting region segmentation from a single image. Our approach is based on the mutual interactions among three novel modules: (i) object detector; (ii) scene 3D layout estimator; (iii) object supporting region segmenter. The interactions between such modules capture the contextual geometrical relationship between objects, the physical space including these objects, and the observer. An important property of our algorithm is that the object detector module is capable of adaptively changing its confidence in establishing whether a certain region of interest contains an object (or not) as new evidence is gathered about the scene layout. This enables an iterative estimation procedure where the detector becomes more and more accurate as additional evidence about a specific scene becomes available. Extensive quantitative and qualitative experiments are conducted on the table-top dataset (Sun et al. in ECCV, 2010b) and two publicly available datasets (Hoiem et al. in CVPR, 2006; Sudderth et al. in IJCV, 2008), and demonstrate competitive object detection, 3D layout estimation, and segmentation results.
Notes
Here we omit the superscript o for notational conciseness.
When the area of the intersection between the foreground region (fg) and the object bounding box, divided by the area of the object bounding box, is greater than 0.5, the object is considered to have sufficient overlap with the foreground region.
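This overlap criterion (intersection area over the object-box area, not intersection-over-union) can be sketched as follows. For illustration the foreground region is approximated by a rectangle; the function names are our own, not from the paper:

```python
def intersection_area(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns the area of their overlap.
    w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(0, w) * max(0, h)

def sufficient_overlap(fg_box, obj_box, threshold=0.5):
    # Intersection area normalized by the *object box* area, as in the footnote.
    obj_area = (obj_box[2] - obj_box[0]) * (obj_box[3] - obj_box[1])
    return intersection_area(fg_box, obj_box) / obj_area > threshold
```

Note the asymmetry: a small object fully inside a large foreground region still passes, whereas a large object barely touching it does not.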
The training instances and testing instances are separated.
\(e_{H}=\frac{1}{N}\sum_{i}\left|\frac{\widehat{H}_{i}-H_{i}}{H_{i}}\right|\), where \(\widehat{H}_{i}\) and \(H_{i}\) are the best estimated and ground-truth vanishing lines for image i, respectively.
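The error metric above is a mean absolute relative error over the N test images; a minimal sketch (function name is ours):

```python
def horizon_error(estimated, ground_truth):
    # e_H = (1/N) * sum_i |(H_hat_i - H_i) / H_i|,
    # i.e. the mean absolute relative error of the estimated vanishing lines.
    assert len(estimated) == len(ground_truth)
    return sum(abs((e - g) / g)
               for e, g in zip(estimated, ground_truth)) / len(estimated)
```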
References
Bao, S. Y., Sun, M., & Savarese, S. (2010). Toward coherent object detection and scene layout understanding. In CVPR.
Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In ECCV.
Cornelis, N., Leibe, B., Cornelis, K., & Van Gool, L. (2006). 3D city modeling using cognitive loops. In 3DPVT.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
Dance, C., Willamowski, J., Fan, L., Bray, C., & Csurka, G. (2004). Visual categorization with bags of keypoints. In ECCV workshop on statistical learning in computer vision.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results.
Fei-Fei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. In IJCV.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. In IJCV.
Fergus, R., Perona, P., & Zisserman, A. (2005). A sparse object category model for efficient learning and exhaustive recognition. In CVPR.
Gonfaus, J. M., Boix, X., van de Weijer, J., Bagdanov, A. D., Serrat, J., & Gonzàlez, J. (2010). Harmony potentials for joint classification and segmentation. In CVPR.
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV.
Grauman, K., & Darrell, T. (2005). The pyramid match kernel: discriminative classification with sets of image features. In ICCV.
Gupta, A., & Davis, L. S. (2008). Beyond nouns: exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV.
Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV.
Heitz, G., Gould, S., Saxena, A., & Koller, D. (2008). Cascaded classification models: combining models for holistic scene understanding. In NIPS.
Hoiem, D., Efros, A. A., & Hebert, M. (2005). Geometric context from a single image. In ICCV.
Hoiem, D., Efros, A. A., & Hebert, M. (2006). Putting objects in perspective. In CVPR.
Hoiem, D., Efros, A., & Hebert, M. (2007). Recovering surface layout from an image. In IJCV.
Hoiem, D., Efros, A. A., & Hebert, M. (2008). Closing the loop on scene interpretation. In CVPR.
Ladicky, L., Russell, C., Kohli, P., & Torr, P. (2010). Graph cut based inference with co-occurrence statistics. In ECCV.
Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In ECCV workshop on statistical learning in computer vision.
Li, C., Kowdle, A., Saxena, A., & Chen, T. (2010). Towards holistic scene understanding: feedback enabled cascaded classification models. In NIPS.
Li, L. J., & Fei-Fei, L. (2007). What, where and who? classifying event by scene and object recognition. In ICCV.
Li, L. J., Socher, R., & Fei-Fei, L. (2009). Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In CVPR.
Liebelt, J., & Schmid, C. (2010). Multi-view object class detection with a 3D geometric model. In CVPR.
Payet, N., & Todorovic, S. (2011). Scene shape from textures of objects. In CVPR.
Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, S. (2007). Objects in context. In ICCV.
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). Labelme: A database and web-based tool for image annotation. In IJCV.
Savarese, S., & Fei-Fei, L. (2007). 3D generic object categorization, localization and pose estimation. In CVPR.
Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3D: learning 3D scene structure from a single still image. In PAMI.
Su, H., Sun, M., Fei-Fei, L., & Savarese, S. (2009). Learning a dense multi-view representation for detection, viewpoint classification, and synthesis of object categories. In ICCV.
Sudderth, E. B., Torralba, A., Freeman, W. T., & Willsky, A. S. (2008). Describing visual scenes using transformed objects and parts. In IJCV.
Sun, M., Su, H., Savarese, S., & Fei-Fei, L. (2009). A multi-view probabilistic model for 3D object classes. In CVPR.
Sun, M., Bao, S. Y., & Savarese, S. (2010a). Object detection with geometrical context feedback loop. In BMVC.
Sun, M., Bradski, G., Xu, B. X., & Savarese, S. (2010b). Depth-encoded hough voting for coherent object detection, pose estimation, and shape recovery. In ECCV.
Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., & Van Gool, L. (2006). Towards multi-view object class detection. In CVPR.
Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. In ICCV.
Viola, P., & Jones, M. (2002). Robust real-time object detection. In IJCV.
Acknowledgements
We acknowledge the support of NSF (Grant CNS 0931474) and the Gigascale Systems Research Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation Entity. We thank Gary Bradski for supporting the data collection of the table-top object dataset (Sun et al. 2010b).
Additional information
M. Sun and S.Y. Bao contributed equally to this paper.
Appendices
Appendix A: Derivation of the Object Detection Score
In detail, we define V(O,x|D) as the sum of individual probabilities over all observed image patches at locations \(l_{j}\) and over all possible depths \(d_{j}^{p}\in D\), i.e.,
where the summation over j aggregates the evidence from the individual patch locations, and the summation over the depths \(d_{j}^{p}\) marginalizes out the depth uncertainty at each image patch location. Since \(C_{j}\) is computed deterministically from \(l_{j}\) and \(d_{j}^{p}\), and assuming that O depends only on \(C_{j}\), we obtain:
We further assign image patches with different depths to distinct indices j. As a result, we can keep only the summation over the patch index j and obtain (1).
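The three steps above can be sketched as follows. Since the displayed equations are not reproduced here, this is only a plausible reconstruction; the probability notation \(p(\cdot)\) and the exact conditioning are assumptions consistent with the surrounding text:

```latex
V(O, x \mid D)
  = \sum_{j} \sum_{d_{j}^{p} \in D} p\bigl(O, x \mid l_{j}, d_{j}^{p}\bigr)
  = \sum_{j} \sum_{d_{j}^{p} \in D} p\bigl(O, x \mid C_{j}(l_{j}, d_{j}^{p})\bigr)
  = \sum_{j'} p\bigl(O, x \mid C_{j'}\bigr),
```

where the last equality re-indexes each (location, depth) pair by a single patch index \(j'\), yielding the single summation of (1).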
Appendix B: Proof of Three Objects Requirement
Equation (14) admits one or at most two non-trivial solutions for \(\{f, n_{1}, n_{2}, n_{3}\}\) if at least three non-aligned observations \((u_{i}, v_{i})\) (i.e., non-collinear in the image) are available. If the observations are collinear, then (14) has an infinite number of solutions.
Proof
Suppose at least three objects are non-collinear in the image; then the rank of the matrix on the left-hand side of (14) is 3, so (14) provides three independent constraints. Recall that the unknowns in (14) are \(n_{1}, n_{2}, n_{3}, f\). With these constraints, each of \(n_{1}, n_{2}, n_{3}\) can be expressed as a function of f, i.e., \(n_{i}=n_{i}(f)\). Because ∥n∥=1, we obtain an equation in f:
In the above equation, f appears only through \(f^{2}\) and \(f^{4}\). Therefore, there are at most two real positive solutions for f. Given f, \(\{n_{1}, n_{2}, n_{3}\}\) can be computed as \(n_{i}=n_{i}(f)\).
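To make the counting argument explicit: because only even powers of f appear, the constraint \(\|n(f)\|^{2}=1\) can be rearranged into a quadratic in \(f^{2}\). The coefficients a, b, c below stand for expressions in the observations \((u_{i}, v_{i})\) that we do not reproduce here:

```latex
a f^{4} + b f^{2} + c = 0
\quad\Longrightarrow\quad
f^{2} = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a},
```

which yields at most two positive values of \(f^{2}\), and hence at most two real positive focal lengths f.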
On the other hand, if all objects are collinear in the image, the rank of the matrix on the left-hand side of (14) is 2, and an infinite number of solutions of (14) exist. Without loss of generality, assume \((u_{1}, v_{1})\neq 0\). In this case, after Gaussian elimination, (14) takes the following form:
If \(\widehat{f}, \widehat{n}_{1}, \widehat{n}_{2}, \widehat{n}_{3}\) is a solution, then \(\widehat{f}, \widehat{n}_{1}+km_{1}, \widehat{n}_{2}+km_{2}, \widehat{n}_{3}+km_{3}\) is also a solution of (15), where \((m_{1}, m_{2}, m_{3})\) is a non-trivial solution of the following equation:
Hence, (14) admits an infinite number of solutions. □
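The rank argument at the heart of this proof can be checked numerically. For illustration we assume each observation contributes a row of the form \((u_{i}, v_{i}, 1)\) (the exact rows of (14) are not reproduced here, but any rows built linearly from the image coordinates behave the same way under collinearity):

```python
import numpy as np

def observation_rank(points):
    # Stack one row [u_i, v_i, 1] per observed object (illustrative row form).
    A = np.array([[u, v, 1.0] for u, v in points])
    return np.linalg.matrix_rank(A)

# Three non-collinear image points -> rank 3: the unknowns can be eliminated
# uniquely as functions of f.
print(observation_rank([(0, 0), (1, 0), (0, 1)]))   # 3
# Three collinear points -> rank 2: a one-parameter family of solutions
# remains, matching the infinite-solution case of the proof.
print(observation_rank([(0, 0), (1, 1), (2, 2)]))   # 2
```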
Cite this article
Sun, M., Bao, S.Y. & Savarese, S. Object Detection using Geometrical Context Feedback. Int J Comput Vis 100, 154–169 (2012). https://doi.org/10.1007/s11263-012-0547-2