Man-made environments tend to be abundant with planar homogeneous texture, which manifests as regularly repeating scene elements along a plane. In this work, we propose to exploit such structure to facilitate high-level scene understanding. By robustly fitting a texture projection model to optimal dominant frequency estimates in image patches, we arrive at a projective-invariant method to localize such generic, semantically meaningful regions in multi-planar scenes. The recovered projective parameters also allow an affine-ambiguous rectification in real-world images marred by outliers, room clutter, and photometric severities. Comprehensive qualitative and quantitative evaluations show that our method outperforms existing representative work for both rectification and detection. The potential of homogeneous texture for two scene understanding tasks is then explored. First, in environments where vanishing points cannot be reliably detected, or the Manhattan assumption is not satisfied, homogeneous texture detected by the proposed approach is shown to provide alternative cues for obtaining the geometric layout of a scene. Second, low-level feature descriptors extracted upon affine rectification of detected texture are found to be not only class-discriminative but also complementary to features without rectification, improving recognition performance on the 67-category MIT benchmark of indoor scenes. One of our configurations involving deep ConvNet features outperforms most current state-of-the-art work on this dataset, achieving a classification accuracy of 76.90%. The approach is additionally validated on a set of 31 categories (mostly outdoor man-made environments exhibiting regular, repeating structure) drawn from the large-scale Places2 scene dataset.

The tilde symbol (\(\tilde{}\)) is used to denote an instantaneous quantity in (Super and Bovik 1995a, b). In this paper, however, it denotes an estimated quantity; the instantaneous nature is already clear from writing the quantity as a function of \(\mathbf {x}\). Accordingly, equality (\(=\)) is used in Eq. 10 instead of the approximate equality (\(\approx \)) appearing in (Super and Bovik 1995a, b).
We define drift as deviations from the “ideal” frequencies expected to be present in a perspectively projected image of a homogeneous textured patch due to perturbations by other scene elements.
For computational stability, the pixel coordinates are also normalized such that the top-left of the patch is given by \((-1,-1)\) and the bottom-right by \((1,1)\).
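The normalization above can be sketched as follows. This is only an illustrative sketch: the function name `normalize_coords`, the zero-based pixel indexing, and the choice to map the first and last pixel indices exactly onto \(-1\) and \(1\) are assumptions, not details given in the text.

```python
def normalize_coords(col, row, width, height):
    """Map zero-based pixel indices of a patch to the square [-1, 1] x [-1, 1].

    Assumption (not stated in the text): the first pixel (0, 0) maps to
    (-1, -1) and the last pixel (width - 1, height - 1) maps to (1, 1).
    """
    if width < 2 or height < 2:
        raise ValueError("patch must be at least 2 x 2 pixels")
    x = 2.0 * col / (width - 1) - 1.0
    y = 2.0 * row / (height - 1) - 1.0
    return x, y


# Corners and centre of a hypothetical 65 x 65 patch.
top_left = normalize_coords(0, 0, 65, 65)
bottom_right = normalize_coords(64, 64, 65, 65)
centre = normalize_coords(32, 32, 65, 65)
```

Keeping all coordinates in a unit-scale range is a common way to avoid mixing very large and very small magnitudes in the model-fitting step.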
The Fourier spectrum (magnitude of the Fourier transform) of a given texture is known to be invariant to an affine transform upon normalization by its \(l_1\)-norm (Zhang and Tan 2003). Our scenario, however, concerns the frequency plane coordinates (i.e., the frequency itself), having undergone said unknown transform.
For rectification experiments on cropped homogeneous texture (Sect. 7.1), a strict error tolerance of \(10^{-3}\) is used. Since RANSAC is run for a large number of iterations (50), and because multiple anisotropically scaled representations are used, the algorithm usually converges for most (if not all) of them. For experiments on detection (Sect. 7.2), however, the tolerance is relaxed to \(10^{-2}\), and the number of remaining iterations is adapted continuously based on the current proportion of outliers to speed up convergence (Fischler and Bolles 1981). While more iterations would certainly improve performance, we choose to make this trade-off since we evaluate a large number of overlapping patches.
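The text does not spell out the adaptation rule, so the following is a sketch of the commonly used stopping criterion for RANSAC, in which the iteration budget is recomputed from the current inlier proportion. The function name `required_iterations`, the confidence level of 0.99, and the minimal sample size of 4 used in the example are assumptions; only the cap of 50 iterations is taken from the text.

```python
import math


def required_iterations(inlier_ratio, sample_size, confidence=0.99, max_iters=50):
    """Number of RANSAC iterations needed to draw, with probability
    `confidence`, at least one sample made up of inliers only.

    Uses N = log(1 - confidence) / log(1 - inlier_ratio ** sample_size),
    clamped to the range [1, max_iters].
    """
    if inlier_ratio <= 0.0:
        return max_iters  # no inliers seen yet: keep the full budget
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers >= 1.0:
        return 1  # every sample is outlier-free
    n = math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers)
    return min(max_iters, max(1, math.ceil(n)))


# With 90% inliers and a hypothetical 4-point minimal sample, few iterations suffice;
# with 50% inliers the budget is clipped by the cap.
few = required_iterations(0.9, 4)
capped = required_iterations(0.5, 4)
```

In an adaptive loop, this value would be recomputed whenever a model with more inliers is found, and iteration stops once the count already performed reaches it.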
Since our detector is not “trained” to produce an exact bounding box (as we need multiple detections to cover a perspectively projected textured region whose boundaries are thus not aligned to the image axes, and we also allow multi-scale detections), our definitions of these measures differ slightly from those used in object detection (Everingham et al. 2014). Object detection methodology counts every detection beyond the first for a given ground truth as an FP, whereas all such detections are counted as TPs in our scenario.
The evaluation presented herein can be considered as that for both detection and geometric class assignment (presented in Sect. 8). This is because it is really the assignment (via the proposed approach) of a geometric class to a given proposal that goes on to determine the detector’s precision and recall.
In principle, it is also possible to classify a given detection as frontal: if the vanishing line lies ‘far’ from the patch (based on some pre-defined threshold), the patch may be classified as a frontal surface exhibiting no or minute perspective distortion. Note that the slope is uninformative in this case. In practice, however, this was observed to cause planes that would otherwise be assigned to the vertical (walls) or horizontal (ceiling/floor) classes to be mis-classified as frontal, due to the ill-conditioned nature of the slope in such cases. This increases false positives and decreases true positives. Moreover, since fronto-parallel planes rarely appear in this dataset, we choose, for simplicity, not to model the fronto-parallel class.
Recall that for a line in the general form \(ax + by + c = 0\), the slope and y-intercept are given as \(-a/b\) and \(-c/b\), respectively. Thus, we have the slope of the vanishing line as \(-h_7/h_8\) and the intercept as \(-1/h_8\).
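As a small worked sketch of the relation above, the vanishing line can be read as \(h_7 x + h_8 y + 1 = 0\), i.e., \(a = h_7\), \(b = h_8\), \(c = 1\). The function name `vanishing_line_slope_intercept`, the tolerance `eps`, and the example values of \(h_7\) and \(h_8\) are hypothetical and chosen only for illustration.

```python
def vanishing_line_slope_intercept(h7, h8, eps=1e-12):
    """Slope and y-intercept of the vanishing line h7*x + h8*y + 1 = 0.

    Follows slope = -a/b and intercept = -c/b with a = h7, b = h8, c = 1.
    A (near-)zero h8 means a vertical line, for which neither quantity
    is defined, so an error is raised instead of dividing by zero.
    """
    if abs(h8) < eps:
        raise ValueError("vertical vanishing line: slope and intercept undefined")
    return -h7 / h8, -1.0 / h8


# Hypothetical projective parameters, for illustration only.
slope, intercept = vanishing_line_slope_intercept(0.5, 2.0)

# Any point (x, slope * x + intercept) satisfies the line equation.
x = 2.0
y = slope * x + intercept
residual = 0.5 * x + 2.0 * y + 1.0
```

The guard on `h8` reflects the ill-conditioning that arises as the line approaches the vertical, where the slope grows without bound.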
Indeed, we found original SIFT to yield a lower performance of 59.14%, as opposed to 60.93% by RootSIFT, on MIT Indoor67. Incidentally, a denser grid spacing of 4 pixels is used in our experiments for SIFT feature extraction in order to be comparable with (Juneja et al. 2013) (60.77%), even though this is computationally expensive for the rectified representation; the other three descriptors use a spacing of 8 pixels.
As an aside, an accuracy of 68.57% obtained by CNN image description is very impressive, especially since the dimensionality is merely 4096 and a linear kernel SVM is used. By contrast, a Fisher encoded descriptor (Table 6a) is 204,800-dimensional, and also needs a non-linear kernel (Hellinger mapping) to achieve an accuracy that is still significantly lower than CNN! Clearly, CNNs are able to produce a very low-dimensional, highly discriminative, invariant and powerful representation for a given image. Comparing with previous works employing off-the-shelf ConvNet models to compute a single image-level descriptor and linear SVM classification, the performance obtained here is slightly higher than by (Cimpoi et al. 2015) (FC-CNN in Table 5(a), 67.6% using the VGG-M pre-trained model), and slightly lower than by (Razavian et al. 2014) (CNNaug-SVM, 69% using the OverFeat model, but with additional augmented training images).
In order to ascertain whether the improvement is indeed due to the inclusion of rectified features in the image representation, and not simply due to the incorporation of multiscale features, we performed an experiment wherein an image representation is constructed from element-wise max-pooled CNN features extracted from the same patches as detected for the rectified CNN representation (recall that the patches are detected at multiple scales, as described in Sect. 7.2). We call this non-rectified representation \(\hbox {CNN}^{\prime }\). A classification performance of 60.37% is obtained, which is similar to that obtained by CNN_Rect(max) (Table 6b). When combining \(\hbox {CNN}^{\prime }\) with CNN, a performance of 72.16% is obtained, similar to CNN \(+\) CNN_Rect(max). Finally, when combining \(\hbox {CNN}^{\prime }\) with both CNN and CNN_Rect(max), a performance of 75.08% is observed. From this experiment, we conclude that explicit rectification does introduce additional and complementary CNN features that are not otherwise present in a non-rectified representation. Therefore, while augmenting data with multiscale information (Cimpoi et al. 2015), or with rotated and cropped examples (Razavian et al. 2014), is known to yield improved deep image representations, our work demonstrates that rectification can help as well.
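The element-wise max-pooling used to build such a patch-based representation can be sketched as below. This is a minimal illustration only: the function name `max_pool_features` and the toy 3-dimensional vectors are hypothetical (the descriptors in the text are 4096-dimensional), and no feature extraction is shown.

```python
def max_pool_features(patch_features):
    """Element-wise maximum over a list of equal-length feature vectors.

    Each inner list stands for the descriptor of one detected patch; the
    result is a single image-level vector of the same dimensionality.
    """
    if not patch_features:
        raise ValueError("at least one patch descriptor is required")
    dim = len(patch_features[0])
    if any(len(f) != dim for f in patch_features):
        raise ValueError("all patch descriptors must have the same length")
    # zip(*...) groups the values of each dimension across all patches.
    return [max(values) for values in zip(*patch_features)]


# Toy descriptors for three hypothetical patches.
pooled = max_pool_features([
    [0.1, 0.9, 0.0],
    [0.4, 0.2, 0.3],
    [0.0, 0.5, 0.1],
])
```

Because the pooled vector keeps the dimensionality of a single descriptor, the representation size does not depend on how many patches are detected in an image.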
