
Towards Reversal-Invariant Image Representation

International Journal of Computer Vision

Abstract

State-of-the-art image classification approaches are mainly based on robust image representations, such as the bag-of-features (BoF) model or the convolutional neural network (CNN) architecture. In real applications, the orientation (left/right) of an image or an object may vary from sample to sample, whereas some handcrafted descriptors (e.g., SIFT) and network operations (e.g., convolution) are not reversal-invariant, leading to unsatisfactory stability of the image features extracted by these models. A popular way to deal with this issue is to augment the dataset by adding a left-right reversed copy of each image. This strategy improves recognition accuracy to some extent, but at the price of nearly doubled time and memory consumption in both the training and testing stages. In this paper, we present an alternative solution based on designing reversal-invariant representations of local patterns, so that we obtain identical representations for an image and its left-right reversed copy. For the BoF model, we design a reversal-invariant version of the SIFT descriptor named Max-SIFT, and a generalized RIDE algorithm which can be applied to a large family of local descriptors. For the CNN architecture, we present a simple idea for generating reversal-invariant deep features (RI-Deep) and, inspired by it, design reversal-invariant convolution (RI-Conv) layers to increase the CNN capacity without increasing the model complexity. Experiments reveal consistent accuracy gains on various image classification tasks, including scene understanding, fine-grained object recognition, and large-scale visual recognition.

Notes

  1. https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val

References

  • Alahi, A., Ortiz, R., & Vandergheynst, P. (2012). FREAK: Fast Retina Keypoint. In IEEE conference on computer vision and pattern recognition.

  • Angelova, A., & Zhu, S. (2013). Efficient object detection and segmentation for fine-grained recognition. In IEEE conference on computer vision and pattern recognition.

  • Arandjelovic, R., & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In IEEE conference on computer vision and pattern recognition.

  • Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3), 346–359.

  • Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.

  • Berg, T., & Belhumeur, P. (2013). POOF: Part-based one-vs-one features for fine-grained categorization, face verification, and attribute estimation. In IEEE conference on computer vision and pattern recognition.

  • Bosch, A., Zisserman, A., & Munoz, X. (2006). Scene classification via pLSA. In European conference on computer vision.

  • Calonder, M., Lepetit, V., Strecha, C., & Fua, P. (2010). BRIEF: Binary robust independent elementary features. In European conference on computer vision.

  • Chai, Y., Lempitsky, V., & Zisserman, A. (2013). Symbiotic segmentation and part localization for fine-grained categorization. In International conference on computer vision.

  • Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of recent feature encoding methods. In British machine vision conference.

  • Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. In British machine vision conference.

  • Clinchant, S., Csurka, G., Perronnin, F., & Renders, J. (2007). XRCE's participation to ImagEval. In ImagEval workshop at CVIR.

  • Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. Workshop on statistical learning in computer vision, European conference on computer vision.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition.

  • Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.

  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning.

  • Fan, R., Chang, K., Hsieh, C., Wang, X., & Lin, C. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

  • Feng, J., Ni, B., Tian, Q., & Yan, S. (2011). Geometric Lp-norm feature pooling for image classification. In IEEE conference on computer vision and pattern recognition.

  • Gavves, E., Fernando, B., Snoek, C., Smeulders, A., & Tuytelaars, T. (2014). Local alignments for fine-grained categorization. International Journal of Computer Vision, 111, 191–212.

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition.

  • Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. In International conference on machine learning.

  • Grauman, K., & Darrell, T. (2005). The pyramid match kernel: Discriminative classification with sets of image features. In International conference on computer vision.

  • Griffin, G. (2007). Caltech-256 object category dataset. Technical Report: CNS-TR-2007-001.

  • Guo, X., & Cao, X. (2010). FIND: A neat flip invariant descriptor. In International conference on pattern recognition.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). CAFFE: Convolutional architecture for fast feature embedding. In ACM international conference on multimedia.

  • Juneja, M., Vedaldi, A., Jawahar, C., & Zisserman, A. (2013). Blocks that shout: Distinctive parts for scene classification. In IEEE conference on computer vision and pattern recognition.

  • Kobayashi, T. (2014). Dirichlet-based histogram feature transform for image classification. In IEEE conference on computer vision and pattern recognition.

  • Krause, J., Jin, H., Yang, J., & Fei-Fei, L. (2015). Fine-grained recognition without part annotations. In IEEE conference on computer vision and pattern recognition.

  • Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto, (Vol. 1(4), p. 7).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems.

  • Lapin, M., Schiele, B., & Hein, M. (2014). Scalable multitask representation learning for scene classification. In IEEE conference on computer vision and pattern recognition.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE conference on computer vision and pattern recognition.

  • LeCun, Y., Denker, J., Henderson, D., Howard, R., Hubbard, W., & Jackel, L. (1990). Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems.

  • Lee, C., Gallagher, P., & Tu, Z. (2016). Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In International conference on artificial intelligence and statistics.

  • Lee, C., Xie, S., Gallagher, P., Zhang, Z., & Tu, Z. (2015). Deeply-supervised nets. In International conference on artificial intelligence and statistics.

  • Leutenegger, S., Chli, M., & Siegwart, R. (2011). BRISK: Binary robust invariant scalable keypoints. In International conference on computer vision.

  • Li, L., Guo, Y., Xie, L., Kong, X., & Tian, Q. (2015). Fine-grained visual categorization with fine-tuned segmentation. In International conference on image processing.

  • Liang, M., & Hu, X. (2015). Recurrent convolutional neural network for object recognition. In IEEE conference on computer vision and pattern recognition.

  • Lin, M., Chen, Q., & Yan, S. (2014a) Network in network. In International conference on learning representations.

  • Lin, D., Lu, C., Liao, R., & Jia, J. (2014b) Learning important spatial pooling regions for scene classification. In IEEE conference on computer vision and pattern recognition.

  • Lin, D., Shen, X., Lu, C., & Jia, J. (2015). Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In IEEE conference on computer vision and pattern recognition.

  • Liu, K., Skibbe, H., Schmidt, T., Blein, T., Palme, K., Brox, T., et al. (2014). Rotation-invariant HOG descriptors using fourier analysis in polar and spherical coordinates. International Journal of Computer Vision, 106(3), 342–364.

  • Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Ma, R., Chen, J., & Su, Z. (2010). MI-SIFT: Mirror and inversion invariant generalization for SIFT descriptor. In Conference on image and video retrieval.

  • Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. Technical report.

  • Matas, J., Chum, O., Urban, M., & Pajdla, T. (2004). Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10), 761–767.

  • Murray, N., & Perronnin, F. (2014). Generalized max pooling. In IEEE conference on computer vision and pattern recognition.

  • Nagadomi. (2014). The Kaggle CIFAR10 network. https://github.com/nagadomi/kaggle-cifar10-torch7/

  • Nilsback, M., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In International conference on computer vision, graphics & image processing.

  • Parkhi, O., Vedaldi, A., Zisserman, A., & Jawahar, C. (2012). Cats and dogs. In IEEE conference on computer vision and pattern recognition.

  • Paulin, M., Revaud, J., Harchaoui, Z., Perronnin, F., & Schmid, C. (2014). Transformation pursuit for image classification. In IEEE conference on computer vision and pattern recognition.

  • Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher Kernel for large-scale image classification. In European conference on computer vision.

  • Pu, J., Jiang, Y., Wang, J., & Xue, X. (2014). Which looks like which: Exploring inter-class relationships in fine-grained visual categorization. In European conference on computer vision.

  • Qian, Q., Jin, R., Zhu, S., & Lin, Y. (2015). Fine-grained visual categorization via multi-stage metric learning. In IEEE conference on computer vision and pattern recognition.

  • Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In IEEE conference on computer vision and pattern recognition.

  • Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In IEEE conference on computer vision and pattern recognition.

  • Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. In International conference on computer vision.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 1, 1–42.

  • Sanchez, J., Perronnin, F., Mensink, T., & Verbeek, J. (2013). Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3), 222–245.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.

  • Skelly, L., & Sclaroff, S. (2007). Improved feature descriptors for 3D surface matching. In Proceedings of SPIE: 6762, two- and three-dimensional methods for inspection and metrology V.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition.

  • Takacs, G., Chandrasekhar, V., Tsai, S., Chen, D., Grzeszczuk, R., & Girod, B. (2013). Fast computation of rotation-invariant image features by an approximate radial gradient transform. IEEE Transactions on Image Processing, 22(8), 2970–2982.

  • Torralba, A., Fergus, R., & Freeman, W. (2008). 80 Million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.

  • Tuytelaars, T. (2010). Dense interest points. In IEEE conference on computer vision and pattern recognition.

  • van de Sande, K., Gevers, T., & Snoek, C. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1582–1596.

  • Vedaldi, A., & Fulkerson, B. (2010). VLFeat: An open and portable library of computer vision algorithms. In ACM international conference on multimedia.

  • Vedaldi, A., & Lenc, K. (2014). MatConvNet: Convolutional neural networks for MATLAB. arXiv:1412.4564.

  • Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical report: CNS-TR-2011-001.

  • Wang, Z., Fan, B., & Wu, F. (2011). Local intensity order pattern for feature description. In International conference on computer vision.

  • Wang, Z., Feng, J., & Yan, S. (2014). Collaborative linear coding for robust image classification. International Journal of Computer Vision, 1, 1–12.

  • Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In IEEE conference on computer vision and pattern recognition.

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE conference on computer vision and pattern recognition.

  • Xie, L., Hong, R., Zhang, B., & Tian, Q. (2015a). Image classification and retrieval are ONE. In International conference on multimedia retrieval.

  • Xie, L., Tian, Q., Hong, R., Yan, S., & Zhang, B. (2013). Hierarchical part matching for fine-grained visual categorization. In International conference on computer vision.

  • Xie, L., Tian, Q., Wang, M., & Zhang, B. (2014a). Spatial pooling of heterogeneous features for image classification. IEEE Transactions on Image Processing, 23(5), 1994–2008.

  • Xie, L., Tian, Q., & Zhang, B. (2015b). Image classification with Max-SIFT descriptors. In International conference on acoustics, speech and signal processing.

  • Xie, L., Tian, Q., & Zhang, B. (2015c). Simple techniques make sense: Feature pooling and normalization for image classification. IEEE Transactions on Circuits and Systems for Video Technology.

  • Xie, L., Wang, J., Guo, B., Zhang, B., & Tian, Q. (2014b). Orientational pyramid matching for recognizing indoor scenes. In IEEE conference on computer vision and pattern recognition.

  • Xie, L., Wang, J., Lin, W., Zhang, B., & Tian, Q. (2015d). RIDE: Reversal invariant descriptor enhancement. In International conference on computer vision.

  • Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In IEEE conference on computer vision and pattern recognition.

  • Yang, Y., & Newsam, S. (2010). Bag-of-visual-words and spatial extensions for land-use classification. In International conference on advances in geographic information systems.

  • Zeiler, M., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision.

  • Zhang, N., Donahue, J., Girshick, R., & Darrell, T. (2014a). Part-based R-CNNs for fine-grained category detection. In European conference on computer vision.

  • Zhang, N., Farrell, R., Iandola, F., & Darrell, T. (2013). Deformable part descriptors for fine-grained recognition and attribute prediction. In International conference on computer vision.

  • Zhang, S., Tian, Q., Hua, G., Huang, Q., & Li, S. (2009). Descriptive visual words and visual phrases for image applications. In ACM international conference on multimedia.

  • Zhang, X., Xiong, H., Zhou, W., Lin, W., & Tian, Q. (2016). Picking deep filter responses for fine-grained image recognition. In IEEE conference on computer vision and pattern recognition.

  • Zhang, X., Xiong, H., Zhou, W., & Tian, Q. (2014b). Fused one-vs-all mid-level features for fine-grained visual categorization. In ACM international conference on multimedia.

  • Zhao, W., & Ngo, C. (2013). Flip-invariant SIFT for copy and object detection. IEEE Transactions on Image Processing, 22(3), 980–991.

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Advances in neural information processing systems.

  • Zhou, X., Yu, K., Zhang, T., & Huang, T. (2010). Image classification using super-vector coding of local image descriptors. In European conference on computer vision.

Acknowledgements

This work was done when Lingxi Xie was an intern at Microsoft Research. This work is supported by the 973 Program of China 2013CB329403 and 2012CB316301, NSFC 61332007, 61273023, 61429201, 61471235, Tsinghua ISRP 20121088071, ARO Grants W911NF-15-1-0290 and W911NF-12-1-0057, and Faculty Research Awards, NEC Lab of America.

Author information

Corresponding authors

Correspondence to Jingdong Wang or Qi Tian.

Additional information

Communicated by Takayuki Okatani.

Appendices

Appendix 1: Orientation Estimation of Dense SIFT

In this section, we derive an approximate estimation of the SIFT orientation based on its local gradient values. The approximation is used in Sect. 4.2.1 of the main article.

1.1 The Implementation of SIFT

The implementation of SIFT is based on the original paper (Lowe 2004). In this subsection, we briefly review the processes of orientation assignment and descriptor representation; part of the description follows Lowe (2004).

First, let us assume that the descriptor scale has already been assigned, which fits the case of dense sampling (Bosch et al. 2006), where all descriptors share the same fixed window size. Denote an image as \({\mathbf {I}}={\left[ a\left( x,y\right) \right] _{W\times H}}\). The gradient magnitude, \(m\left( x,y\right) \), and orientation, \(\theta \left( x,y\right) \), are pre-computed for each pixel:

$$\begin{aligned} \left\{ \begin{array}{rcl} {m\left( x,y\right) } &{} = &{} {\left[ \varDelta _x\left( x,y\right) ^2+\varDelta _y\left( x,y\right) ^2\right] ^{1/2}} \\ {\theta \left( x,y\right) } &{} = &{} {\arctan \left[ \varDelta _y\left( x,y\right) /\varDelta _x\left( x,y\right) \right] } \end{array}\right. \quad , \end{aligned}$$
(8)

in which \(\varDelta _x\left( x,y\right) \) and \(\varDelta _y\left( x,y\right) \) are defined as:

$$\begin{aligned} \left\{ \begin{array}{rcl} {\varDelta _x\left( x,y\right) } &{} = &{} {a\left( x+1,y\right) -a\left( x-1,y\right) } \\ {\varDelta _y\left( x,y\right) } &{} = &{} {a\left( x,y+1\right) -a\left( x,y-1\right) } \end{array}\right. \quad . \end{aligned}$$
(9)

The magnitude and orientation of each pixel are then used to estimate the dominant orientation of that descriptor. An orientation histogram is constructed from the gradient orientations of the pixels within a region around the keypoint. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a smoothing parameter \(\sigma \) that is 1.5 times the scale of the keypoint. Peaks in the orientation histogram correspond to dominant orientations of the local gradients. The highest peak in the histogram is detected, and any other local peak that is within \(80\%\) of the highest peak is also used to create a keypoint with that orientation. Therefore, for locations with multiple peaks of similar magnitude, multiple keypoints are created at the same location and scale but with different orientations.
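To make this procedure concrete, the sketch below (Python with NumPy) builds the Gaussian-weighted orientation histogram and returns every qualifying peak. It is only an illustrative approximation: the 36-bin histogram size follows Lowe (2004), and refinements such as parabolic peak interpolation are omitted.

import numpy as np

def dominant_orientations(m, theta, sigma, num_bins=36, peak_ratio=0.8):
    # m, theta: per-pixel gradient magnitudes and orientations of a square region.
    h, w = m.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Gaussian-weighted circular window centred on the keypoint.
    weight = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * sigma ** 2))
    hist, _ = np.histogram(theta % (2 * np.pi), bins=num_bins,
                           range=(0.0, 2 * np.pi), weights=m * weight)
    # Keep every local peak within 80% of the highest peak.
    centres = (np.arange(num_bins) + 0.5) * 2 * np.pi / num_bins
    local_max = (hist >= np.roll(hist, 1)) & (hist >= np.roll(hist, -1))
    return centres[local_max & (hist >= peak_ratio * hist.max())]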

The above method works well for image matching and retrieval (Lowe 2004), but in classification tasks we do not need to assign multiple orientations to a descriptor. As an alternative, we can estimate a unique accumulated orientation using the following method. Every gradient magnitude is decomposed along both the x and y axes, i.e.,

$$\begin{aligned} \left\{ \begin{array}{rcl} {m_x\left( x,y\right) } &{} = &{} {m\left( x,y\right) \times \cos \theta \left( x,y\right) } \\ {m_y\left( x,y\right) } &{} = &{} {m\left( x,y\right) \times \sin \theta \left( x,y\right) } \end{array}\right. \quad , \end{aligned}$$
(10)

and all the decomposed components are accumulated along the x and y axes, respectively:

$$\begin{aligned} \left\{ \begin{array}{rcl} {G_x} &{} = &{} {{\sum _{x,y}}m_x\left( x,y\right) } \\ {G_y} &{} = &{} {{\sum _{x,y}}m_y\left( x,y\right) } \end{array}\right. \quad . \end{aligned}$$
(11)

Finally we get a 2-D vector \({\mathbf {G}}={\left( G_x,G_y\right) ^\top }\) indicating the orientation of that descriptor.
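The accumulated orientation is easy to compute in practice. Below is a minimal Python/NumPy sketch of Eqs. (8)–(11) for a single grayscale patch; the finite-difference and axis conventions are assumptions made for illustration.

import numpy as np

def accumulated_orientation(patch):
    a = np.asarray(patch, dtype=float)    # grayscale intensities of one window
    # Eq. (9): finite differences on interior pixels (x along width, y along height).
    dx = a[1:-1, 2:] - a[1:-1, :-2]
    dy = a[2:, 1:-1] - a[:-2, 1:-1]
    # Eq. (8): per-pixel gradient magnitude and orientation.
    m = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    # Eqs. (10)-(11): decompose the magnitudes and accumulate along both axes.
    G_x = np.sum(m * np.cos(theta))       # equal to np.sum(dx)
    G_y = np.sum(m * np.sin(theta))       # equal to np.sum(dy)
    return G_x, G_y                       # G = (G_x, G_y)^T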

Of course, we can also follow the orientation assignment of the original SIFT implementation (Lowe 2004). In practice, we have implemented RIDE with both dominant and accumulated orientations and found that the latter is slightly better. Another reason why we prefer the accumulated orientation is that it is a continuous value in \(\left[ 0,2\pi \right) \), which makes it easier to design the RIDE-8 algorithm.

For descriptor representation, we inherit the \(m\left( x,y\right) \) and \(\theta \left( x,y\right) \) values of each pixel. The implementation of dense SIFT (Vedaldi and Fulkerson 2010) does not rotate the descriptor region. The region of a descriptor is partitioned into \(4\times 4\) grids, and an 8-bin orientation histogram is constructed in each grid. The central orientation value of the j-th bin is \({\theta _j}={j\pi /4}\), \({j}={0,1,\ldots ,7}\). The gradient magnitude of each pixel is then trilinearly quantized onto at most two bins: if the orientation of a pixel, \(\theta \left( x,y\right) \), falls between two neighboring bin orientations, say \({\theta _a}<{\theta \left( x,y\right) }<{\theta _b}\), then the coefficients assigned to the two bins are:

$$\begin{aligned} \left\{ \begin{array}{rcl} {m_a}={m\left( x,y\right) \times \displaystyle {\frac{\theta _b-\theta \left( x,y\right) }{\theta _b-\theta _a}}} \\ {m_b}={m\left( x,y\right) \times \displaystyle {\frac{\theta \left( x,y\right) -\theta _a}{\theta _b-\theta _a}}} \end{array}\right. \quad . \end{aligned}$$
(12)

An 8-dimensional orientation histogram is thereafter obtained in each of the \(4\times 4\) grids. Finally, the 128-dimensional descriptor is constructed by concatenating the histogram vectors from all \(4\times 4\) grids.
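The soft assignment of Eq. (12) can be written compactly as follows (a Python/NumPy sketch; the bin centres \({\theta _j}={j\pi /4}\) follow the description above, everything else is illustrative). Accumulating such per-pixel histograms inside each of the \(4\times 4\) grids and concatenating them yields the 128-dimensional layout described above.

import numpy as np

def soft_assign(m, theta, num_bins=8):
    # Quantize one gradient (m, theta) onto its two neighbouring orientation bins (Eq. 12).
    width = 2.0 * np.pi / num_bins         # pi / 4 for 8 bins
    t = (theta % (2.0 * np.pi)) / width
    a = int(np.floor(t)) % num_bins        # index of theta_a
    b = (a + 1) % num_bins                 # index of theta_b
    frac = t - np.floor(t)                 # (theta - theta_a) / (theta_b - theta_a)
    hist = np.zeros(num_bins)
    hist[a] = m * (1.0 - frac)             # m_a
    hist[b] = m * frac                     # m_b
    return hist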

1.2 Orientation Estimation

The main goal of this part is to prove the following theorem for orientation approximation:

Theorem

Given a densely sampled SIFT descriptor \({\mathbf {d}}={\left( d_k,\theta _k\right) _{k=1,2,\ldots ,128}}\), where \(d_k\) and \(\theta _k\) are the gradient value and the histogram orientation of the k-th dimension, respectively, its accumulated orientation \(\theta \) approximately satisfies:

$$\begin{aligned} {\tan \theta }={\frac{G_y}{G_x}}= {\frac{{{\sum _{x,y}}m_y\left( x,y\right) }}{{{\sum _{x,y}}m_x\left( x,y\right) }}}\approx {\frac{{\sum _k}d_k\sin \theta _k}{{\sum _k}d_k\cos \theta _k}}. \end{aligned}$$
(13)

For this, we only need to prove the following lemma:

Lemma

When a gradient value \(\left( m,\theta \right) \) with an arbitrary orientation is quantized as \(\left( m_a,\theta _a\right) \) and \(\left( m_b,\theta _b\right) \) \(({\theta _a}<{\theta }<{\theta _b})\) with the trilinear interpolation, i.e., using (12):

$$\begin{aligned} \left\{ \begin{array}{rcl} {m_a}={m\times \displaystyle {\frac{\theta _b-\theta }{\theta _b-\theta _a}}} \\ {m_b}={m\times \displaystyle {\frac{\theta -\theta _a}{\theta _b-\theta _a}}} \end{array}\right. \quad , \end{aligned}$$
(14)

the impacts on SIFT descriptor representation, before and after quantization, are approximately the same, i.e.,

$$\begin{aligned} \left\{ \begin{array}{rcl} {m\cos \theta } &{} \approx &{} {m_a\cos \theta _a+m_b\cos \theta _b} \\ {m\sin \theta } &{} \approx &{} {m_a\sin \theta _a+m_b\sin \theta _b} \end{array}\right. \quad . \end{aligned}$$
(15)
Table 11 Classification accuracy (\(\%\)) of different versions of RIDE on different versions of the Aircraft-100 dataset. The training/testing split is fixed in all cases

Proof

We only prove the first formula, since the proof of the second one is very similar.

Using (14) to substitute \(m_a\) and \(m_b\) in (15) yields:

$$\begin{aligned}&{m_a\cos \theta _a+m_b\cos \theta _b} \nonumber \\&\quad = {m\times \frac{\theta _b-\theta }{\theta _b-\theta _a}\times \cos \theta _a+ m\times \frac{\theta -\theta _a}{\theta _b-\theta _a}\times \cos \theta _b} \nonumber \\&\quad = {m\times \left( \frac{\theta _b-\theta }{\theta _b-\theta _a}\times \cos \theta _a+ \frac{\theta -\theta _a}{\theta _b-\theta _a}\times \cos \theta _b\right) }. \end{aligned}$$
(16)

Let us make the approximation that:

$$\begin{aligned} \left\{ \begin{array}{rcl} {\displaystyle {\frac{\theta _b-\theta }{\theta _b-\theta _a}}} &{} \approx &{} {\displaystyle {\frac{\sin \left( \theta _b-\theta \right) }{\sin \left( \theta _b-\theta _a\right) }}} \\ {\displaystyle {\frac{\theta -\theta _a}{\theta _b-\theta _a}}} &{} \approx &{} {\displaystyle {\frac{\sin \left( \theta -\theta _a\right) }{\sin \left( \theta _b-\theta _a\right) }}} \end{array}\right. \quad , \end{aligned}$$
(17)

thus (16) becomes:

$$\begin{aligned}&{m_a\cos \theta _a+m_b\cos \theta _b} \nonumber \\&\quad = {m\times \frac{\theta _b-\theta }{\theta _b-\theta _a}\times \cos \theta _a+ m\times \frac{\theta -\theta _a}{\theta _b-\theta _a}\times \cos \theta _b} \nonumber \\&\quad \approx {m\times \left[ \frac{\sin \left( \theta _b-\theta \right) }{\sin \left( \theta _b-\theta _a\right) }\times \cos \theta _a+ \frac{\sin \left( \theta -\theta _a\right) }{\sin \left( \theta _b-\theta _a\right) }\times \cos \theta _b\right] } \nonumber \\&\quad = {\frac{m\times \left[ \sin \left( \theta _b-\theta \right) \cos \theta _a+ \sin \left( \theta -\theta _a\right) \cos \theta _b\right] }{\sin \left( \theta _b-\theta _a\right) }} \nonumber \\&\quad = {\frac{m\times \left( \sin \theta _b\cos \theta \cos \theta _a-\cos \theta \sin \theta _a\cos \theta _b\right) }{\sin \left( \theta _b-\theta _a\right) }} \nonumber \\&\quad = {\frac{m\times \cos \theta \times \left( \sin \theta _b\cos \theta _a-\cos \theta _b\sin \theta _a\right) }{\sin \left( \theta _b-\theta _a\right) }} \nonumber \\&\quad = {m\cos \theta }, \end{aligned}$$
(18)

which finishes the proof. \(\square \)

We provide a discussion on the approximation (17). Given that \({\theta _b-\theta _a}={\pi /4}\), the maximum relative error of the approximation is less than \(11\%\). Let us define \({f\left( x\right) }={\frac{\sin x}{x}}\). Since \({{\lim _{x\rightarrow 0}}f\left( x\right) }={1}\) and \(f\left( x\right) \) is a monotonically decreasing function, large errors in (17) appear only when \(\theta _b-\theta \) or \(\theta -\theta _a\) is very small, in which case \(m_a\) or \(m_b\) is also very small, so the absolute estimation error is negligible. Therefore, we conclude that the approximation (17) is reasonable.
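As a numerical sanity check of the theorem, the following Python/NumPy sketch compares the exact accumulated orientation of Eq. (11) with the estimate obtained from 8-bin quantized gradients, i.e., the right-hand side of (13); the synthetic gradient distribution is made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic patch: 1000 gradients with a dominant direction around 0.8 rad.
m = rng.uniform(0.0, 1.0, size=1000)
theta = rng.vonmises(0.8, 2.0, size=1000) % (2.0 * np.pi)

# Exact accumulated orientation, Eq. (11).
G_exact = np.array([np.sum(m * np.cos(theta)), np.sum(m * np.sin(theta))])

# Quantize every gradient onto its two neighbouring 45-degree bins, Eq. (14),
# then re-estimate the orientation from the bin sums, right-hand side of Eq. (13).
width = np.pi / 4.0
centres = np.arange(8) * width                    # theta_j = j * pi / 4
lo = np.floor(theta / width).astype(int) % 8      # bin of theta_a
hi = (lo + 1) % 8                                 # bin of theta_b
frac = theta / width - np.floor(theta / width)
hist = np.zeros(8)
np.add.at(hist, lo, m * (1.0 - frac))             # m_a contributions
np.add.at(hist, hi, m * frac)                     # m_b contributions
G_binned = np.array([np.sum(hist * np.cos(centres)), np.sum(hist * np.sin(centres))])

print(np.degrees(np.arctan2(G_exact[1], G_exact[0])))    # exact orientation (degrees)
print(np.degrees(np.arctan2(G_binned[1], G_binned[0])))  # binned estimate, very close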

Appendix 2: Generalized RIDE Descriptors

In this section, we provide a detailed discussion of generalizing RIDE to deal with more types of reversal and rotation invariance. It is a supplementary explanation to Sect. 4.2.3 of the main article.

1.1 RIDE-2, RIDE-4 and RIDE-8

We start from an alternative description of the RIDE-2, RIDE-4 and RIDE-8 algorithms.

Recall that we have computed a 2-D global gradient vector \({\mathbf {G}}={\left( G_x,G_y\right) ^\top }\), in which \(G_x\) and \(G_y\) estimate the horizontal and vertical orientation of a descriptor, respectively. To enforce the constraint \({G_x}\geqslant {0}\) on a descriptor \(\mathbf {d}\), we generate its left-right reversed version, \(\mathbf {d}^\mathrm {R}\), and select whichever of \(\mathbf {d}\) and \(\mathbf {d}^\mathrm {R}\) satisfies \({G_x}\geqslant {0}\). Such a descriptor, denoted as \(r_2\left( \mathbf {d}\right) \), is left-right reversal-invariant. If \({G_x}={0}\) for \(\mathbf {d}\), both \(\mathbf {d}\) and \(\mathbf {d}^\mathrm {R}\) satisfy the condition; in such cases, we choose the one with the larger sequential lexicographic order.

If we also need to achieve upside-down reversal invariance, the value \(G_y\) should be constrained as well, i.e., \({G_y}\geqslant {0}\). We then consider four versions of a descriptor \(\mathbf {d}\), namely \(\mathbf {d}_0\), \(\mathbf {d}_1\), \(\mathbf {d}_2\) and \(\mathbf {d}_3\), in which \(\mathbf {d}_0\) is \(\mathbf {d}\) itself, \(\mathbf {d}_1\) is its left-right reversed version, \(\mathbf {d}_2\) its upside-down reversed version, and \(\mathbf {d}_3\) its left-right and upside-down reversed version. Obviously, at least one of them satisfies both \({G_x}\geqslant {0}\) and \({G_y}\geqslant {0}\). If more than one candidate satisfies the conditions, we choose the one with the largest sequential lexicographic order. Such a descriptor, denoted as \(r_4\left( \mathbf {d}\right) \), is invariant to both left-right and upside-down reversal.

The last type of variation comes from rotating the descriptor by \(90^\circ \). Adding the \(90^\circ \)-rotation option to the left-right and upside-down reversals yields up to 8 descriptor versions. We generate all these variants and select one of them by requiring \({G_x}\geqslant {G_y}\geqslant {0}\), i.e., \({G_x}\geqslant {0}\), \({G_y}\geqslant {0}\) and \({G_x}\geqslant {G_y}\). If more than one candidate satisfies the conditions, we choose the one with the largest sequential lexicographic order. Such a descriptor, denoted as \(r_8\left( \mathbf {d}\right) \), is invariant under all of these reversal and rotation operations.

We provide an intuitive explanation of the RIDE-2, RIDE-4 and RIDE-8 algorithms. Each reversal or rotation operation changes the orientation of a descriptor correspondingly. RIDE-2, in which \({G_x}\geqslant {0}\), restricts the orientation to a \(180^\circ \) range. This range is further shrunk to \(90^\circ \) in RIDE-4, and to \(45^\circ \) in RIDE-8. A descriptor with any orientation can be aligned into the range by one or a few reversal or rotation operations, and in this way we cancel out the reversal and rotation operations and achieve the desired invariance.
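To make the selection rule concrete, a schematic Python sketch of RIDE-2 follows. Here reverse_fn (the dimension permutation producing \(\mathbf {d}^\mathrm {R}\)) and gx_fn (the accumulated \(G_x\) of Eq. (11)) are hypothetical placeholders for the corresponding steps described above; RIDE-4 and RIDE-8 follow the same pattern with four or eight candidates and the additional constraints on \(G_y\).

def ride2(d, reverse_fn, gx_fn):
    # Return the left-right reversal-invariant copy r_2(d) of descriptor d.
    d_r = reverse_fn(d)                 # left-right reversed descriptor d^R
    gx, gx_r = gx_fn(d), gx_fn(d_r)     # G_x changes sign under a left-right flip
    if gx > gx_r:                       # keep the candidate with G_x >= 0
        return d
    if gx_r > gx:
        return d_r
    # G_x = 0: both candidates qualify; break the tie by lexicographic order.
    return d if tuple(d) >= tuple(d_r) else d_r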

1.2 Experiments

We evaluate the original descriptors with RIDE-2, RIDE-4 and RIDE-8 on the Aircraft-100 dataset (Maji et al. 2013). We use four different versions of the dataset. The aligned version, denoted as Aircraft-100-1, is the one in which all the objects are manually aligned to the right. The other three versions, denoted as Aircraft-100-2, Aircraft-100-4 and Aircraft-100-8, are generated by randomly assigning one of 2, 4 or 8 image transformations to each image in the aligned dataset. Here, the 2 transformations are the identity and the left-right reversal, the 4 transformations add the option of upside-down reversal, and the 8 transformations further add the option of \(90^\circ \) rotation. The properties of Aircraft-100-2 are very similar to those of the original (unaligned) Aircraft-100 dataset.

The basic setting follows that of the main article (Sect. 4.6.1). We only use the SIFT descriptor, and do not use spatial pyramids in the following experiments. The classification results are summarized in Table 11. One can observe that on the Aircraft-100-1 dataset, the system with the original descriptors (ORIG) works best. After the original descriptors are processed by RIDE, classification accuracy drops dramatically. The underlying reason is that RIDE harms the descriptive power of the original descriptors by performing a one-of-many selection: the more candidates generated for selection, the heavier the accuracy drop.

However, in the case of Aircraft-100-2, RIDE-2 works better than ORIG. This implies that RIDE-2 captures left-right reversal invariance: although the descriptive power of SIFT is reduced, the benefit of reversal invariance outweighs the loss. When we use RIDE-4 and RIDE-8, however, the descriptive power continues to drop while the extra invariance is not needed on this dataset, resulting in an accuracy drop from RIDE-2 to both RIDE-4 and RIDE-8. Similar results are observed on the Aircraft-100-4 dataset, i.e., RIDE-4 is just enough to capture left-right and upside-down reversal. On the Aircraft-100-8 dataset, all types of reversal and rotation variation may be encountered, and therefore RIDE-8 produces the highest accuracy.

The above experiments verify that RIDE increases the robustness of descriptors but harms their descriptive power. According to Table 11, one type of reversal/rotation variation, if not captured, causes an accuracy drop of about \(10\%\), whereas performing RIDE to capture an unnecessary invariance causes a drop of about \(5\%\). Therefore, it is not wise to cover unnecessary types of invariance: the best strategy is to take only what we need.

Consequently, we do not use RIDE-4 or RIDE-8 in any of the experiments presented in the main article, since the evaluated datasets, whether for fine-grained object recognition or scene understanding, rarely contain upside-down reversed or \(90^\circ \)-rotated objects. RIDE-2 produces the best classification accuracy in such cases.

Cite this article

Xie, L., Wang, J., Lin, W. et al. Towards Reversal-Invariant Image Representation. Int J Comput Vis 123, 226–250 (2017). https://doi.org/10.1007/s11263-016-0970-x
