Better than SIFT?

Abstract

Independent evaluation of the performance of feature descriptors is an important part of developing better computer vision systems. In this paper, we compare the performance of several state-of-the-art image descriptors, including several recent binary descriptors. We test the descriptors on an image recognition application and a feature matching application. Our study includes several recently proposed methods and, despite claims to the contrary, we find that SIFT is still the most accurate performer in both application settings. We also find that general-purpose binary descriptors are not ideal for image recognition applications but perform adequately in a feature matching application.

Notes

  1. http://www.imageclef.org/.

  2. We use “paired” here rather loosely, since it would usually refer to only two samples, whereas we have a sample for each method. In this context we mean that each method was tested using the same query images and the same collection, so the results are directly comparable.

References

  1. Agrawal, M., Konolige, K., Blas, M.: CenSurE: center surround extremas for realtime feature detection and matching. In: International European Conference on Computer Vision, vol. 5305, pp. 102–115. Springer, Berlin (2008)

  2. Alahi, A., Ortiz, R., Vandergheynst, P.: FREAK: fast retina keypoint. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 510–517 (2012)

  3. Aly, M., Munich, M.E., Perona, P.: Bag of words for large scale object recognition—properties and benchmark. In: International Conference on Computer Vision Theory and Applications, pp. 299–306 (2011)

  4. Arandjelovic, R., Zisserman, A.: Smooth object retrieval using a bag of boundaries. In: IEEE International Conference on Computer Vision, pp. 375–382 (2011)

  5. Babenko, B., Dollar, P., Belongie, S.: Task specific local region matching. In: International Conference on Computer Vision (2007)

  6. Bauer, J., Sünderhauf, N., Protzel, P.: Comparing several implementations of two recently published feature detectors. In: International Conference on Intelligent and Autonomous Systems, pp. 1–6 (2007)

  7. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: SURF: speeded up robust features. Comput. Vis. Image Underst. 110(3), 346–359 (2008)

  8. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 509–522 (2002)

  9. Boureau, Y.l., Cun, Y.L., et al.: Sparse feature learning for deep belief networks. In: Advances in Neural Information Processing Systems, pp. 1185–1192 (2008)

  10. Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly, Cambridge (2008)

  11. Byrne, J.: NESTED Descriptor (2013). https://github.com/jebyrne

  12. Byrne, J., Shi, J.: Nested shape descriptors. In: IEEE International Conference on Computer Vision, pp. 1201–1208 (2013)

  13. Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: binary robust independent elementary features. In: International European conference on Computer Vision: Part IV, pp. 778–792 (2010)

  14. Chandrasekhar, V., Chen, D., Tsai, S., Cheung, N.M., Chen, H., Takacs, G., Reznik, Y., Vedantham, R., Grzeszczuk, R., Bach, J., Girod, B.: The Stanford mobile visual search dataset. In: ACM Multimedia Systems Conference, pp. 117–122 (2011)

  15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. IEEE Int. Conf. Comput. Vis. Pattern Recognit. 1, 886–893 (2005)

  16. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013)

  17. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. Pattern Anal. Mach. Intell. 13(9), 891–906 (1991)

  18. Gauglitz, S., Höllerer, T., Turk, M.: Evaluation of interest point detectors and feature descriptors for visual tracking. Int. J. Comput. Vis. 94(3), 335–360 (2011)

  19. Harris, C., Stephens, M.: A combined corner and edge detector. In: Fourth Alvey Vision Conference, pp. 147–151 (1988)

  20. Heinly, J., Dunn, E., Frahm, J.M.: Comparative evaluation of binary features. In: International European Conference on Computer Vision—Volume Part II, pp. 759–773. Springer, Berlin (2013)

  21. Jain, P., Kulis, B., Davis, J.V., Dhillon, I.S.: Metric and kernel learning using a linear transformation. J. Mach. Learn. Res. 13(1), 519–547 (2012)

  22. Juan, L., Gwun, O.: A comparison of SIFT, PCA-SIFT and SURF. Int. J. Image Process. 3(4), 143–152 (2009)

  23. Junior, O.L., Delgado, D., Gonçalves, V., Nunes, U.: Trainable classifier-fusion schemes: an application to pedestrian detection. In: Intelligent Transportation Systems, pp. 1–6 (2009)

  24. Ke, Y., Sukthankar, R.: PCA-SIFT: a more distinctive representation for local image descriptors. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 506–513 (2004)

  25. Khan, N.: Benchmark Dataset to Evaluate Image Retrieval Performance. http://www.cs.otago.ac.nz/students/postgrads/nabeel/ (2014)

  26. Khan, N., McCane, B.: Smartphone application for indoor scene localization. In: International ACM SIGACCESS Conference on Computers and Accessibility, pp. 201–202 (2012)

  27. Khan, N., McCane, B., Mills, S.: Feature set reduction for image matching in large scale environments. In: International Conference on Image and Vision Computing, pp. 68–72. ACM, New York (2012)

  28. Khan, N., McCane, B., Mills, S.: Vision based indoor scene localization via smart phone. In: International Conference of the NZ Chapter of the ACM’s Special Interest Group on Human–Computer Interaction, pp. 88 (2012)

  29. Khan, N., McCane, B., Wyvill, G.: SIFT and SURF performance evaluation against various image deformations on benchmark dataset. In: International Conference on Digital Image Computing Techniques and Applications, pp. 501–506 (2011)

  30. Khan, U.M., McCane, B., Trotman, A.: A feature compression scheme for large scale image retrieval systems. In: Proceedings of the 27th Conference on Image and Vision Computing New Zealand, pp. 492–496 (2012)

  31. Koenderink, J.J., van Doorn, A.J.: Representation of local geometry in the visual system. Biol. Cybern. 55(6), 367–375 (1987)

  32. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using affine-invariant regions. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-319. IEEE, New York (2003)

  33. Leutenegger, S., Chli, M., Siegwart, R.Y.: BRISK: binary robust invariant scalable keypoints. In: IEEE International Conference on Computer Vision, pp. 2548–2555 (2011)

  34. Lowe, D.G.: Distinctive image features from scale invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004)

  35. Mair, E., Hager, G.D., Burschka, D., Suppa, M., Hirzinger, G.: Adaptive and generic corner detection based on the accelerated segment test. In: Proceedings of the 11th European Conference on Computer Vision: Part II, pp. 183–196 (2010)

  36. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Computer Vision, pp. 128–142. Springer, Berlin (2002)

  37. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)

  38. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. Int. J. Comput. Vis. 65(1–2), 43–72 (2005)

  39. Miksik, O., Mikolajczyk, K.: Evaluation of local detectors and descriptors for fast feature matching. In: 21st International Conference on Pattern Recognition (ICPR), 2012, pp. 2681–2684. IEEE, New York (2012)

  40. Mokhtarian, F., Mohanna, F.: Performance evaluation of corner detectors using consistency and accuracy measures. Comput. Vis. Image Underst. 102(1), 81–94 (2006)

  41. Moreels, P., Perona, P.: Evaluation of features detectors and descriptors based on 3D objects. Int. J. Comput. Vis. 73(3), 263–284 (2007)

  42. Muja, M., Lowe, D.G.: Scalable nearest neighbour algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014)

  43. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. IEEE Int. Conf. Comput. Vis. Pattern Recognit. 2, 2161–2168 (2006)

  44. Oliva, A.: GIST Descriptor (2006). http://people.csail.mit.edu/torralba/code/spatialenvelope/

  45. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)

  46. Ozuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)

  47. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3498–3505 (2012)

  48. Philbin, J., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)

  49. Raguram, R., Chum, O., Pollefeys, M., Matas, J., Frahm, J.: USAC: a universal framework for random sample consensus. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 2022–2038 (2013)

  50. Ranzato, M., Huang, F.J., Boureau, Y.L., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR’07, pp. 1–8. IEEE, New York (2007)

  51. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In: International European Conference on Computer Vision, pp. 430–443 (2006)

  52. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An efficient alternative to SIFT or SURF. In: IEEE International Conference on Computer Vision, pp. 2564–2571 (2011)

  53. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge (2014)

  54. Schaffalitzky, F., Zisserman, A.: Multi-view matching for unordered image sets, or “how do i organize my holiday snaps?”. In: International European Conference on Computer Vision—Part I, pp. 414–431 (2002)

  55. Shi, J., Tomasi, C.: Good features to track. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 593–600. IEEE, New York (1994)

  56. Simonyan, K., Vedaldi, A., Zisserman, A.: Learning local feature descriptors using convex optimisation. IEEE Trans. Pattern Anal. Mach. Intell. 12, 25–70 (2014)

  57. Tola, E.: DAISY Descriptor (2009). http://cvlab.epfl.ch/software/daisy

  58. Tola, E., Lepetit, V., Fua, P.: DAISY: an efficient dense descriptor applied to wide baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010)

  59. Tombari, F., Franchi, A., Di Stefano, L.: BOLD Descriptor Code (2013). http://vision.deis.unibo.it/BOLD/

  60. Tombari, F., Franchi, A., Di Stefano, L.: BOLD features to detect texture-less objects. In: IEEE International Conference on Computer Vision, pp. 1265–1272 (2013)

  61. Trzcinski, T., Christoudias, M., Lepetit, V.: Learning image descriptors with boosting. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 597–610 (2015). doi:10.1109/TPAMI.2014.2343961

  62. Van Gool, L., Moons, T., Ungureanu, D.: Affine/photometric invariants for planar intensity patterns. In: Computer Vision, pp. 642–651. Springer, Berlin (1996)

  63. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. I–511 (2001)

  64. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bull. 1(6), 80–83 (1945)

Author information

Corresponding author

Correspondence to Nabeel Khan.

Appendices

Appendix A: Features

1.1 Scale invariant feature transform (SIFT)

The SIFT algorithm [34] has four stages:

  • Extrema detection This stage detects possible keypoints from the image and ensures invariance to scale.

  • Keypoint localisation This stage filters out unstable keypoints, i.e. points with low contrast and those that are poorly localised along an edge.

  • Orientation assignment The remaining stable keypoints are assigned one or more orientations based on local image gradient directions.

  • Keypoint descriptor The previous steps ensure invariance to image location, scale and rotation. This stage computes a descriptor vector that is partially invariant to illumination and viewpoint. First, a \(16\times 16\) window around each keypoint is divided into sixteen \(4\times 4\) sub-windows, and gradient magnitudes and orientations are computed for each. An 8-bin orientation histogram is then built for every sub-window, resulting in a vector of 128 elements (a sketch of this layout is given after the list). This descriptor vector is finally normalised to unit length to achieve invariance to illumination.
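
The descriptor layout above can be summarised in a few lines of NumPy. The sketch below is illustrative only, assuming a \(16\times 16\) gradient patch that has already been rotated to the keypoint orientation; it omits Lowe's Gaussian weighting, trilinear interpolation and the clamp-and-renormalise step used by the full algorithm.

```python
import numpy as np

def sift_like_descriptor(mag, ori):
    """Illustrative sketch of the SIFT descriptor layout (not Lowe's code).

    mag, ori: 16x16 arrays of gradient magnitudes and orientations
    (radians, in [0, 2*pi)) around a keypoint, already rotated to the
    keypoint orientation.
    """
    desc = []
    for by in range(4):                       # 4x4 grid of sub-windows
        for bx in range(4):
            cell_mag = mag[4 * by:4 * by + 4, 4 * bx:4 * bx + 4]
            cell_ori = ori[4 * by:4 * by + 4, 4 * bx:4 * bx + 4]
            # 8-bin orientation histogram, weighted by gradient magnitude
            hist, _ = np.histogram(cell_ori, bins=8, range=(0, 2 * np.pi),
                                   weights=cell_mag)
            desc.extend(hist)
    desc = np.asarray(desc, dtype=np.float32)      # 16 cells x 8 bins = 128
    return desc / (np.linalg.norm(desc) + 1e-7)    # unit length for illumination invariance
```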

1.2 Speeded up robust features (SURF)

The SURF algorithm [7] has two stages:

  • Keypoint detection This stage detects candidate keypoints using integral images [63]. The input image is converted into an integral image, which allows fast scale space construction through an efficient box filter representation. In contrast to SIFT, SURF keeps the image size the same and varies the filter size. The determinant of the Hessian matrix is then used to detect blob-like structures at different scales, which gives location and scale invariance.

  • Keypoint description The SURF descriptor is computed by constructing a square window centred on the keypoint and oriented along the dominant orientation estimated from Haar wavelet responses around it. This window is divided into \(4\times 4\) regular sub-regions and Haar wavelet responses are computed within each sub-region. Each sub-region contributes four values, resulting in a 64-dimensional descriptor which is then normalised to unit length (see the sketch after this list). The resulting SURF descriptor is invariant to rotation, scale and contrast, and partially invariant to other transformations.
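
A minimal sketch of this descriptor assembly follows, assuming the Haar wavelet responses for a \(20\times 20\) grid of samples in the rotated frame are already available; the sample grid size and the omission of SURF's Gaussian weighting are simplifications.

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Illustrative sketch of the SURF descriptor layout (not the original code).

    dx, dy: 20x20 arrays of Haar wavelet responses sampled in the rotated
    window around a keypoint.
    """
    desc = []
    for by in range(4):                    # 4x4 sub-regions of 5x5 samples
        for bx in range(4):
            rx = dx[5 * by:5 * by + 5, 5 * bx:5 * bx + 5]
            ry = dy[5 * by:5 * by + 5, 5 * bx:5 * bx + 5]
            # each sub-region contributes four sums
            desc.extend([rx.sum(), ry.sum(),
                         np.abs(rx).sum(), np.abs(ry).sum()])
    desc = np.asarray(desc, dtype=np.float32)      # 16 sub-regions x 4 = 64
    return desc / (np.linalg.norm(desc) + 1e-7)    # normalised to unit length
```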

1.3 DAISY

DAISY is inspired by SIFT and is mostly used for dense wide-baseline matching [58]. DAISY computes eight orientation maps \(G_o\), one for each quantised direction, where \(G_o (u,v)\) equals the magnitude of the image gradient at location \((u,v)\) for direction \(o\) if it is positive, and zero otherwise. Each orientation map is then convolved with Gaussian kernels of different sizes to obtain convolved orientation maps for different scales. At each pixel location, DAISY consists of a vector of values drawn from the convolved orientation maps at points on concentric circles around that location, where the amount of Gaussian smoothing is proportional to the radius of the circle. The vectors are normalised so that pixels near occlusions are represented as correctly as possible. The full DAISY descriptor for location \((u, v)\) is then defined as the concatenation of all these vectors into a single vector. Each DAISY descriptor has 200 elements.
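
For dense extraction, scikit-image ships a DAISY implementation whose default layout matches the description above: 8 orientations, a centre histogram plus 3 rings of 8 histograms, giving \((3\times 8+1)\times 8=200\) values per location. The snippet below is a usage example under that assumption, not the original code of [58]; the step size is an arbitrary choice.

```python
# Dense DAISY descriptors via scikit-image (assumes scikit-image is installed).
from skimage import data
from skimage.feature import daisy

img = data.camera()                            # any greyscale image
descs = daisy(img, step=8, radius=15, rings=3,
              histograms=8, orientations=8)    # (3*8 + 1)*8 = 200 per location
print(descs.shape)                             # (rows, cols, 200)
```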

1.4 BOLD

The BOLD descriptor was introduced recently to handle occlusions and untextured objects [60]. The BOLD algorithm first extracts line segments, which are then pruned to improve their repeatability; this removes short segments that are likely generated by noise. BOLD then builds geometrical primitives over the resulting pairs of neighbouring segments. These primitives ensure invariance to rotation, translation, scale and noise. For a segment pair, the authors experimented with various geometric primitives, such as relative segment lengths, the distance between segments, or absolute orientations. They chose the relative orientation between pairs of segments as the primitive because it was found to offer the best trade-off between robustness and descriptiveness.

Each line segment in a pair is assigned a canonical orientation based on its direction, followed by computation of the angle pair (\(\alpha , \beta \)) between the two segments, where \(\alpha \) and \(\beta \) measure the clockwise rotations that align one line segment with the other. Since each line segment can have many neighbouring segments, and hence many angle pairs, the BOLD descriptor for a segment is built by aggregating the (\(\alpha , \beta \)) primitives over its k nearest neighbouring segments. The authors use the distance between line segments to find nearest neighbours and report the best results with 10 nearest neighbours.
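
As an illustration of the (\(\alpha , \beta \)) primitive, the sketch below computes one angle pair for a segment pair under one plausible reading of the description above, with each angle measured against the line joining the segment midpoints; the exact construction, including the canonical orientation rule, is defined in [60].

```python
import numpy as np

def angle_pair(seg_i, seg_j):
    """Hypothetical sketch of a BOLD-style (alpha, beta) primitive.

    Each segment is ((x1, y1), (x2, y2)) with its canonical direction
    already chosen; see [60] for the exact definition.
    """
    p_i, q_i = np.asarray(seg_i, dtype=float)
    p_j, q_j = np.asarray(seg_j, dtype=float)
    d_i = q_i - p_i                              # direction of segment i
    d_j = q_j - p_j                              # direction of segment j
    t = (p_j + q_j) / 2.0 - (p_i + q_i) / 2.0    # line joining the midpoints

    def rot(u, v):
        # rotation taking u onto v, wrapped to [0, 2*pi)
        return (np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])) % (2 * np.pi)

    alpha = rot(d_i, t)    # rotation aligning segment i with the joining line
    beta = rot(d_j, -t)    # rotation aligning segment j with the joining line
    return alpha, beta

# A BOLD-style descriptor for a segment would aggregate such pairs over its
# k = 10 nearest neighbouring segments, e.g. into a 2-D angle histogram.
```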

1.5 Oriented FAST and rotated BRIEF (ORB)

The ORB detector finds FAST keypoints and ranks them using the Harris corner measure [52]. An image pyramid is used to detect features at different scales. A \(9 \times 9\) patch centred on the keypoint is used, and the patch orientation is computed from first-order moments. Calonder et al. [13] computed the BRIEF descriptor for a set of rotations and perspective warps of each patch; the ORB authors instead compute the descriptor with respect to the measured patch orientation, which is more efficient than the method used in BRIEF. A learning method is then used to de-correlate the feature descriptors under rotational invariance, leading to ORB features. ORB is claimed by its authors to be an order of magnitude faster than SURF, and over two orders of magnitude faster than SIFT, in feature detection and description on \(640\times 480\) images. In their experiments, the authors report that ORB features outperform SIFT and SURF in nearest neighbour matching over large image databases.
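
OpenCV ships an ORB implementation; the following usage sketch (the file names are placeholders, and the parameters shown are OpenCV's defaults rather than values from [52]) detects keypoints over an image pyramid and matches the binary descriptors with the Hamming distance.

```python
import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("database.png", cv2.IMREAD_GRAYSCALE)

# nlevels/scaleFactor control the image pyramid used for scale invariance
orb = cv2.ORB_create(nfeatures=500, scaleFactor=1.2, nlevels=8)
kp1, des1 = orb.detectAndCompute(img1, None)   # FAST keypoints + binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```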

1.6 Binary robust invariant scalable keypoints (BRISK)

The BRISK descriptor [33] uses the AGAST corner detector [35], an advanced version of the FAST detector; AGAST is faster to compute and provides performance similar to the FAST algorithm. A scale space pyramid is used to detect scale invariant keypoints. To describe the features, the authors use a symmetric pattern in which sample points are positioned on concentric circles surrounding the feature. Each sample point represents a Gaussian blurring of its surrounding pixels, and the standard deviation of this blurring increases with the distance from the centre of the feature.

Several long-distance sample point comparisons are used to determine orientation. For each long-distance comparison, the displacement vector between the sample points is stored and weighted by the relative difference in intensity. These weighted vectors are then averaged to find the dominant gradient direction of the patch. Finally, the sampling pattern is scaled and rotated, and the descriptor is built from 512 short-distance sample point comparisons, which capture the local gradients and shape within the patch. BRISK requires significantly more computation and slightly more storage space than BRIEF and ORB.
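
OpenCV also provides BRISK; the sketch below shows detection and description (the detection threshold and number of octaves are OpenCV's defaults, not values from [33]). The resulting descriptors are 512 bits (64 bytes) each and are matched with the Hamming distance, as for ORB.

```python
import cv2

img = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# thresh: AGAST detection threshold; octaves: depth of the scale-space pyramid
brisk = cv2.BRISK_create(thresh=30, octaves=3, patternScale=1.0)
kps, des = brisk.detectAndCompute(img, None)
print(des.shape)   # (number of keypoints, 64) -> 64 bytes = 512 bits each
```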

1.7 Fast retina keypoint (FREAK)

FREAK [2] is a descriptor only and relies on a robust feature detector for keypoint detection. The sampling pattern around the detected keypoint is circular, with a higher density of points near the centre. The sample points are smoothed with Gaussian kernels to reduce sensitivity to noise; the authors use a different kernel size for each sample point, which leads to overlapping receptive fields. The binary descriptor is then created by thresholding the difference between pairs of receptive fields, each smoothed with its corresponding Gaussian kernel. However, the receptive fields can form thousands of pairs, which would result in a very large descriptor, so a strategy is needed to select suitable pairs. To find relevant pairs, the authors constructed around 50,000 descriptors on a dataset whose details are not provided. From the computed descriptors, they determine which pairs have a mean value closest to 0.5; these are the pairs that discriminate between different images the most. Essentially, the pairs are sorted by their entropy. The authors report that the first 512 sorted pairs offer the best performance and should be used to construct the descriptor. The selected pairs yield a highly structured pattern that mimics the saccadic search of the human eye. The orientation of the keypoint is finally obtained from local gradients over the selected pairs.
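
Since FREAK is a descriptor only, it must be paired with a separate detector in practice. The sketch below uses OpenCV's contrib implementation (assumes the opencv-contrib-python build; BRISK is used as the detector purely as an example, and the file name is a placeholder).

```python
import cv2

img = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)

detector = cv2.BRISK_create()              # any keypoint detector will do
kps = detector.detect(img, None)

freak = cv2.xfeatures2d.FREAK_create()     # descriptor only, from opencv-contrib
kps, des = freak.compute(img, kps)         # 512-bit binary descriptors
```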

1.8 NESTED shape descriptors

NESTED [12] first extracts keypoints from images using the SIFT detector. A nested pooling structure is defined around each keypoint, which can be decomposed into a set of “Hawaiian earring” structures. A binary descriptor is then computed for each keypoint by nested pooling, logarithmic spiral normalisation and binarisation of oriented gradients over the nested support. The intuition behind this approach is to handle variations such as occlusion, viewpoint change and scale differences more effectively than the grid approach commonly used by other features. Another contribution of that paper is a new robust local distance function, called the nesting distance, which can be used to compare nested descriptors effectively and remove most outliers.

Appendix B: Nearest neighbour ratio

Brute force matching ensures the best image matching performance and is a good indicator of a feature’s robustness. However, matches are often ambiguous, so a filter is commonly applied to remove unreliable ones. A common approach is to use the ratio of the two nearest neighbours for this purpose. This method is efficient and often rejects wrong feature matches, resulting in better matching performance, but the best matching performance is not guaranteed.

In the nearest neighbour ratio test, we find the first nearest neighbour of a query feature among the features of a database image using the k-nearest neighbours algorithm; the nearest neighbour is the feature with the minimum Euclidean or Hamming distance to the query, and we call this distance \(d1\). We then find the second nearest neighbour of the same query feature and call its distance \(d2\). A match between the query feature and its first nearest neighbour is accepted only if the ratio of the two distances, \(d1/d2\), is less than a ratio threshold. Smaller ratio thresholds reject more feature matches [34].
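
A sketch of this test with OpenCV and SIFT descriptors follows; the file names and the 0.8 threshold are illustrative choices, not the values used in the experiment below.

```python
import cv2

sift = cv2.SIFT_create()
img_q = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img_d = cv2.imread("database.png", cv2.IMREAD_GRAYSCALE)
kq, dq = sift.detectAndCompute(img_q, None)
kd, dd = sift.detectAndCompute(img_d, None)

bf = cv2.BFMatcher(cv2.NORM_L2)      # Euclidean distance for SIFT descriptors
ratio = 0.8                          # smaller thresholds reject more matches [34]
good = []
for m, n in bf.knnMatch(dq, dd, k=2):     # m = first NN (d1), n = second NN (d2)
    if m.distance < ratio * n.distance:   # accept only if d1/d2 < ratio
        good.append(m)
```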

1.1 Experiment

For this experiment, we picked a small set of query images (100) and the corresponding database images from the non-indoor datasets to test several ratio thresholds; features that do poorly on a small dataset are unlikely to perform better on larger ones. We did not use the indoor datasets because they contain fewer than 100 query images. For comparison purposes, we report the precisions averaged over ranks 1–5 across each dataset for the top five performing features, as shown in Fig. 7. We did not include GIST features in this experiment because there is only one GIST feature vector per image, so the nearest neighbour ratio approach is not applicable.

Fig. 7 Average precision of features across non-indoor datasets and various nearest neighbour ratio thresholds

The results in Fig. 7 show that SIFT gives the best performance at all thresholds, followed by SIFT [SURF]. BRISK [SURF] performs best among the binary descriptors, followed by ORB and BRISK [SIFT], which do well at four and three ratio thresholds respectively. Surprisingly, SURF features, which performed well in the brute force matching experiments, performed poorly at all ratio thresholds. It seems that some descriptors are not suited to the ratio test, likely because of the distribution of distances seen in practice. NESTED and FREAK are other descriptors that perform poorly with the ratio test.

Cite this article

Khan, N., McCane, B. & Mills, S. Better than SIFT? Machine Vision and Applications 26, 819–836 (2015). https://doi.org/10.1007/s00138-015-0689-7
