Abstract
A major drawback of the k-nearest neighbors (k-NN) rule is its high variance when dealing with sparse prototype datasets in high dimensions. Most techniques proposed for improving k-NN classification rely either on deforming the k-NN relationship by learning a distance function or on modifying the input space by means of subspace selection. Here we propose a novel boosting approach for generalizing the k-NN rule. Namely, we redefine the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. Our algorithm, called UNN (Universal Nearest Neighbors), relies on the k-nearest neighbor examples as weak classifiers and learns their weights so as to minimize a surrogate risk. These weights, called leveraging coefficients, allow us to distinguish the most relevant prototypes for a given class. Results obtained on several scene categorization datasets demonstrate the ability of UNN to compete with or beat state-of-the-art methods, while achieving comparatively small training and testing times.
Notes
1. A surrogate is a function that is a suitable upper bound for another function (here, the non-convex, non-differentiable empirical risk).
2. The implementation by the authors is available at: http://people.csail.mit.edu/torralba/code/spatialenvelope/sceneRecognition.m.
3. The MAP was computed by averaging classification rates over categories (diagonal of the confusion matrix) and then averaging those values after repeating each experiment 10 times on different folds.
4. Code available at http://www.vlfeat.org/.
5. Code available at http://www.irisa.fr/texmex/people/jegou/src.php.
6. For AdaBoost, we used the code available at http://www.mathworks.com/matlabcentral/fileexchange/22997-multiclass-gentleadaboosting.
7. We recall Young's inequality: for any Hölder conjugates p, q (p>1, (1/p)+(1/q)=1) and any y, y′≥0, we have yy′ ≤ y^p/p + y′^q/q.
References
Amores J, Sebe N, Radeva P (2006) Boosting the distance estimation: application to the k-nearest neighbor classifier. Pattern Recognit Lett 27(3):201–209
Athitsos V, Alon J, Sclaroff S, Kollios G (2008) BoostMap: an embedding method for efficient nearest neighbor retrieval. IEEE Trans Pattern Anal Mach Intell 30(1):89–104
Bartlett P, Traskin M (2007) Adaboost is consistent. J Mach Learn Res 8:2347–2368
Bartlett P, Jordan M, McAuliffe JD (2006) Convexity, classification, and risk bounds. J Am Stat Assoc 101:138–156
Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771
Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Min Knowl Discov 6:153–172
Cucala L, Marin JM, Robert CP, Titterington DM (2009) A Bayesian reassessment of nearest-neighbor classification. J Am Stat Assoc 104(485):263–273
Dudani S (1976) The distance-weighted k-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6(4):325–327
Escolano Ruiz F, Suau Pérez P, Bonev BI (2009) Information theory in computer vision and pattern recognition. Springer, London
Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 524–531
Fukunaga K, Flick T (1984) An optimal global nearest neighbor metric. IEEE Trans Pattern Anal Mach Intell 6(3):314–318
García-Pedrajas N, Ortiz-Boyer D (2009) Boosting k-nearest neighbor classifier by means of input space projection. Expert Syst Appl 36(7):10570–10582
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proc international conference on very large databases, pp 518–529
Grauman K, Darrell T (2005) The pyramid match kernel: discriminative classification with sets of image features. In: IEEE international conference on computer vision (ICCV), pp 1458–1465
Gupta L, Pathangay V, Patra A, Dyana A, Das S (2007) Indoor versus outdoor scene classification using probabilistic neural network. EURASIP J Appl Signal Process 2007(1): 123
Bel Haj Ali W, Piro P, Crescence L, Giampaglia D, Ferhat O, Darcourt J, Pourcher T, Barlaud M (2012) Changes in the subcellular localization of a plasma membrane protein studied by bioinspired UNN learning classification of biologic cell images. In: International conference on computer vision theory and applications (VISAPP)
Hart PE (1968) The condensed nearest neighbor rule. IEEE Trans Inf Theory 14:515–516
Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616
Holmes CC, Adams NM (2003) Likelihood inference in nearest-neighbour classification models. Biometrika 90:99–112
Hsu CW, Chang CC, Lin CJ (2003) A practical guide to support vector classification. Technical report
Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
Kakade S, Shalev-Shwartz S, Tewari A (2009) Applications of strong convexity–strong smoothness duality to learning with matrices. Technical report
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 2169–2178
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Masip D, Vitrià J (2006) Boosted discriminant projections for nearest neighbor classification. Pattern Recognit 39(2):164–170
Nguyen X, Wainwright MJ, Jordan MI (2009) On surrogate loss functions and f-divergences. Ann Stat 37:876–904
Nock R, Nielsen F (2009) Bregman divergences and surrogates for learning. IEEE Trans Pattern Anal Mach Intell 31(11):2048–2059
Nock R, Nielsen F (2009) On the efficient minimization of classification calibrated surrogates. In: Advances in neural information processing systems (NIPS), vol 21, pp 1201–1208
Nock R, Sebban M (2001) An improved bound on the finite-sample risk of the nearest neighbor rule. Pattern Recognit Lett 22(3/4):407–412
Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
Paredes R (2006) Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Trans Pattern Anal Mach Intell 28(7):1100–1110
Payne A, Singh S (2005) Indoor vs. outdoor scene classification in digital photographs. Pattern Recognit 38(10):1533–1545
Piro P, Nock R, Nielsen F, Barlaud M (2012) Leveraging k-NN for generic classification boosting. Neurocomputing 80:3–9
Quattoni A, Torralba A (2009) Recognizing indoor scenes. In: IEEE computer society conference on computer vision and pattern recognition (CVPR)
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn J 37:297–336
Serrano N, Savakis AE, Luo JB (2004) Improved scene classification using efficient low-level features and semantic cues. Pattern Recognit 37:1773–1784
Shakhnarovich G, Darrell T, Indyk P (2006) Nearest-neighbor methods in learning and vision. MIT Press, Cambridge
Sivic J, Zisserman A (2003) Video google: a text retrieval approach to object matching in videos. In: IEEE international conference on computer vision (ICCV), vol 2, pp 1470–1477
Swain MJ, Ballard DH (1991) Color indexing. Int J Comput Vis 7:11–32
Torralba A, Murphy K, Freeman W, Rubin M (2003) Context-based vision system for place and object recognition. In: IEEE international conference on computer vision (ICCV), pp 273–280
Vedaldi A, Fulkerson B (2008) VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org
Vogel J, Schiele B (2007) Semantic modeling of natural scenes for content-based image retrieval. Int J Comput Vis 72(2):133–157
Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) SUN database: large-scale scene recognition from abbey to zoo. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 3485–3492
Yu K, Ji L, Zhang X (2002) Kernel nearest-neighbor algorithm. Neural Process Lett 15(2):147–156
Yuan M, Wegkamp M (2010) Classification methods with reject option based on convex risk minimization. J Mach Learn Res 11:111–130
Zhang ML, Zhou ZH (2007) ML-kNN: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048
Zhang H, Berg AC, Maire M, Malik J (2006) SVM-kNN: discriminative nearest neighbor classification for visual category recognition. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 2126–2136
Zhu J, Rosset S, Zou H, Hastie T (2009) Multi-class adaboost. Stat Interface 2:349–360
Zuo W, Zhang D, Wang K (2008) On kernel difference-weighted k-nearest neighbor classification. Pattern Anal Appl 11(3–4):247–257
Appendix
Generic UNN Algorithm
The general version of UNN is shown in Algorithm 2. This algorithm induces the leveraged k-NN rule (12.10) for the broad class of surrogate losses meeting the conditions of [4], thus generalizing Algorithm 1. Namely, we constrain ψ to meet the following conditions: (i) \(\mathrm{im}(\psi) = {\mathbb{R}}_{+}\), (ii) \(\nabla_{\psi}(0) < 0\) (where \(\nabla_{\psi}\) denotes the conventional derivative of the loss ψ), and (iii) ψ is strictly convex and differentiable. Conditions (i) and (ii) imply that ψ is classification-calibrated: its local minimization is closely tied to that of the empirical risk [4]. Condition (iii) implies convenient algorithmic properties for the minimization of the surrogate risk [28]. Three common examples have been shown in Eqs. (12.6)–(12.7).
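Conditions (i)–(iii) are easy to check numerically. The sketch below (an assumption: it uses the standard exponential, logistic, and squared losses rather than the chapter's exact expressions) verifies non-negativity, a negative derivative at 0, and convexity by finite differences:

```python
import numpy as np

def psi_exp(x):
    return np.exp(-x)                # exponential loss (assumed standard form)

def psi_log(x):
    return np.log1p(np.exp(-x))      # logistic loss (assumed standard form)

def psi_squ(x):
    return (1.0 - x) ** 2            # squared loss (assumed standard form)

def check_calibrated(psi, xs=np.linspace(-3.0, 3.0, 601), h=1e-5):
    """Check (i) im(psi) in R+, (ii) psi'(0) < 0, (iii) convexity,
    on a grid, via finite differences."""
    vals = psi(xs)
    grad0 = (psi(h) - psi(-h)) / (2.0 * h)          # central difference at 0
    second = vals[:-2] - 2.0 * vals[1:-1] + vals[2:]  # discrete second differences
    return bool((vals >= 0).all() and grad0 < 0 and (second > -1e-12).all())
```

All three losses pass, which is consistent with their being classification-calibrated in the sense of [4].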
The main bottleneck of UNN is step [I.1], as Eq. (12.30) is non-linear; however, it always has a solution, which is finite under mild assumptions [28]: in our case, δ j is guaranteed to be finite whenever there is no total matching or mismatching of example j's memberships with its reciprocal neighbors', for the class at hand. The second column of Table 12.5 contains the solutions to (12.30) for the surrogate losses mentioned in Sect. 12.2.2. Those solutions are always exact for the exponential loss (ψ exp) and the squared loss (ψ squ); for the logistic loss (ψ log) the solution is exact when the weights in the reciprocal neighborhood of j are all equal, and approximate otherwise. Since the starting weights are all equal, exactness can be guaranteed over a large number of inner rounds, depending on the order in which the examples are chosen. Table 12.5 helps to formalize the finiteness condition on δ j mentioned above: when either sum of weights in (12.29) is zero, the solutions in the first and third lines of Table 12.5 are not finite. A simple strategy to cope with the numerical problems arising in such situations is the one proposed by [35] (see Sect. 12.2.4). Table 12.5 also shows how the weight update rule (12.31) specializes for the mentioned losses.
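For the exponential loss, step [I.1] admits an AdaBoost-style closed form. The sketch below is an assumption based on that familiar shape, not a transcription of Table 12.5; it also illustrates the finiteness issue and an eps-smoothing in the spirit of [35]:

```python
import math

def leveraging_coefficient_exp(weights, agreements, eps=1e-8):
    """delta_j = 0.5 * log(W+ / W-), where W+ (resp. W-) sums the weights of
    reciprocal neighbors whose class membership agrees (resp. disagrees) with
    example j's. Without eps, delta_j blows up on total (mis)matching."""
    w_plus = sum(w for w, r in zip(weights, agreements) if r > 0)
    w_minus = sum(w for w, r in zip(weights, agreements) if r < 0)
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))

def update_weights_exp(weights, agreements, delta):
    """Multiplicative update: agreeing neighbors get lighter,
    disagreeing ones heavier."""
    return [w * math.exp(-delta * r) for w, r in zip(weights, agreements)]
```

With uniform starting weights and three agreements against one disagreement, δ j is positive, and the update shifts mass toward the disagreeing neighbor, as expected of a boosting scheme.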
Proof Sketch of Theorem 12.1
We show that UNN converges to the global optimum of any surrogate risk (Sect. 12.2.5). For this purpose, let us consider the surrogate risk (12.5) for a given class c=1,2,…,C:
In this section, we use the following notations:
-
\(\tilde{\psi}(x) \stackrel {\mathrm {.}}{=}\psi^{\star}(-x)\), where \(\psi^{\star}(x) \stackrel {\mathrm {.}}{=}x\nabla_{\psi}^{-1}(x) - \psi(\nabla^{-1}_{\psi}(x))\) is the Legendre conjugate of ψ, which is strictly convex and differentiable as well. (\(\tilde{\psi}\) is related to ψ in such a way that: \(\nabla_{\tilde{\psi}}(x) = - \nabla^{-1}_{\psi}(-x)\).)
-
\(D_{\tilde{\psi}}(w_{i} \| w'_{i}) \stackrel {\mathrm {.}}{=}{\tilde{\psi}}(w_{i}) - {\tilde{\psi}}(w'_{i}) - (w_{i} - w'_{i})\nabla_{\tilde{\psi}}(w'_{i})\) is the Bregman divergence with generator \({\tilde{\psi}}\) [28].
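The two properties of Bregman divergences used below (non-negativity and the identity of the indiscernibles) can be checked numerically. This small sketch uses an illustrative strictly convex generator, x log x − x (chosen for convenience; the actual \(\tilde{\psi}\) depends on the surrogate at hand):

```python
import math

def gen(x):
    """Illustrative strictly convex generator on (0, inf)."""
    return x * math.log(x) - x

def grad_gen(x):
    return math.log(x)

def bregman(w, w_prime):
    """D(w || w') = gen(w) - gen(w') - (w - w') * grad_gen(w')."""
    return gen(w) - gen(w_prime) - (w - w_prime) * grad_gen(w_prime)
```

Note that the divergence is in general asymmetric, which is why the space of weights is called non-metric below.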
Let w t denote the tth weight vector inside the “for c” loop of Algorithm 2 (assuming w 0 is the initialization of w); similarly, \(\boldsymbol{h}^{\ell}_{t}\) denotes the tth leveraged k-NN rule obtained after the update in [I.3]. The following fundamental identity holds, whose proof follows from [28]:
where \(g(m) \stackrel {\mathrm {.}}{=}-\tilde{\psi}(0)\) does not depend on the k-NN rule. In particular, Eq. (12.33) makes the connection between the real-valued classification problem and a geometric problem in the non-metric space of weights. Moreover, Eq. (12.33) comes in handy when one computes the difference between two successive surrogates: \(\varepsilon ^{\psi }_{c}(\boldsymbol{h}^{\ell}_{t+1}, {\mathcal{S}}) - \varepsilon ^{\psi }_{c}(\boldsymbol{h}^{\ell}_{t}, {\mathcal{S}})\). Indeed, plugging Eq. (12.33) into Eq. (12.32), and computing δ j in Eq. (12.30) so as to obtain \(\boldsymbol{h}^{\ell}_{t+1}\) from \(\boldsymbol{h}^{\ell}_{t}\), we obtain the following identity:
Since Bregman divergences are non-negative and satisfy the identity of the indiscernibles, (12.34) implies that steps [I.1]–[I.3] guarantee the decrease of (12.32) as long as δ j ≠0. But (12.32) is bounded from below; hence UNN must converge.
In addition, it converges to the global optimum of the risk (12.5). Since predictions for each class are independent, the proof consists in showing that (12.32) converges to its global minimum for each c. Let us assume this convergence for the current class c. Then, following the reasoning of Nock and Nielsen [28], (12.30) and (12.31) imply that, when every possible δ j =0, the weight vector, say w ∞, satisfies \(\mathrm{r}^{(c)\top} \boldsymbol{w}_{\infty} = 0\), that is, \(\boldsymbol{w}_{\infty} \in \ker \mathrm{r}^{(c)\top}\), and w ∞ is unique. But the kernel of \(\mathrm{r}^{(c)\top}\) and \(\overline{\mathbb{W}}\), the closure of \({\mathbb{W}}\) (i.e., the manifold where the w's live), are provably Bregman orthogonal [28], thus yielding:
The underbraces use (12.33) in (12.32), where h ℓ is the leveraged k-NN rule corresponding to w. One obtains that \(\boldsymbol{h}^{\ell}_{\infty}\) achieves the global minimum of (12.32), as claimed.
The proof sketch is graphically summarized in Fig. 12.11. In particular, two crucial Bregman orthogonalities are depicted [28]. The red one symbolizes:
which is equivalent to (12.34). The black one on w ∞ is (12.35).
Proof Sketch of Theorem 12.2
Using developments analogous to those of [28], UNN can be shown to be equivalent to AdaBoost with m available weak classifiers, each one being an example. Each weak classifier returns a value in {−1,0,1}, where 0 is reserved for examples outside the reciprocal neighborhood. Theorem 3 of [35] yields, in our case:
where \(Z^{(c)}_{t} \stackrel {\mathrm {.}}{=}\sum_{i=1}^{m} {\tilde{w}^{(c)}_{it}}\) is the normalizing coefficient for each weight vector in UNN. (\(\tilde{w}^{(c)}_{it}\) denotes the weight of example i at iteration (t,c) of UNN, and the tilde notation refers to weights normalized to unity at each step.) It follows that:
where \(\tilde{w}^{(c)+-}_{jt} \stackrel {\mathrm {.}}{=}\tilde{w}^{(c)+}_{jt} + \tilde{w}^{(c)-}_{jt}\) and \(p^{(c)}_{jt} \stackrel {\mathrm {.}}{=}\tilde{w}^{(c)+}_{jt} / \tilde{w}^{(c)+-}_{jt} = w^{(c)+}_{jt} / w^{(c)+-}_{jt}\). The first inequality uses 1−x≤exp(−x), and the second uses the (WIA). Since \(Z^{(c)}_{t} \leq 1\) even when the (WIA) does not hold, plugging the last inequality into (12.37) yields the statement of the theorem.
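The quantitative core of such AdaBoost-style arguments is the per-round normalizer bound: for weighted accuracy p = 1/2 + γ under the (WIA), the standard bound gives 2√(p(1−p)) = √(1−4γ²) ≤ exp(−2γ²), using 1−x ≤ exp(−x). A quick numeric confirmation (the exact constants in the chapter's statement may differ):

```python
import math

def z_bound(p):
    """Standard AdaBoost-style per-round normalizer bound 2*sqrt(p(1-p))
    for a weak classifier with weighted accuracy p."""
    return 2.0 * math.sqrt(p * (1.0 - p))
```

The bound equals 1 at p = 1/2 (no edge) and shrinks geometrically with the edge γ, which is what drives the exponential decrease of the empirical risk in T.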
Proof Sketch of Theorem 12.3
We make the iteration t and the class c explicit in the weight notation, so that \(w_{ti}^{(c)}\) denotes the weight of example x i prior to iteration t for class c in UNN (inside the "for c" loop of Algorithm 2, letting w 0 denote the initial value of w). To save space in some computations below, we also use the shorthand:
That ψ is ω-strongly smooth is equivalent to \(\tilde{\psi}\) being strongly convex with parameter ω −1 [22], that is,
is convex. Here, we have made use of the following notation: \(\tilde{\psi}(x) \stackrel {\mathrm {.}}{=}\psi^{\star}(-x)\), where \(\psi^{\star}(x) \stackrel {\mathrm {.}}{=}x\nabla_{\psi}^{-1}(x) - \psi(\nabla^{-1}_{\psi}(x))\) is the Legendre conjugate of ψ. Since a convex function h satisfies h(w′)≥h(w)+∇ h (w)(w′−w), applying this inequality to the function in (12.39) yields, ∀t=1,2,…,T, ∀i=1,2,…,m, ∀c=1,2,…,C:
where we recall that D ψ denotes the Bregman divergence with generator ψ (12.22). On the other hand, the Cauchy–Schwarz inequality yields:
The equality in (12.41) holds because \(\sum_{i: j \sim_{k} i} {\mathrm{r}^{(c)}_{ij}w^{(c)}_{(t+1)i}} = 0\), which is exactly (12.30). We obtain:
Here, (12.42) follows from (12.40), (12.43) follows from (12.41), and (12.44) follows from (12.20). Adding (12.44) for c=1,2,…,C and t=1,2,…,T, and then dividing by C, we obtain:
We now work on the large parenthesized term, which depends solely upon the examples. We have:
Here, (12.46) holds because of the arithmetic-geometric-harmonic mean inequality, and (12.47) is Young's inequality (Footnote 7) with p=q=2. Plugging (12.48) into (12.45), we obtain:
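The special case of Young's inequality used here, p = q = 2, is simply yy′ ≤ y²/2 + y′²/2, with equality iff y = y′. A quick check of the general statement from Footnote 7:

```python
def young_rhs(y, yp, p=2.0):
    """Right-hand side y^p/p + y'^q/q of Young's inequality, for Hölder
    conjugates p, q with 1/p + 1/q = 1 and y, y' >= 0."""
    q = p / (p - 1.0)
    return y ** p / p + yp ** q / q
```

For p = q = 2 this is the familiar consequence of (y − y′)² ≥ 0.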
Now, UNN meets the following property, which can easily be shown to hold with our class encoding as well:
Adding (12.50) for t=0,1,…,T−1 and c=1,2,…,C, we obtain:
Plugging (12.49) into (12.51), we obtain:
But the following inequality holds between the average surrogate risk and the empirical risk of the leveraged k-NN rule \(\boldsymbol{h}^{\ell}_{T}\), because of (i):
so that, putting (12.52) and (12.53) together and using the fact that ψ(0)>0 because of (i)–(ii), we have after T rounds of boosting for each class: that is,
It remains to compute the minimal value of T for which the right-hand side of (12.54) becomes no greater than some user-fixed τ∈[0,1], to obtain the bound in (12.23).
Copyright information
© 2013 Springer-Verlag London
Cite this chapter
Piro, P., Nock, R., Bel Haj Ali, W., Nielsen, F., Barlaud, M. (2013). Boosting k-Nearest Neighbors Classification. In: Farinella, G., Battiato, S., Cipolla, R. (eds) Advanced Topics in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5520-1_12
Print ISBN: 978-1-4471-5519-5
Online ISBN: 978-1-4471-5520-1