Boosting k-Nearest Neighbors Classification

  • Chapter

Part of the book series: Advances in Computer Vision and Pattern Recognition ((ACVPR))

Abstract

A major drawback of the k-nearest neighbors (k-NN) rule is its high variance when dealing with sparse prototype datasets in high dimensions. Most techniques proposed for improving k-NN classification rely either on deforming the k-NN relationship by learning a distance function or on modifying the input space by means of subspace selection. Here we propose a novel boosting approach for generalizing the k-NN rule. Namely, we redefine the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. Our algorithm, called UNN (Universal Nearest Neighbors), relies on the k-nearest-neighbor examples as weak classifiers and learns their weights so as to minimize a surrogate risk. These weights, called leveraging coefficients, allow us to distinguish the most relevant prototypes for a given class. Results obtained on several scene categorization datasets demonstrate the ability of UNN to compete with or beat state-of-the-art methods, while achieving comparatively small training and testing times.


Notes

  1. A surrogate is a function that is a suitable upper bound for another function (here, the non-convex, non-differentiable empirical risk).

  2. The implementation by the authors is available at: http://people.csail.mit.edu/torralba/code/spatialenvelope/sceneRecognition.m.

  3. The MAP was computed by averaging classification rates over categories (diagonal of the confusion matrix) and then averaging those values after repeating each experiment 10 times on different folds.

  4. Code available at http://www.vlfeat.org/.

  5. Code available at http://www.irisa.fr/texmex/people/jegou/src.php.

  6. For AdaBoost, we used the code available at http://www.mathworks.com/matlabcentral/fileexchange/22997-multiclass-gentleadaboosting.

  7. We recall Young's inequality: for any Hölder conjugates p, q (p>1, (1/p)+(1/q)=1) and any y, y′≥0, we have \(yy' \leq y^{p}/p + y'^{q}/q\).

References

  1. Amores J, Sebe N, Radeva P (2006) Boosting the distance estimation: application to the k-nearest neighbor classifier. Pattern Recognit Lett 27(3):201–209
  2. Athitsos V, Alon J, Sclaroff S, Kollios G (2008) BoostMap: an embedding method for efficient nearest neighbor retrieval. IEEE Trans Pattern Anal Mach Intell 30(1):89–104
  3. Bartlett P, Traskin M (2007) AdaBoost is consistent. J Mach Learn Res 8:2347–2368
  4. Bartlett P, Jordan M, McAuliffe JD (2006) Convexity, classification, and risk bounds. J Am Stat Assoc 101:138–156
  5. Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771
  6. Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Min Knowl Discov 6:153–172
  7. Cucala L, Marin JM, Robert CP, Titterington DM (2009) A Bayesian reassessment of nearest-neighbor classification. J Am Stat Assoc 104(485):263–273
  8. Dudani S (1976) The distance-weighted k-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6(4):325–327
  9. Escolano Ruiz F, Suau Pérez P, Bonev BI (2009) Information theory in computer vision and pattern recognition. Springer, London
  10. Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 524–531
  11. Fukunaga K, Flick T (1984) An optimal global nearest neighbor metric. IEEE Trans Pattern Anal Mach Intell 6(3):314–318
  12. García-Pedrajas N, Ortiz-Boyer D (2009) Boosting k-nearest neighbor classifier by means of input space projection. Expert Syst Appl 36(7):10570–10582
  13. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proc international conference on very large databases, pp 518–529
  14. Grauman K, Darrell T (2005) The pyramid match kernel: discriminative classification with sets of image features. In: IEEE international conference on computer vision (ICCV), pp 1458–1465
  15. Gupta L, Pathangay V, Patra A, Dyana A, Das S (2007) Indoor versus outdoor scene classification using probabilistic neural network. EURASIP J Appl Signal Process 2007(1):123
  16. Bel Haj Ali W, Piro P, Crescence L, Giampaglia D, Ferhat O, Darcourt J, Pourcher T, Barlaud M (2012) Changes in the subcellular localization of a plasma membrane protein studied by bioinspired UNN learning classification of biologic cell images. In: International conference on computer vision theory and applications (VISAPP)
  17. Hart PE (1968) The condensed nearest neighbor rule. IEEE Trans Inf Theory 14:515–516
  18. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616
  19. Holmes CC, Adams NM (2003) Likelihood inference in nearest-neighbour classification models. Biometrika 90:99–112
  20. Hsu CW, Chang CC, Lin CJ (2003) A practical guide to support vector classification. Technical report
  21. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
  22. Kakade S, Shalev-Shwartz S, Tewari A (2009) Applications of strong convexity–strong smoothness duality to learning with matrices. Technical report
  23. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 2169–2178
  24. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
  25. Masip D, Vitrià J (2006) Boosted discriminant projections for nearest neighbor classification. Pattern Recognit 39(2):164–170
  26. Nguyen X, Wainwright MJ, Jordan MI (2009) On surrogate loss functions and f-divergences. Ann Stat 37:876–904
  27. Nock R, Nielsen F (2009) Bregman divergences and surrogates for learning. IEEE Trans Pattern Anal Mach Intell 31(11):2048–2059
  28. Nock R, Nielsen F (2009) On the efficient minimization of classification calibrated surrogates. In: Advances in neural information processing systems (NIPS), vol 21, pp 1201–1208
  29. Nock R, Sebban M (2001) An improved bound on the finite-sample risk of the nearest neighbor rule. Pattern Recognit Lett 22(3/4):407–412
  30. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
  31. Paredes R (2006) Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Trans Pattern Anal Mach Intell 28(7):1100–1110
  32. Payne A, Singh S (2005) Indoor vs. outdoor scene classification in digital photographs. Pattern Recognit 38(10):1533–1545
  33. Piro P, Nock R, Nielsen F, Barlaud M (2012) Leveraging k-NN for generic classification boosting. Neurocomputing 80:3–9
  34. Quattoni A, Torralba A (2009) Recognizing indoor scenes. In: IEEE computer society conference on computer vision and pattern recognition (CVPR)
  35. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn J 37:297–336
  36. Serrano N, Savakis AE, Luo JB (2004) Improved scene classification using efficient low-level features and semantic cues. Pattern Recognit 37:1773–1784
  37. Shakhnarovich G, Darrell T, Indyk P (2006) Nearest-neighbor methods in learning and vision. MIT Press, Cambridge
  38. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: IEEE international conference on computer vision (ICCV), vol 2, pp 1470–1477
  39. Swain MJ, Ballard DH (1991) Color indexing. Int J Comput Vis 7:11–32
  40. Torralba A, Murphy K, Freeman W, Rubin M (2003) Context-based vision system for place and object recognition. In: IEEE international conference on computer vision (ICCV), pp 273–280
  41. Vedaldi A, Fulkerson B (2008) VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org
  42. Vogel J, Schiele B (2007) Semantic modeling of natural scenes for content-based image retrieval. Int J Comput Vis 72(2):133–157
  43. Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) SUN database: large-scale scene recognition from abbey to zoo. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 3485–3492
  44. Yu K, Ji L, Zhang X (2002) Kernel nearest-neighbor algorithm. Neural Process Lett 15(2):147–156
  45. Yuan M, Wegkamp M (2010) Classification methods with reject option based on convex risk minimization. J Mach Learn Res 11:111–130
  46. Zhang ML, Zhou ZH (2007) ML-kNN: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048
  47. Zhang H, Berg AC, Maire M, Malik J (2006) SVM-kNN: discriminative nearest neighbor classification for visual category recognition. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 2126–2136
  48. Zhu J, Rosset S, Zou H, Hastie T (2009) Multi-class AdaBoost. Stat Interface 2:349–360
  49. Zuo W, Zhang D, Wang K (2008) On kernel difference-weighted k-nearest neighbor classification. Pattern Anal Appl 11(3–4):247–257


Author information

Corresponding author

Correspondence to Paolo Piro.


Appendix

Generic UNN Algorithm

The general version of UNN is shown in Algorithm 2. This algorithm induces the leveraged k-NN rule (12.10) for the broad class of surrogate losses meeting the conditions of [4], thus generalizing Algorithm 1. Namely, we constrain ψ to meet the following conditions: (i) \(\mathrm{im}(\psi) = {\mathbb{R}}_{+}\), (ii) \(\nabla_{\psi}(0) < 0\) (where \(\nabla_{\psi}\) denotes the conventional derivative of the loss function ψ), and (iii) ψ is strictly convex and differentiable. Conditions (i) and (ii) imply that ψ is classification-calibrated: its local minimization is roughly tied to that of the empirical risk [4]. Condition (iii) implies convenient algorithmic properties for the minimization of the surrogate risk [28]. Three common examples have been shown in Eqs. (12.6)–(12.7).
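As an illustrative, non-authoritative check of conditions (i)–(iii), the short sketch below probes them numerically for the three surrogates commonly used in this framework, with the standard parameterizations \(\psi_{\exp}(x)=e^{-x}\), \(\psi_{\log}(x)=\log(1+e^{-x})\), \(\psi_{\mathrm{squ}}(x)=(1-x)^2\); the exact forms given in Sect. 12.2.2 may differ slightly, so these are assumptions.

```python
import numpy as np

# Assumed standard forms of the three surrogates (the exact parameterizations
# of Sect. 12.2.2 may differ slightly).
SURROGATES = {
    "exp": lambda x: np.exp(-x),
    "log": lambda x: np.log1p(np.exp(-x)),
    "squ": lambda x: (1.0 - x) ** 2,
}

def check_conditions(name, psi, eps=1e-5):
    """Numerically probe (i) im(psi) in R+, (ii) psi'(0) < 0, (iii) convexity."""
    xs = np.linspace(-10.0, 10.0, 2001)
    vals = psi(xs)
    grad0 = (psi(eps) - psi(-eps)) / (2.0 * eps)                 # derivative at 0
    second_diff = psi(xs - eps) - 2.0 * psi(xs) + psi(xs + eps)  # convexity probe
    print(f"{name}: im >= 0: {bool(np.all(vals >= 0))}, "
          f"psi'(0) = {grad0:.3f} (< 0: {bool(grad0 < 0)}), "
          f"convex: {bool(np.all(second_diff >= -1e-9))}")

for name, psi in SURROGATES.items():
    check_conditions(name, psi)
```

All three surrogates pass the numerical checks, which is consistent with their use as classification-calibrated losses in the text.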

Algorithm 2

Universal Nearest Neighbors UNN(\(\mathcal{S},\psi\))

The main bottleneck of UNN is step [I.1], as Eq. (12.30) is non-linear; it always has a solution, however, which is finite under mild assumptions [28]: in our case, \(\delta_j\) is guaranteed to be finite whenever example j's class memberships neither totally match nor totally mismatch those of its reciprocal neighbors, for the class at hand. The second column of Table 12.5 contains the solutions to (12.30) for the surrogate losses mentioned in Sect. 12.2.2. Those solutions are always exact for the exponential loss (\(\psi_{\exp}\)) and the squared loss (\(\psi_{\mathrm{squ}}\)); for the logistic loss (\(\psi_{\log}\)), the solution is exact when the weights in the reciprocal neighborhood of j are all equal, and approximate otherwise. Since the starting weights are all equal, exactness can be guaranteed for a large number of inner rounds, depending on the order in which the examples are chosen. Table 12.5 helps to formalize the finiteness condition on \(\delta_j\) mentioned above: when either sum of weights in (12.29) is zero, the solutions in the first and third lines of Table 12.5 are not finite. A simple strategy to cope with the numerical problems arising from such situations is the one proposed by [35] (see Sect. 12.2.4). Table 12.5 also shows how the weight update rule (12.31) specializes for the mentioned losses.
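To make step [I.1] concrete, here is a minimal sketch for the exponential loss, assuming the closed form takes the familiar AdaBoost-like shape \(\delta_j = \frac{1}{2}\log(W^{+}_{j}/W^{-}_{j})\), where \(W^{\pm}_{j}\) sum the weights of j's reciprocal neighbors whose class-c membership respectively matches or mismatches that of j, together with the ε-smoothing of [35] mentioned above. The authoritative expressions are those of Table 12.5; the forms below are assumptions written for a ±1-valued agreement encoding.

```python
import numpy as np

def leverage_exp(w, r_col, neighbors, eps=1e-8):
    """One step [I.1]+[I.3] for the exponential loss (assumed closed form).

    w         : current positive weights of all m examples
    r_col     : column j of R^(c), assumed here to take values in {-1, 0, +1}
                (the chapter's class encoding may rescale these entries)
    neighbors : indices i such that j ~_k i (reciprocal neighborhood of j)
    """
    nb = np.asarray(neighbors)
    w_plus = float(np.sum(w[nb] * (r_col[nb] > 0)))
    w_minus = float(np.sum(w[nb] * (r_col[nb] < 0)))
    # epsilon-smoothing in the spirit of Schapire & Singer [35]: keeps delta_j finite
    delta_j = 0.5 * np.log((w_plus + eps) / (w_minus + eps))
    w_new = w.copy()
    # exponential-loss instance of the weight update (12.31), as assumed here
    w_new[nb] = w_new[nb] * np.exp(-delta_j * r_col[nb])
    return delta_j, w_new

# Tiny usage example with made-up weights and agreement labels:
w0 = np.ones(6)
r_j = np.array([+1.0, -1.0, +1.0, 0.0, +1.0, -1.0])
delta, w1 = leverage_exp(w0, r_j, neighbors=[0, 1, 2, 4, 5])
print(delta, w1)
```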

Table 12.5 Three common loss functions and the corresponding solutions \(\delta_j\) of (12.30) and \(w_i\) of (12.31). (Vector \(\boldsymbol{r}^{(c)}_{j}\) designates column j of \(\mathrm{R}^{(c)}\), and \(\|\cdot\|_1\) is the \(L_1\) norm.) The rightmost column says whether the formula is (A)lways the exact solution, or whether it is exact only when the weights of the reciprocal neighbors of j are the (S)ame

Proofsketch of Theorem 12.1

We show that UNN converges to the global optimum of any surrogate risk (Sect. 12.2.5). For this purpose, let us consider the surrogate risk (12.5) for a given class c=1,2,…,C:

$$ \varepsilon ^{\psi }_c(\boldsymbol{h}, {\mathcal{S}}) \stackrel {\mathrm {.}}{=}\frac{1}{m} \sum_{i=1}^{m}{\psi\bigl(\varrho(\boldsymbol{h},i,c) \bigr)} . $$
(12.32)
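For concreteness, (12.32) can be evaluated as follows, assuming (as in (12.53) below) that the edge is \(\varrho(\boldsymbol{h}^{\ell},i,c) = y_{ic}\sum_{j: j\sim_k i}\alpha_{jc}y_{jc}\); the data layout and encoding below are placeholders, not the chapter's implementation.

```python
import numpy as np

def surrogate_risk(y, alpha, neighbors, c, psi):
    """Empirical surrogate risk (12.32) for class c of a leveraged k-NN rule.

    y         : (m, C) array of class memberships y_ic
    alpha     : (m, C) array of leveraging coefficients alpha_jc
    neighbors : list of index arrays, neighbors[i] = {j : j ~_k i}
    psi       : surrogate loss, e.g. lambda x: np.exp(-x)
    """
    m = y.shape[0]
    edges = np.array([y[i, c] * np.sum(alpha[neighbors[i], c] * y[neighbors[i], c])
                      for i in range(m)])   # rho(h, i, c) under the assumption above
    return float(np.mean(psi(edges)))
```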

In this section, we use the following notations:

  • \(\tilde{\psi}(x) \stackrel {\mathrm {.}}{=}\psi^{\star}(-x)\), where \(\psi^{\star}(x) \stackrel {\mathrm {.}}{=}x\nabla_{\psi}^{-1}(x) - \psi(\nabla^{-1}_{\psi}(x))\) is the Legendre conjugate of ψ, which is strictly convex and differentiable as well. (\(\tilde{\psi}\) is related to ψ in such a way that: \(\nabla_{\tilde{\psi}}(x) = - \nabla^{-1}_{\psi}(-x)\).)

  • \(D_{\tilde{\psi}}(w_{i} \| w'_{i}) \stackrel {\mathrm {.}}{=}{\tilde{\psi}}(w_{i}) - {\tilde{\psi}}(w'_{i}) - (w_{i} - w'_{i})\nabla_{\tilde{\psi}}(w'_{i})\) is the Bregman divergence with generator \({\tilde{\psi}}\) [28].
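As a worked instantiation of these notations for the exponential surrogate (assuming ψ(x)=exp(−x), which may differ from the chapter's exact parameterization):

$$ \nabla_{\psi}(x) = -e^{-x}, \qquad \nabla_{\psi}^{-1}(y) = -\log(-y), \qquad \psi^{\star}(y) = y\nabla_{\psi}^{-1}(y) - \psi\bigl(\nabla_{\psi}^{-1}(y)\bigr) = -y\log(-y) + y , $$

so that \(\tilde{\psi}(x) = \psi^{\star}(-x) = x\log x - x\), \(\nabla_{\tilde{\psi}}(x) = \log x = -\nabla_{\psi}^{-1}(-x)\) as stated above, and \(D_{\tilde{\psi}}(w \| w') = w\log(w/w') - w + w'\), the (unnormalized) Kullback–Leibler divergence between weights.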

Let \(\boldsymbol{w}_t\) denote the tth weight vector inside the “for c” loop of Algorithm 2 (with \(\boldsymbol{w}_0\) denoting the initialization of w); similarly, \(\boldsymbol{h}^{\ell}_{t}\) denotes the tth leveraged k-NN rule obtained after the update in [I.3]. The following fundamental identity holds, whose proof follows from [28]:

$$\begin{aligned} \psi\bigl(\varrho\bigl(\boldsymbol{h}_t^\ell,i,c\bigr)\bigr) = & g + D_{\tilde{\psi}} (0 \| w_{ti} ), \end{aligned}$$
(12.33)

where \(g(m) \stackrel {\mathrm {.}}{=}-\tilde{\psi}(0)\) does not depend on the k-NN rule. In particular, Eq. (12.33) makes the connection between the real-valued classification problem and a geometric problem in the non-metric space of weights. Moreover, Eq. (12.33) comes in handy when computing the difference between two successive surrogate risks: \(\varepsilon ^{\psi }_{c}(\boldsymbol{h}^{\ell}_{t+1}, {\mathcal{S}}) - \varepsilon ^{\psi }_{c}(\boldsymbol{h}^{\ell}_{t}, {\mathcal{S}})\). Indeed, plugging Eq. (12.33) into Eq. (12.32), and computing \(\delta_j\) in Eq. (12.30) so as to obtain \(\boldsymbol{h}^{\ell}_{t+1}\) from \(\boldsymbol{h}^{\ell}_{t}\), we obtain the following identity:

$$ \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{t+1}, {\mathcal{S}}\bigr) - \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_t, {\mathcal{S}}\bigr) = - \frac{1}{m} \sum_{i=1}^{m} {D_{\tilde{\psi}} (w_{(t+1)i} \| w_{ti} )}. $$
(12.34)

Since Bregman divergences are nonnegative and satisfy the identity of indiscernibles, (12.34) implies that steps [I.1]–[I.3] guarantee the decrease of (12.32) as long as \(\delta_j \neq 0\). But (12.32) is bounded from below, hence UNN must converge.
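For intuition, identity (12.33) can be verified numerically in the exponential-loss case, under the assumptions that ψ(x)=exp(−x), that the induced generator is \(\tilde{\psi}(w)=w\log w - w\), and that the weight attached to an edge value ϱ is w=exp(−ϱ); then \(g=-\tilde{\psi}(0)=0\) and \(D_{\tilde{\psi}}(0\|w)=w=\psi(\varrho)\).

```python
import numpy as np

def psi(x):                      # exponential surrogate (assumed form)
    return np.exp(-x)

def psi_tilde(w):                # w*log(w) - w, with the convention 0*log(0) = 0
    w = np.asarray(w, dtype=float)
    safe = np.where(w > 0, w, 1.0)
    return np.where(w > 0, w * np.log(safe) - w, 0.0)

def bregman(a, b):               # D_{psi_tilde}(a || b)
    return psi_tilde(a) - psi_tilde(b) - (a - b) * np.log(b)

rho = np.linspace(-3.0, 3.0, 7)  # edge values rho(h, i, c)
w = np.exp(-rho)                 # weights attached to those edges (assumption)
g = -psi_tilde(0.0)              # equals 0 for this loss
lhs = psi(rho)
rhs = g + bregman(np.zeros_like(w), w)
print(np.allclose(lhs, rhs))     # True: (12.33) holds pointwise here
```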

In addition, it converges to the global optimum of the risk (12.5). Since predictions for each class are independent, the proof consists in showing that (12.32) converges to its global minimum for each c. Let us assume this convergence for the current class c. Then, following the reasoning of Nock and Nielsen [28], (12.30) and (12.31) imply that, when any possible \(\delta_j = 0\), the weight vector, say \(\boldsymbol{w}_{\infty}\), satisfies \(\mathrm{R}^{(c)}\boldsymbol{w}_{\infty} = \boldsymbol{0}\), that is, \(\boldsymbol{w}_{\infty} \in \ker \mathrm{R}^{(c)}\), and \(\boldsymbol{w}_{\infty}\) is unique. But the kernel of \(\mathrm{R}^{(c)}\) and \(\overline{\mathbb{W}}\), the closure of \({\mathbb{W}}\) (i.e., the manifold where the w's live), are provably Bregman orthogonal [28], thus yielding:

$$ \underbrace{\sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_i )}}_{m\varepsilon ^{\psi }_c(\boldsymbol{h}^\ell, {\mathcal{S}}) - mg} = \underbrace{ \sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_{\infty i} )}}_{m\varepsilon ^{\psi }_c(\boldsymbol{h}_\infty^\ell, {\mathcal{S}}) - mg} + \underbrace{\sum_{i=1}^{m} {D_{\tilde{\psi}} (w_{\infty i} \| w_i )}}_{\geq 0},\quad \forall \boldsymbol{w} \in \overline{\mathbb{W}}. $$
(12.35)

The underbraces follow from using (12.33) in (12.32), and \(\boldsymbol{h}^{\ell}\) is the leveraged k-NN rule corresponding to w. One obtains that \(\boldsymbol{h}^{\ell}_{\infty}\) achieves the global minimum of (12.32), as claimed.

The proofsketch is graphically summarized in Fig. 12.11. In particular, two crucial Bregman orthogonalities are mentioned [28]. The red one symbolizes:

$$ \sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_{ti} )} = \sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_{(t+1)i} )} + \sum_{i=1}^{m} {D_{\tilde{\psi}} (w_{(t+1)i} \| w_{ti} )} , $$
(12.36)

which is equivalent to (12.34). The black one on \(\boldsymbol{w}_{\infty}\) is (12.35).
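The “red” orthogonality (12.36) can likewise be checked numerically under the same exponential-loss assumptions: one update of the form \(w_{(t+1)i} = w_{ti}\exp(-\delta r_i)\), with δ chosen so that \(\sum_i r_i w_{(t+1)i} = 0\) (a scalar root-finding rendition of (12.30) for that loss), makes the three-point identity hold exactly. This is a sketch, not the chapter's implementation.

```python
import numpy as np
from scipy.optimize import brentq   # any scalar root finder would do

def psi_tilde(w):                    # generator for the exponential loss (assumed)
    return w * np.log(w) - w

def bregman(a, b):                   # D_{psi_tilde}(a || b), with 0*log(0) = 0
    a = np.asarray(a, dtype=float)
    a_term = np.where(a > 0, a * np.log(np.where(a > 0, a, 1.0)) - a, 0.0)
    return a_term - psi_tilde(b) - (a - b) * np.log(b)

w_t = np.array([0.7, 1.3, 0.9, 2.0, 0.5, 1.1, 1.6, 0.8])            # current weights
r = np.array([+1.0, +1.0, -1.0, +1.0, -1.0, -1.0, +1.0, -1.0])      # agreement labels

# delta solves sum_i r_i w_ti exp(-delta r_i) = 0, i.e. (12.30) for this loss
delta = brentq(lambda d: np.sum(r * w_t * np.exp(-d * r)), -50.0, 50.0)
w_next = w_t * np.exp(-delta * r)            # update, exp-loss form of (12.31)

zero = np.zeros_like(w_t)
lhs = bregman(zero, w_t).sum()
rhs = bregman(zero, w_next).sum() + bregman(w_next, w_t).sum()
print(np.isclose(lhs, rhs))                  # True: the identity (12.36) holds
```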

Fig. 12.11

A geometric view of how UNN converges to the global optimum of (12.5). (See Appendix for details and notations.)

Proofsketch of Theorem 12.2

Using developments analogous to those of [28], UNN can be shown to be equivalent to AdaBoost with m available weak classifiers, each one being an example. Each weak classifier returns a value in {−1,0,1}, where 0 is reserved for examples outside the reciprocal neighborhood. Theorem 3 of [35] gives, in our case:

$$\begin{aligned} \varepsilon ^{0/1}\bigl(\boldsymbol{h}^\ell, {\mathcal{S}}\bigr) \leq & \frac{1}{C} \sum_{c=1}^{C}{\prod _{t=1}^{T} {Z^{(c)}_{t}}} , \end{aligned}$$
(12.37)

where \(Z^{(c)}_{t} \stackrel {\mathrm {.}}{=}\sum_{i=1}^{m} {\tilde{w}^{(c)}_{it}}\) is the normalizing coefficient for each weight vector in UNN. (\(\tilde{w}^{(c)}_{it}\) denotes the weight of example i at iteration (t,c) of UNN, and the tilde notation refers to weights normalized to unity at each step.) It follows that:

$$\begin{aligned} Z^{(c)}_{t} = & 1 - \tilde{w}^{(c)+-}_{jt} \Bigl(1 - 2\sqrt{p^{(c)}_{jt}\bigl(1-p^{(c)}_{jt} \bigr)}\, \Bigr) \\ \leq & \exp \Bigl(-\tilde{w}^{(c)+-}_{jt} \Bigl(1 - 2\sqrt {p^{(c)}_{jt}\bigl(1-p^{(c)}_{jt} \bigr)} \,\Bigr) \Bigr) \\ \leq & \exp \bigl(-\eta \bigl(1 - \sqrt{1 - 4\gamma^2} \,\bigr) \bigr) \leq \exp\bigl(-2\eta\gamma^2\bigr) , \end{aligned}$$

where \(\tilde{w}^{(c)+-}_{jt} \stackrel {\mathrm {.}}{=}\tilde{w}^{(c)+}_{jt} + \tilde{w}^{(c)-}_{jt}\) and \(p^{(c)}_{jt} \stackrel {\mathrm {.}}{=}\tilde{w}^{(c)+}_{jt} / \tilde{w}^{(c)+-}_{jt} = w^{(c)+}_{jt} / w^{(c)+-}_{jt}\). The first inequality uses 1−x≤exp(−x), and the second the (WIA). Since \(Z^{(c)}_{t} \leq 1\) even when the (WIA) does not hold, plugging the last inequality into (12.37) yields the statement of the theorem.
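A quick numerical sanity check of this chain of inequalities, assuming the weak index assumption (WIA) requires \(\tilde{w}^{(c)+-}_{jt} \geq \eta\) and \(|p^{(c)}_{jt} - 1/2| \geq \gamma\) (our reading of it; the exact statement is in the main text):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, gamma = 0.05, 0.1          # illustrative WIA parameters

ok = True
for _ in range(10000):
    a = rng.uniform(eta, 1.0)                                    # plays w~(c)+-_jt >= eta
    p = 0.5 + rng.choice([-1.0, 1.0]) * rng.uniform(gamma, 0.5)  # |p - 1/2| >= gamma
    edge = 1.0 - 2.0 * np.sqrt(p * (1.0 - p))
    z = 1.0 - a * edge                                           # Z_t^(c) in this regime
    b1 = np.exp(-a * edge)
    b2 = np.exp(-eta * (1.0 - np.sqrt(1.0 - 4.0 * gamma ** 2)))
    b3 = np.exp(-2.0 * eta * gamma ** 2)
    ok = ok and (z <= b1 + 1e-12) and (b1 <= b2 + 1e-12) and (b2 <= b3 + 1e-12)
print(ok)                        # True: each upper bound in the chain holds
```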

Proofsketch of Theorem 12.3

We make the iteration t and the class c explicit in the weight notation, so that \(w_{ti}^{(c)}\) denotes the weight of example \(\boldsymbol{x}_i\) prior to iteration t for class c in UNN (inside the “for c” loop of Algorithm 2, letting \(\boldsymbol{w}_0\) denote the initial value of w). To save space in some computations below, we also denote for short:

$$\begin{aligned} \bar {\varepsilon }^{\psi }\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) \stackrel {\mathrm {.}}{=}& \frac{1}{C} \sum_{c=1}^{C} {\varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr)} . \end{aligned}$$
(12.38)

The fact that ψ is ω-strongly smooth is equivalent to \(\tilde{\psi}\) being strongly convex with parameter \(\omega^{-1}\) [22], that is, the function

$$\begin{aligned} \tilde{\psi}(w) - \frac{1}{2\omega}w^2 \end{aligned}$$
(12.39)

is convex. Here, we have made use of the following notations: \(\tilde{\psi}(x) \stackrel {\mathrm {.}}{=}\psi^{\star}(-x)\), where \(\psi^{\star}(x) \stackrel {\mathrm {.}}{=}x\nabla_{\psi}^{-1}(x) - \psi(\nabla^{-1}_{\psi}(x))\) is the Legendre conjugate of ψ. Since a convex function h satisfies \(h(w') \geq h(w) + \nabla_{h}(w)(w'-w)\), applying this inequality with h the function in (12.39) yields, ∀t=1,2,…,T, ∀i=1,2,…,m, ∀c=1,2,…,C:

$$\begin{aligned} D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr) = & D_{\tilde{\psi}} \bigl(w^{(c)}_{ti} + \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti}\bigr) \| w^{(c)}_{ti} \bigr) \\ \geq & \frac{1}{2 \omega} \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti} \bigr)^2 , \end{aligned}$$
(12.40)

where we recall that \(D_{\psi}\) denotes the Bregman divergence with generator ψ (12.22). On the other hand, the Cauchy–Schwarz inequality yields:

$$\begin{aligned} \forall j \in {\mathcal{S}},\quad \sum_{i: j \sim_k i} { \bigl(\mathrm{r}^{(c)}_{ij} \bigr)^2} \sum _{i: j \sim_k i} {\bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti}\bigr)^2} \geq{}& \biggl(\sum_{i: j \sim_k i} {\mathrm{r}^{(c)}_{ij} \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti}\bigr)} \biggr)^2 \\ = {}&\biggl(\sum_{i: j \sim_k i} { \mathrm{r}^{(c)}_{ij}w^{(c)}_{ti}} \biggr)^2 . \end{aligned}$$
(12.41)

The equality in (12.41) holds because \(\sum_{i: j \sim_{k} i} {\mathrm{r}^{(c)}_{ij}w^{(c)}_{(t+1)i}} = 0\), which is exactly (12.30). We obtain:

$$\begin{aligned} \frac{1}{m} \sum_{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)} = & \frac{1}{m} \sum_{i: t \sim_k i} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)} \\ \geq & \frac{1}{2 \omega m} \sum_{i: t \sim_k i} { \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti} \bigr)^2} \end{aligned}$$
(12.42)
$$\begin{aligned} \geq & \frac{1}{2 \omega m} \frac{ (\sum_{i: t \sim_k i} {\mathrm{r}^{(c)}_{it}w^{(c)}_{ti}} )^2}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} \end{aligned}$$
(12.43)
$$\begin{aligned} \geq & \frac{\vartheta^2}{2 \omega m} \times \frac{1}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} \end{aligned}$$
(12.44)

Here, (12.42) follows from (12.40), (12.43) follows from (12.41), and (12.44) follows from (12.20). Adding (12.44) for c=1,2,…,C and t=1,2,…,T, and then dividing by C, we obtain:

$$\begin{aligned} &\frac{1}{C} \sum_{c=1}^{C} {\sum_{t=1}^{T} {\frac{1}{m} \sum _{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}}} \\&\quad \geq \frac{T \vartheta^2}{2\omega m} \times \Biggl(\frac{1}{TC} \times \sum _{c=1}^{C} {\sum _{t=1}^{T}{\frac{1}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} }} \Biggr) . \end{aligned}$$
(12.45)

We now work on the parenthesized factor, which depends solely upon the examples. We have:

$$\begin{aligned} &\Biggl(\frac{1}{TC} \times \sum_{c=1}^{C} {\sum_{t=1}^{T}{\frac{1}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} }} \Biggr)^{-1} \\ &\quad \leq \frac{1}{TC} \sum_{c=1}^{C} {\sum_{t=1}^{T}{\sum _{i: t \sim_k i} { \bigl(\mathrm{r}^{(c)}_{it} \bigr)^2}}} \end{aligned}$$
(12.46)
$$\begin{aligned} & \quad = \frac{1}{TC} \sum_{c=1}^{C} { \sum_{t=1}^{T}{\sum _{i \in {{\mathrm {NN}}_k}(\boldsymbol{x}_t)} {y^2_{tc} y^2_{ic}}}} \\ & \quad \leq \frac{1}{TC} \sum_{c=1}^{C} {\sum_{t=1}^{T}{\sum _{i \in {{\mathrm {NN}}_k}(\boldsymbol{x}_t)} { \biggl(\frac{|y_{tc}|}{2} + \frac{|y_{ic}|}{2} \biggr)}}} \end{aligned}$$
(12.47)
$$\begin{aligned} & \quad = \frac{k}{TC} \sum_{t=1}^{T}{ \sum_{c=1}^{C} {\frac{|y_{tc}|}{2}}} + \frac{1}{TC} \sum_{t=1}^{T}\sum _{i \in {{\mathrm {NN}}_k}(\boldsymbol{x}_t)} \,\sum_{c=1}^{C} {\frac{|y_{ic}|}{2}} \\ & \quad = \frac{k}{(C-1)} . \end{aligned}$$
(12.48)

Here, (12.46) holds because of the arithmetic–geometric–harmonic mean inequality, and (12.47) is Young's inequality (Footnote 7) with p=q=2. Plugging (12.48) into (12.45), we obtain:

$$\begin{aligned} \frac{1}{C} \sum_{c=1}^{C} {\sum _{t=1}^{T} {\frac{1}{m} \sum _{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}}} \geq & \frac{T (C-1) \vartheta^2}{2 \omega mk} . \end{aligned}$$
(12.49)

Now, UNN meets the following property, which can easily be shown to hold with our class encoding as well:

$$ \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{t+1}, {\mathcal{S}}\bigr) - \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_t, {\mathcal{S}}\bigr) = - \frac{1}{m} \sum_{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}. $$
(12.50)

Adding (12.50) for t=0,1,…,T−1 and c=1,2,…,C, we obtain:

$$ \frac{1}{C} \sum_{c=1}^{C} {\varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr)} - \psi(0) = -\frac{1}{C} \sum_{c=1}^{C} { \sum_{t=1}^{T} {\frac{1}{m} \sum _{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}}} . $$
(12.51)

Plugging (12.49) into (12.51), we obtain:

$$ \bar {\varepsilon }^{\psi }\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) \leq \psi(0) - \frac{T (C-1) \vartheta^2}{2 \omega mk}. $$
(12.52)

But the following inequality holds between the average surrogate risk and the empirical risk of the leveraged k-NN rule \(\boldsymbol{h}^{\ell}_{T}\), because of (i):

$$\begin{aligned} \bar {\varepsilon }^{\psi }\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) = & \frac{1}{C} \sum_{c=1}^{C} { \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr)} \\ = & \frac{1}{mC} \sum_{c=1}^{C} { \sum_{i=1}^{m}{\psi \biggl(y_{ic} \sum_{j : j\sim_k i} \alpha_{jc}y_{jc} \biggr)}} \\ \geq & \frac{\psi(0)}{mC} \sum_{c=1}^{C} {\sum_{i=1}^{m}{ \biggl[y_{ic} \sum_{j : j\sim_k i} \alpha_{jc}y_{jc} < 0 \biggr]}} \\ =& \psi(0)\varepsilon ^{0/1}\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}} \bigr), \end{aligned}$$
(12.53)

so that, putting together (12.52) and (12.53) and using the fact that ψ(0)>0 because of (i)–(ii), we obtain, after T rounds of boosting for each class:

$$\begin{aligned} \varepsilon ^{0/1}\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) \leq & 1 - \frac{T (C-1) \vartheta^2}{2 \psi(0) \omega mk} . \end{aligned}$$
(12.54)

It remains to compute the minimal value of T for which the right-hand side of (12.54) becomes no greater than some user-fixed τ∈[0,1], which yields the bound in (12.23).
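For completeness, rearranging (12.54) gives the order of magnitude of this minimal T explicitly (this is a direct consequence of (12.54); the precise constant in (12.23) should be read from the main text):

$$ 1 - \frac{T (C-1)\,\vartheta^{2}}{2\,\psi(0)\,\omega m k} \;\leq\; \tau \quad\Longleftrightarrow\quad T \;\geq\; \frac{2\,\psi(0)\,\omega m k\,(1-\tau)}{(C-1)\,\vartheta^{2}} , $$

so that any \(T \geq 2\psi(0)\omega mk(1-\tau)/((C-1)\vartheta^{2})\) guarantees \(\varepsilon^{0/1}(\boldsymbol{h}^{\ell}_{T},\mathcal{S}) \leq \tau\) under the assumptions of Theorem 12.3.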


Copyright information

© 2013 Springer-Verlag London

About this chapter

Cite this chapter

Piro, P., Nock, R., Bel Haj Ali, W., Nielsen, F., Barlaud, M. (2013). Boosting k-Nearest Neighbors Classification. In: Farinella, G., Battiato, S., Cipolla, R. (eds) Advanced Topics in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5520-1_12


  • DOI: https://doi.org/10.1007/978-1-4471-5520-1_12

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5519-5

  • Online ISBN: 978-1-4471-5520-1

