Boosting k-Nearest Neighbors Classification

  • Chapter

Part of the book series: Advances in Computer Vision and Pattern Recognition ((ACVPR))

Abstract

A major drawback of the k-nearest neighbors (k-NN) rule is its high variance when dealing with sparse prototype datasets in high dimensions. Most techniques proposed for improving k-NN classification rely either on deforming the k-NN relationship by learning a distance function or on modifying the input space by means of subspace selection. Here we propose a novel boosting approach for generalizing the k-NN rule. Namely, we redefine the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. Our algorithm, called UNN (Universal Nearest Neighbors), relies on the k-nearest-neighbor examples as weak classifiers and learns their weights so as to minimize a surrogate risk. These weights, called leveraging coefficients, allow us to distinguish the most relevant prototypes for a given class. Results obtained on several scene categorization datasets demonstrate the ability of UNN to compete with or beat state-of-the-art methods, while achieving comparatively small training and testing times.


Notes

  1. A surrogate is a function that is a suitable upper bound for another function (here, the non-convex, non-differentiable empirical risk).

  2. The implementation by the authors is available at: http://people.csail.mit.edu/torralba/code/spatialenvelope/sceneRecognition.m.

  3. The MAP was computed by averaging classification rates over categories (diagonal of the confusion matrix) and then averaging those values after repeating each experiment 10 times on different folds.

  4. Code available at http://www.vlfeat.org/.

  5. Code available at http://www.irisa.fr/texmex/people/jegou/src.php.

  6. For AdaBoost, we used the code available at http://www.mathworks.com/matlabcentral/fileexchange/22997-multiclass-gentleadaboosting.

  7. We recall Young's inequality: for any Hölder conjugates p, q (p>1, (1/p)+(1/q)=1) and any y, y′≥0, we have \(yy' \leq y^{p}/p + y'^{q}/q\).

References

  1. Amores J, Sebe N, Radeva P (2006) Boosting the distance estimation: application to the k-nearest neighbor classifier. Pattern Recognit Lett 27(3):201–209
  2. Athitsos V, Alon J, Sclaroff S, Kollios G (2008) BoostMap: an embedding method for efficient nearest neighbor retrieval. IEEE Trans Pattern Anal Mach Intell 30(1):89–104
  3. Bartlett P, Traskin M (2007) AdaBoost is consistent. J Mach Learn Res 8:2347–2368
  4. Bartlett P, Jordan M, McAuliffe JD (2006) Convexity, classification, and risk bounds. J Am Stat Assoc 101:138–156
  5. Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771
  6. Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Min Knowl Discov 6:153–172
  7. Cucala L, Marin JM, Robert CP, Titterington DM (2009) A Bayesian reassessment of nearest-neighbor classification. J Am Stat Assoc 104(485):263–273
  8. Dudani S (1976) The distance-weighted k-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6(4):325–327
  9. Escolano Ruiz F, Suau Pérez P, Bonev BI (2009) Information theory in computer vision and pattern recognition. Springer, London
  10. Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 524–531
  11. Fukunaga K, Flick T (1984) An optimal global nearest neighbor metric. IEEE Trans Pattern Anal Mach Intell 6(3):314–318
  12. García-Pedrajas N, Ortiz-Boyer D (2009) Boosting k-nearest neighbor classifier by means of input space projection. Expert Syst Appl 36(7):10570–10582
  13. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proc international conference on very large databases, pp 518–529
  14. Grauman K, Darrell T (2005) The pyramid match kernel: discriminative classification with sets of image features. In: IEEE international conference on computer vision (ICCV), pp 1458–1465
  15. Gupta L, Pathangay V, Patra A, Dyana A, Das S (2007) Indoor versus outdoor scene classification using probabilistic neural network. EURASIP J Appl Signal Process 2007(1):123
  16. Bel Haj Ali W, Piro P, Crescence L, Giampaglia D, Ferhat O, Darcourt J, Pourcher T, Barlaud M (2012) Changes in the subcellular localization of a plasma membrane protein studied by bioinspired UNN learning classification of biologic cell images. In: International conference on computer vision theory and applications (VISAPP)
  17. Hart PE (1968) The condensed nearest neighbor rule. IEEE Trans Inf Theory 14:515–516
  18. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18(6):607–616
  19. Holmes CC, Adams NM (2003) Likelihood inference in nearest-neighbour classification models. Biometrika 90:99–112
  20. Hsu CW, Chang CC, Lin CJ (2003) A practical guide to support vector classification. Technical report
  21. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
  22. Kakade S, Shalev-Shwartz S, Tewari A (2009) Applications of strong convexity–strong smoothness duality to learning with matrices. Technical report
  23. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 2169–2178
  24. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
  25. Masip D, Vitrià J (2006) Boosted discriminant projections for nearest neighbor classification. Pattern Recognit 39(2):164–170
  26. Nguyen X, Wainwright MJ, Jordan MI (2009) On surrogate loss functions and f-divergences. Ann Stat 37:876–904
  27. Nock R, Nielsen F (2009) Bregman divergences and surrogates for learning. IEEE Trans Pattern Anal Mach Intell 31(11):2048–2059
  28. Nock R, Nielsen F (2009) On the efficient minimization of classification calibrated surrogates. In: Advances in neural information processing systems (NIPS), vol 21, pp 1201–1208
  29. Nock R, Sebban M (2001) An improved bound on the finite-sample risk of the nearest neighbor rule. Pattern Recognit Lett 22(3/4):407–412
  30. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
  31. Paredes R (2006) Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Trans Pattern Anal Mach Intell 28(7):1100–1110
  32. Payne A, Singh S (2005) Indoor vs. outdoor scene classification in digital photographs. Pattern Recognit 38(10):1533–1545
  33. Piro P, Nock R, Nielsen F, Barlaud M (2012) Leveraging k-NN for generic classification boosting. Neurocomputing 80:3–9
  34. Quattoni A, Torralba A (2009) Recognizing indoor scenes. In: IEEE computer society conference on computer vision and pattern recognition (CVPR)
  35. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn J 37:297–336
  36. Serrano N, Savakis AE, Luo JB (2004) Improved scene classification using efficient low-level features and semantic cues. Pattern Recognit 37:1773–1784
  37. Shakhnarovich G, Darrell T, Indyk P (2006) Nearest-neighbor methods in learning and vision. MIT Press, Cambridge
  38. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: IEEE international conference on computer vision (ICCV), vol 2, pp 1470–1477
  39. Swain MJ, Ballard DH (1991) Color indexing. Int J Comput Vis 7:11–32
  40. Torralba A, Murphy K, Freeman W, Rubin M (2003) Context-based vision system for place and object recognition. In: IEEE international conference on computer vision (ICCV), pp 273–280
  41. Vedaldi A, Fulkerson B (2008) VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org
  42. Vogel J, Schiele B (2007) Semantic modeling of natural scenes for content-based image retrieval. Int J Comput Vis 72(2):133–157
  43. Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) SUN database: large-scale scene recognition from abbey to zoo. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 3485–3492
  44. Yu K, Ji L, Zhang X (2002) Kernel nearest-neighbor algorithm. Neural Process Lett 15(2):147–156
  45. Yuan M, Wegkamp M (2010) Classification methods with reject option based on convex risk minimization. J Mach Learn Res 11:111–130
  46. Zhang ML, Zhou ZH (2007) ML-kNN: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048
  47. Zhang H, Berg AC, Maire M, Malik J (2006) SVM-kNN: discriminative nearest neighbor classification for visual category recognition. In: IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 2126–2136
  48. Zhu J, Rosset S, Zou H, Hastie T (2009) Multi-class AdaBoost. Stat Interface 2:349–360
  49. Zuo W, Zhang D, Wang K (2008) On kernel difference-weighted k-nearest neighbor classification. Pattern Anal Appl 11(3–4):247–257


Author information

Corresponding author

Correspondence to Paolo Piro.


Appendix

Generic UNN Algorithm

The general version of UNN is shown in Algorithm 2. This algorithm induces the leveraged k-NN rule (12.10) for the broad class of surrogate losses meeting the conditions of [4], thus generalizing Algorithm 1. Namely, we constrain ψ to meet the following conditions: (i) \(\mathrm{im}(\psi) = {\mathbb{R}}_{+}\), (ii) \(\nabla_{\psi}(0) < 0\) (where \(\nabla_{\psi}\) denotes the conventional derivative of the loss function ψ), and (iii) ψ is strictly convex and differentiable. Conditions (i) and (ii) imply that ψ is classification-calibrated: its local minimization is roughly tied to that of the empirical risk [4]. Condition (iii) implies convenient algorithmic properties for the minimization of the surrogate risk [28]. Three common examples have been shown in Eqs. (12.6)–(12.7).
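As an illustrative, non-authoritative check of conditions (i)–(iii), the short sketch below probes them numerically for the three surrogates commonly used in this framework, with the standard parameterizations \(\psi_{\exp}(x)=e^{-x}\), \(\psi_{\log}(x)=\log(1+e^{-x})\), \(\psi_{\mathrm{squ}}(x)=(1-x)^2\); the exact forms given in Sect. 12.2.2 may differ slightly, so these are assumptions.

```python
import numpy as np

# Assumed standard forms of the three surrogates (the exact parameterizations
# of Sect. 12.2.2 may differ slightly).
SURROGATES = {
    "exp": lambda x: np.exp(-x),
    "log": lambda x: np.log1p(np.exp(-x)),
    "squ": lambda x: (1.0 - x) ** 2,
}

def check_conditions(name, psi, eps=1e-5):
    """Numerically probe (i) im(psi) in R+, (ii) psi'(0) < 0, (iii) convexity."""
    xs = np.linspace(-10.0, 10.0, 2001)
    vals = psi(xs)
    grad0 = (psi(eps) - psi(-eps)) / (2.0 * eps)                 # derivative at 0
    second_diff = psi(xs - eps) - 2.0 * psi(xs) + psi(xs + eps)  # convexity probe
    print(f"{name}: im >= 0: {bool(np.all(vals >= 0))}, "
          f"psi'(0) = {grad0:.3f} (< 0: {bool(grad0 < 0)}), "
          f"convex: {bool(np.all(second_diff >= -1e-9))}")

for name, psi in SURROGATES.items():
    check_conditions(name, psi)
```

All three surrogates pass the numerical checks, which is consistent with their use as classification-calibrated losses in the text.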

Algorithm 2

Universal Nearest Neighbors UNN(\(\mathcal{S},\psi\))

The main bottleneck of UNN is step [I.1], as Eq. (12.30) is non-linear; it always has a solution, however, which is finite under mild assumptions [28]: in our case, \(\delta_j\) is guaranteed to be finite whenever example j's class memberships neither totally match nor totally mismatch those of its reciprocal neighbors, for the class at hand. The second column of Table 12.5 contains the solutions to (12.30) for the surrogate losses mentioned in Sect. 12.2.2. Those solutions are always exact for the exponential loss (\(\psi_{\exp}\)) and the squared loss (\(\psi_{\mathrm{squ}}\)); for the logistic loss (\(\psi_{\log}\)), the solution is exact when the weights in the reciprocal neighborhood of j are all equal, and approximate otherwise. Since the starting weights are all equal, exactness can be guaranteed for a large number of inner rounds, depending on the order in which the examples are chosen. Table 12.5 helps to formalize the finiteness condition on \(\delta_j\) mentioned above: when either sum of weights in (12.29) is zero, the solutions in the first and third lines of Table 12.5 are not finite. A simple strategy to cope with the numerical problems arising from such situations is the one proposed by [35] (see Sect. 12.2.4). Table 12.5 also shows how the weight update rule (12.31) specializes for the mentioned losses.
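To make step [I.1] concrete, here is a minimal sketch for the exponential loss, assuming the closed form takes the familiar AdaBoost-like shape \(\delta_j = \frac{1}{2}\log(W^{+}_{j}/W^{-}_{j})\), where \(W^{\pm}_{j}\) sum the weights of j's reciprocal neighbors whose class-c membership respectively matches or mismatches that of j, together with the ε-smoothing of [35] mentioned above. The authoritative expressions are those of Table 12.5; the forms below are assumptions written for a ±1-valued agreement encoding.

```python
import numpy as np

def leverage_exp(w, r_col, neighbors, eps=1e-8):
    """One step [I.1]+[I.3] for the exponential loss (assumed closed form).

    w         : current positive weights of all m examples
    r_col     : column j of R^(c), assumed here to take values in {-1, 0, +1}
                (the chapter's class encoding may rescale these entries)
    neighbors : indices i such that j ~_k i (reciprocal neighborhood of j)
    """
    nb = np.asarray(neighbors)
    w_plus = float(np.sum(w[nb] * (r_col[nb] > 0)))
    w_minus = float(np.sum(w[nb] * (r_col[nb] < 0)))
    # epsilon-smoothing in the spirit of Schapire & Singer [35]: keeps delta_j finite
    delta_j = 0.5 * np.log((w_plus + eps) / (w_minus + eps))
    w_new = w.copy()
    # exponential-loss instance of the weight update (12.31), as assumed here
    w_new[nb] = w_new[nb] * np.exp(-delta_j * r_col[nb])
    return delta_j, w_new

# Tiny usage example with made-up weights and agreement labels:
w0 = np.ones(6)
r_j = np.array([+1.0, -1.0, +1.0, 0.0, +1.0, -1.0])
delta, w1 = leverage_exp(w0, r_j, neighbors=[0, 1, 2, 4, 5])
print(delta, w1)
```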

Table 12.5 Three common loss functions and the corresponding solutions \(\delta_j\) of (12.30) and \(w_i\) of (12.31). (Vector \(\boldsymbol{r}^{(c)}_{j}\) designates column j of \(\mathrm{R}^{(c)}\), and \(\|\cdot\|_1\) is the \(L_1\) norm.) The rightmost column says whether the formula is (A)lways the exact solution, or whether it is exact only when the weights of the reciprocal neighbors of j are the (S)ame

Proofsketch of Theorem 12.1

We show that UNN converges to the global optimum of any surrogate risk (Sect. 12.2.5). For this purpose, let us consider the surrogate risk (12.5) for a given class c=1,2,…,C:

$$ \varepsilon ^{\psi }_c(\boldsymbol{h}, {\mathcal{S}}) \stackrel {\mathrm {.}}{=}\frac{1}{m} \sum_{i=1}^{m}{\psi\bigl(\varrho(\boldsymbol{h},i,c) \bigr)} . $$
(12.32)
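For concreteness, (12.32) can be evaluated as follows, assuming (as in (12.53) below) that the edge is \(\varrho(\boldsymbol{h}^{\ell},i,c) = y_{ic}\sum_{j: j\sim_k i}\alpha_{jc}y_{jc}\); the data layout and encoding below are placeholders, not the chapter's implementation.

```python
import numpy as np

def surrogate_risk(y, alpha, neighbors, c, psi):
    """Empirical surrogate risk (12.32) for class c of a leveraged k-NN rule.

    y         : (m, C) array of class memberships y_ic
    alpha     : (m, C) array of leveraging coefficients alpha_jc
    neighbors : list of index arrays, neighbors[i] = {j : j ~_k i}
    psi       : surrogate loss, e.g. lambda x: np.exp(-x)
    """
    m = y.shape[0]
    edges = np.array([y[i, c] * np.sum(alpha[neighbors[i], c] * y[neighbors[i], c])
                      for i in range(m)])   # rho(h, i, c) under the assumption above
    return float(np.mean(psi(edges)))
```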

In this section, we use the following notations:

  • \(\tilde{\psi}(x) \stackrel {\mathrm {.}}{=}\psi^{\star}(-x)\), where \(\psi^{\star}(x) \stackrel {\mathrm {.}}{=}x\nabla_{\psi}^{-1}(x) - \psi(\nabla^{-1}_{\psi}(x))\) is the Legendre conjugate of ψ, which is strictly convex and differentiable as well. (\(\tilde{\psi}\) is related to ψ in such a way that: \(\nabla_{\tilde{\psi}}(x) = - \nabla^{-1}_{\psi}(-x)\).)

  • \(D_{\tilde{\psi}}(w_{i} \| w'_{i}) \stackrel {\mathrm {.}}{=}{\tilde{\psi}}(w_{i}) - {\tilde{\psi}}(w'_{i}) - (w_{i} - w'_{i})\nabla_{\tilde{\psi}}(w'_{i})\) is the Bregman divergence with generator \({\tilde{\psi}}\) [28].
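As a worked instantiation of these notations for the exponential surrogate (assuming ψ(x)=exp(−x), which may differ from the chapter's exact parameterization):

$$ \nabla_{\psi}(x) = -e^{-x}, \qquad \nabla_{\psi}^{-1}(y) = -\log(-y), \qquad \psi^{\star}(y) = y\nabla_{\psi}^{-1}(y) - \psi\bigl(\nabla_{\psi}^{-1}(y)\bigr) = -y\log(-y) + y , $$

so that \(\tilde{\psi}(x) = \psi^{\star}(-x) = x\log x - x\), \(\nabla_{\tilde{\psi}}(x) = \log x = -\nabla_{\psi}^{-1}(-x)\) as stated above, and \(D_{\tilde{\psi}}(w \| w') = w\log(w/w') - w + w'\), the (unnormalized) Kullback–Leibler divergence between weights.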

Let \(\boldsymbol{w}_t\) denote the tth weight vector inside the “for c” loop of Algorithm 2 (with \(\boldsymbol{w}_0\) denoting the initialization of w); similarly, \(\boldsymbol{h}^{\ell}_{t}\) denotes the tth leveraged k-NN rule obtained after the update in [I.3]. The following fundamental identity holds, whose proof follows from [28]:

$$\begin{aligned} \psi\bigl(\varrho\bigl(\boldsymbol{h}_t^\ell,i,c\bigr)\bigr) = & g + D_{\tilde{\psi}} (0 \| w_{ti} ), \end{aligned}$$
(12.33)

where \(g(m) \stackrel {\mathrm {.}}{=}-\tilde{\psi}(0)\) does not depend on the k-NN rule. In particular, Eq. (12.33) makes the connection between the real-valued classification problem and a geometric problem in the non-metric space of weights. Moreover, Eq. (12.33) comes in handy when computing the difference between two successive surrogate risks: \(\varepsilon ^{\psi }_{c}(\boldsymbol{h}^{\ell}_{t+1}, {\mathcal{S}}) - \varepsilon ^{\psi }_{c}(\boldsymbol{h}^{\ell}_{t}, {\mathcal{S}})\). Indeed, plugging Eq. (12.33) into Eq. (12.32), and computing \(\delta_j\) in Eq. (12.30) so as to obtain \(\boldsymbol{h}^{\ell}_{t+1}\) from \(\boldsymbol{h}^{\ell}_{t}\), we obtain the following identity:

$$ \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{t+1}, {\mathcal{S}}\bigr) - \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_t, {\mathcal{S}}\bigr) = - \frac{1}{m} \sum_{i=1}^{m} {D_{\tilde{\psi}} (w_{(t+1)i} \| w_{ti} )}. $$
(12.34)

Since Bregman divergences are nonnegative and satisfy the identity of indiscernibles, (12.34) implies that steps [I.1]–[I.3] guarantee the decrease of (12.32) as long as \(\delta_j \neq 0\). But (12.32) is bounded from below, hence UNN must converge.
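For intuition, identity (12.33) can be verified numerically in the exponential-loss case, under the assumptions that ψ(x)=exp(−x), that the induced generator is \(\tilde{\psi}(w)=w\log w - w\), and that the weight attached to an edge value ϱ is w=exp(−ϱ); then \(g=-\tilde{\psi}(0)=0\) and \(D_{\tilde{\psi}}(0\|w)=w=\psi(\varrho)\).

```python
import numpy as np

def psi(x):                      # exponential surrogate (assumed form)
    return np.exp(-x)

def psi_tilde(w):                # w*log(w) - w, with the convention 0*log(0) = 0
    w = np.asarray(w, dtype=float)
    safe = np.where(w > 0, w, 1.0)
    return np.where(w > 0, w * np.log(safe) - w, 0.0)

def bregman(a, b):               # D_{psi_tilde}(a || b)
    return psi_tilde(a) - psi_tilde(b) - (a - b) * np.log(b)

rho = np.linspace(-3.0, 3.0, 7)  # edge values rho(h, i, c)
w = np.exp(-rho)                 # weights attached to those edges (assumption)
g = -psi_tilde(0.0)              # equals 0 for this loss
lhs = psi(rho)
rhs = g + bregman(np.zeros_like(w), w)
print(np.allclose(lhs, rhs))     # True: (12.33) holds pointwise here
```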

In addition, it converges to the global optimum of the risk (12.5). Since predictions for each class are independent, the proof consists in showing that (12.32) converges to its global minimum for each c. Let us assume this convergence for the current class c. Then, following the reasoning of Nock and Nielsen [28], (12.30) and (12.31) imply that, when any possible \(\delta_j = 0\), the weight vector, say \(\boldsymbol{w}_{\infty}\), satisfies \(\mathrm{R}^{(c)}\boldsymbol{w}_{\infty} = \boldsymbol{0}\), that is, \(\boldsymbol{w}_{\infty} \in \ker \mathrm{R}^{(c)}\), and \(\boldsymbol{w}_{\infty}\) is unique. But the kernel of \(\mathrm{R}^{(c)}\) and \(\overline{\mathbb{W}}\), the closure of \({\mathbb{W}}\) (i.e., the manifold where the w's live), are provably Bregman orthogonal [28], thus yielding:

$$ \underbrace{\sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_i )}}_{m\varepsilon ^{\psi }_c(\boldsymbol{h}^\ell, {\mathcal{S}}) - mg} = \underbrace{ \sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_{\infty i} )}}_{m\varepsilon ^{\psi }_c(\boldsymbol{h}_\infty^\ell, {\mathcal{S}}) - mg} + \underbrace{\sum_{i=1}^{m} {D_{\tilde{\psi}} (w_{\infty i} \| w_i )}}_{\geq 0},\quad \forall \boldsymbol{w} \in \overline{\mathbb{W}}. $$
(12.35)

The underbraces follow from using (12.33) in (12.32), and \(\boldsymbol{h}^{\ell}\) is the leveraged k-NN rule corresponding to w. One obtains that \(\boldsymbol{h}^{\ell}_{\infty}\) achieves the global minimum of (12.32), as claimed.

The proofsketch is graphically summarized in Fig. 12.11. In particular, two crucial Bregman orthogonalities are mentioned [28]. The red one symbolizes:

$$ \sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_{ti} )} = \sum_{i=1}^{m} {D_{\tilde{\psi}} (0 \| w_{(t+1)i} )} + \sum_{i=1}^{m} {D_{\tilde{\psi}} (w_{(t+1)i} \| w_{ti} )} , $$
(12.36)

which is equivalent to (12.34). The black one on \(\boldsymbol{w}_{\infty}\) is (12.35).
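The “red” orthogonality (12.36) can likewise be checked numerically under the same exponential-loss assumptions: one update of the form \(w_{(t+1)i} = w_{ti}\exp(-\delta r_i)\), with δ chosen so that \(\sum_i r_i w_{(t+1)i} = 0\) (a scalar root-finding rendition of (12.30) for that loss), makes the three-point identity hold exactly. This is a sketch, not the chapter's implementation.

```python
import numpy as np
from scipy.optimize import brentq   # any scalar root finder would do

def psi_tilde(w):                    # generator for the exponential loss (assumed)
    return w * np.log(w) - w

def bregman(a, b):                   # D_{psi_tilde}(a || b), with 0*log(0) = 0
    a = np.asarray(a, dtype=float)
    a_term = np.where(a > 0, a * np.log(np.where(a > 0, a, 1.0)) - a, 0.0)
    return a_term - psi_tilde(b) - (a - b) * np.log(b)

w_t = np.array([0.7, 1.3, 0.9, 2.0, 0.5, 1.1, 1.6, 0.8])            # current weights
r = np.array([+1.0, +1.0, -1.0, +1.0, -1.0, -1.0, +1.0, -1.0])      # agreement labels

# delta solves sum_i r_i w_ti exp(-delta r_i) = 0, i.e. (12.30) for this loss
delta = brentq(lambda d: np.sum(r * w_t * np.exp(-d * r)), -50.0, 50.0)
w_next = w_t * np.exp(-delta * r)            # update, exp-loss form of (12.31)

zero = np.zeros_like(w_t)
lhs = bregman(zero, w_t).sum()
rhs = bregman(zero, w_next).sum() + bregman(w_next, w_t).sum()
print(np.isclose(lhs, rhs))                  # True: the identity (12.36) holds
```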

Fig. 12.11

A geometric view of how UNN converges to the global optimum of (12.5). (See Appendix for details and notations.)

Proofsketch of Theorem 12.2

Using developments analogous to those of [28], UNN can be shown to be equivalent to AdaBoost with m available weak classifiers, each one being an example. Each weak classifier returns a value in {−1,0,1}, where 0 is reserved for examples outside the reciprocal neighborhood. Theorem 3 of [35] gives, in our case:

$$\begin{aligned} \varepsilon ^{0/1}\bigl(\boldsymbol{h}^\ell, {\mathcal{S}}\bigr) \leq & \frac{1}{C} \sum_{c=1}^{C}{\prod _{t=1}^{T} {Z^{(c)}_{t}}} , \end{aligned}$$
(12.37)

where \(Z^{(c)}_{t} \stackrel {\mathrm {.}}{=}\sum_{i=1}^{m} {\tilde{w}^{(c)}_{it}}\) is the normalizing coefficient for each weight vector in UNN. (\(\tilde{w}^{(c)}_{it}\) denotes the weight of example i at iteration (t,c) of UNN, and the tilde notation refers to weights normalized to unity at each step.) It follows that:

$$\begin{aligned} Z^{(c)}_{t} = & 1 - \tilde{w}^{(c)+-}_{jt} \Bigl(1 - 2\sqrt{p^{(c)}_{jt}\bigl(1-p^{(c)}_{jt} \bigr)}\, \Bigr) \\ \leq & \exp \Bigl(-\tilde{w}^{(c)+-}_{jt} \Bigl(1 - 2\sqrt {p^{(c)}_{jt}\bigl(1-p^{(c)}_{jt} \bigr)} \,\Bigr) \Bigr) \\ \leq & \exp \bigl(-\eta \bigl(1 - \sqrt{1 - 4\gamma^2} \,\bigr) \bigr) \leq \exp\bigl(-2\eta\gamma^2\bigr) , \end{aligned}$$

where \(\tilde{w}^{(c)+-}_{jt} \stackrel {\mathrm {.}}{=}\tilde{w}^{(c)+}_{jt} + \tilde{w}^{(c)-}_{jt}\) and \(p^{(c)}_{jt} \stackrel {\mathrm {.}}{=}\tilde{w}^{(c)+}_{jt} / \tilde{w}^{(c)+-}_{jt} = w^{(c)+}_{jt} / w^{(c)+-}_{jt}\). The first inequality uses 1−x≤exp(−x), and the second the (WIA). Since \(Z^{(c)}_{t} \leq 1\) even when the (WIA) does not hold, plugging the last inequality into (12.37) yields the statement of the theorem.
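A quick numerical sanity check of this chain of inequalities, assuming the weak index assumption (WIA) requires \(\tilde{w}^{(c)+-}_{jt} \geq \eta\) and \(|p^{(c)}_{jt} - 1/2| \geq \gamma\) (our reading of it; the exact statement is in the main text):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, gamma = 0.05, 0.1          # illustrative WIA parameters

ok = True
for _ in range(10000):
    a = rng.uniform(eta, 1.0)                                    # plays w~(c)+-_jt >= eta
    p = 0.5 + rng.choice([-1.0, 1.0]) * rng.uniform(gamma, 0.5)  # |p - 1/2| >= gamma
    edge = 1.0 - 2.0 * np.sqrt(p * (1.0 - p))
    z = 1.0 - a * edge                                           # Z_t^(c) in this regime
    b1 = np.exp(-a * edge)
    b2 = np.exp(-eta * (1.0 - np.sqrt(1.0 - 4.0 * gamma ** 2)))
    b3 = np.exp(-2.0 * eta * gamma ** 2)
    ok = ok and (z <= b1 + 1e-12) and (b1 <= b2 + 1e-12) and (b2 <= b3 + 1e-12)
print(ok)                        # True: each upper bound in the chain holds
```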

Proofsketch of Theorem 12.3

We make the iteration t and the class c explicit in the weight notation, so that \(w_{ti}^{(c)}\) denotes the weight of example \(\boldsymbol{x}_i\) prior to iteration t for class c in UNN (inside the “for c” loop of Algorithm 2, letting \(\boldsymbol{w}_0\) denote the initial value of w). To save space in some computations below, we also denote for short:

$$\begin{aligned} \bar {\varepsilon }^{\psi }\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) \stackrel {\mathrm {.}}{=}& \frac{1}{C} \sum_{c=1}^{C} {\varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr)} . \end{aligned}$$
(12.38)

The fact that ψ is ω-strongly smooth is equivalent to \(\tilde{\psi}\) being strongly convex with parameter \(\omega^{-1}\) [22], that is, the function

$$\begin{aligned} \tilde{\psi}(w) - \frac{1}{2\omega}w^2 \end{aligned}$$
(12.39)

is convex. Here, we have made use of the following notations: \(\tilde{\psi}(x) \stackrel {\mathrm {.}}{=}\psi^{\star}(-x)\), where \(\psi^{\star}(x) \stackrel {\mathrm {.}}{=}x\nabla_{\psi}^{-1}(x) - \psi(\nabla^{-1}_{\psi}(x))\) is the Legendre conjugate of ψ. Since a convex function h satisfies \(h(w') \geq h(w) + \nabla_{h}(w)(w'-w)\), applying this inequality with h the function in (12.39) yields, ∀t=1,2,…,T, ∀i=1,2,…,m, ∀c=1,2,…,C:

$$\begin{aligned} D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr) = & D_{\tilde{\psi}} \bigl(w^{(c)}_{ti} + \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti}\bigr) \| w^{(c)}_{ti} \bigr) \\ \geq & \frac{1}{2 \omega} \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti} \bigr)^2 , \end{aligned}$$
(12.40)

where we recall that \(D_{\psi}\) denotes the Bregman divergence with generator ψ (12.22). On the other hand, the Cauchy–Schwarz inequality yields:

$$\begin{aligned} \forall j \in {\mathcal{S}},\quad \sum_{i: j \sim_k i} { \bigl(\mathrm{r}^{(c)}_{ij} \bigr)^2} \sum _{i: j \sim_k i} {\bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti}\bigr)^2} \geq{}& \biggl(\sum_{i: j \sim_k i} {\mathrm{r}^{(c)}_{ij} \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti}\bigr)} \biggr)^2 \\ = {}&\biggl(\sum_{i: j \sim_k i} { \mathrm{r}^{(c)}_{ij}w^{(c)}_{ti}} \biggr)^2 . \end{aligned}$$
(12.41)

The equality in (12.41) holds because \(\sum_{i: j \sim_{k} i} {\mathrm{r}^{(c)}_{ij}w^{(c)}_{(t+1)i}} = 0\), which is exactly (12.30). We obtain:

$$\begin{aligned} \frac{1}{m} \sum_{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)} = & \frac{1}{m} \sum_{i: t \sim_k i} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)} \\ \geq & \frac{1}{2 \omega m} \sum_{i: t \sim_k i} { \bigl(w^{(c)}_{(t+1)i} - w^{(c)}_{ti} \bigr)^2} \end{aligned}$$
(12.42)
$$\begin{aligned} \geq & \frac{1}{2 \omega m} \frac{ (\sum_{i: t \sim_k i} {\mathrm{r}^{(c)}_{it}w^{(c)}_{ti}} )^2}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} \end{aligned}$$
(12.43)
$$\begin{aligned} \geq & \frac{\vartheta^2}{2 \omega m} \times \frac{1}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} \end{aligned}$$
(12.44)

Here, (12.42) follows from (12.40), (12.43) follows from (12.41), and (12.44) follows from (12.20). Adding (12.44) for c=1,2,…,C and t=1,2,…,T, and then dividing by C, we obtain:

$$\begin{aligned} &\frac{1}{C} \sum_{c=1}^{C} {\sum_{t=1}^{T} {\frac{1}{m} \sum _{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}}} \\&\quad \geq \frac{T \vartheta^2}{2\omega m} \times \Biggl(\frac{1}{TC} \times \sum _{c=1}^{C} {\sum _{t=1}^{T}{\frac{1}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} }} \Biggr) . \end{aligned}$$
(12.45)

We now work on the parenthesized factor, which depends solely upon the examples. We have:

$$\begin{aligned} &\Biggl(\frac{1}{TC} \times \sum_{c=1}^{C} {\sum_{t=1}^{T}{\frac{1}{\sum_{i: t \sim_k i} { (\mathrm{r}^{(c)}_{it} )^2}} }} \Biggr)^{-1} \\ &\quad \leq \frac{1}{TC} \sum_{c=1}^{C} {\sum_{t=1}^{T}{\sum _{i: t \sim_k i} { \bigl(\mathrm{r}^{(c)}_{it} \bigr)^2}}} \end{aligned}$$
(12.46)
$$\begin{aligned} & \quad = \frac{1}{TC} \sum_{c=1}^{C} { \sum_{t=1}^{T}{\sum _{i \in {{\mathrm {NN}}_k}(\boldsymbol{x}_t)} {y^2_{tc} y^2_{ic}}}} \\ & \quad \leq \frac{1}{TC} \sum_{c=1}^{C} {\sum_{t=1}^{T}{\sum _{i \in {{\mathrm {NN}}_k}(\boldsymbol{x}_t)} { \biggl(\frac{|y_{tc}|}{2} + \frac{|y_{ic}|}{2} \biggr)}}} \end{aligned}$$
(12.47)
$$\begin{aligned} & \quad = \frac{k}{TC} \sum_{t=1}^{T}{ \sum_{c=1}^{C} {\frac{|y_{tc}|}{2}}} + \frac{1}{TC} \sum_{t=1}^{T}\sum _{i \in {{\mathrm {NN}}_k}(\boldsymbol{x}_t)} \,\sum_{c=1}^{C} {\frac{|y_{ic}|}{2}} \\ & \quad = \frac{k}{(C-1)} . \end{aligned}$$
(12.48)

Here, (12.46) holds because of the arithmetic–geometric–harmonic mean inequality, and (12.47) is Young's inequality (Footnote 7) with p=q=2. Plugging (12.48) into (12.45), we obtain:

$$\begin{aligned} \frac{1}{C} \sum_{c=1}^{C} {\sum _{t=1}^{T} {\frac{1}{m} \sum _{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}}} \geq & \frac{T (C-1) \vartheta^2}{2 \omega mk} . \end{aligned}$$
(12.49)

Now, UNN meets the following property, which can easily be shown to hold with our class encoding as well:

$$ \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{t+1}, {\mathcal{S}}\bigr) - \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_t, {\mathcal{S}}\bigr) = - \frac{1}{m} \sum_{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}. $$
(12.50)

Adding (12.50) for t=0,1,…,T−1 and c=1,2,…,C, we obtain:

$$ \frac{1}{C} \sum_{c=1}^{C} {\varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr)} - \psi(0) = -\frac{1}{C} \sum_{c=1}^{C} { \sum_{t=1}^{T} {\frac{1}{m} \sum _{i=1}^{m} {D_{\tilde{\psi}} \bigl(w^{(c)}_{(t+1)i} \| w^{(c)}_{ti} \bigr)}}} . $$
(12.51)

Plugging (12.49) into (12.51), we obtain:

$$ \bar {\varepsilon }^{\psi }\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) \leq \psi(0) - \frac{T (C-1) \vartheta^2}{2 \omega mk}. $$
(12.52)

But the following inequality holds between the average surrogate risk and the empirical risk of the leveraged k-NN rule \(\boldsymbol{h}^{\ell}_{T}\), because of (i):

$$\begin{aligned} \bar {\varepsilon }^{\psi }\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) = & \frac{1}{C} \sum_{c=1}^{C} { \varepsilon ^{\psi }_c\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr)} \\ = & \frac{1}{mC} \sum_{c=1}^{C} { \sum_{i=1}^{m}{\psi \biggl(y_{ic} \sum_{j : j\sim_k i} \alpha_{jc}y_{jc} \biggr)}} \\ \geq & \frac{\psi(0)}{mC} \sum_{c=1}^{C} {\sum_{i=1}^{m}{ \biggl[y_{ic} \sum_{j : j\sim_k i} \alpha_{jc}y_{jc} < 0 \biggr]}} \\ =& \psi(0)\varepsilon ^{0/1}\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}} \bigr), \end{aligned}$$
(12.53)

so that, putting together (12.52) and (12.53) and using the fact that ψ(0)>0 because of (i)–(ii), we obtain, after T rounds of boosting for each class:

$$\begin{aligned} \varepsilon ^{0/1}\bigl(\boldsymbol{h}^\ell_{T}, {\mathcal{S}}\bigr) \leq & 1 - \frac{T (C-1) \vartheta^2}{2 \psi(0) \omega mk} . \end{aligned}$$
(12.54)

It remains to compute the minimal value of T for which the right-hand side of (12.54) becomes no greater than some user-fixed τ∈[0,1], which yields the bound in (12.23).
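For completeness, rearranging (12.54) gives the order of magnitude of this minimal T explicitly (this is a direct consequence of (12.54); the precise constant in (12.23) should be read from the main text):

$$ 1 - \frac{T (C-1)\,\vartheta^{2}}{2\,\psi(0)\,\omega m k} \;\leq\; \tau \quad\Longleftrightarrow\quad T \;\geq\; \frac{2\,\psi(0)\,\omega m k\,(1-\tau)}{(C-1)\,\vartheta^{2}} , $$

so that any \(T \geq 2\psi(0)\omega mk(1-\tau)/((C-1)\vartheta^{2})\) guarantees \(\varepsilon^{0/1}(\boldsymbol{h}^{\ell}_{T},\mathcal{S}) \leq \tau\) under the assumptions of Theorem 12.3.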


Copyright information

© 2013 Springer-Verlag London

About this chapter

Cite this chapter

Piro, P., Nock, R., Bel Haj Ali, W., Nielsen, F., Barlaud, M. (2013). Boosting k-Nearest Neighbors Classification. In: Farinella, G., Battiato, S., Cipolla, R. (eds) Advanced Topics in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5520-1_12


  • DOI: https://doi.org/10.1007/978-1-4471-5520-1_12

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5519-5

  • Online ISBN: 978-1-4471-5520-1

