Improved nearest neighbor classifiers by weighting and selection of predictors

  • Published in: Statistics and Computing

Abstract

Nearest neighborhood classification is a flexible classification method that works under weak assumptions. The basic concept is to use weighted or unweighted sums over class indicators of observations in the neighborhood of the target value. Two modifications that improve performance are considered here. First, instead of using weights that are determined solely by the distances, we estimate the weights with a logit model; a selection procedure such as the lasso or boosting then automatically selects the relevant nearest neighbors. Second, building on this concept of estimation and selection, we extend the predictor space: we include nearest neighborhood counts, the original predictors themselves, and nearest neighborhood counts that use distances in sub dimensions of the predictor space. The resulting classifiers combine the strength of nearest neighbor methods with parametric approaches and, by use of sub dimensions, are able to select the relevant features. Simulations and real data sets demonstrate that the method yields lower misclassification rates than currently available nearest neighborhood methods and is a strong and flexible competitor in classification problems.
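The estimation-and-selection idea described in the abstract can be sketched as follows: build "nearest neighborhood count" features (for several neighborhood sizes, the fraction of a point's nearest training neighbors that belong to class 1), append the original predictors, and let an L1-penalized (lasso-type) logit model select among them. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: the toy data, the neighborhood sizes `ks`, and the helper `neighborhood_count_features` are choices made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy two-class data: two Gaussian clouds in 2D.
n = 100
X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def neighborhood_count_features(X_train, y_train, X_query, ks=(1, 3, 5, 11), loo=False):
    """For each query point and each k in ks, the fraction of its k nearest
    training neighbors belonging to class 1 (a 'neighborhood count')."""
    nn = NearestNeighbors(n_neighbors=max(ks) + int(loo)).fit(X_train)
    _, idx = nn.kneighbors(X_query)
    if loo:  # drop each training point itself when querying the training data
        idx = idx[:, 1:]
    return np.column_stack([y_train[idx[:, :k]].mean(axis=1) for k in ks])

# Extended predictor space: neighborhood counts plus the raw predictors.
F = np.hstack([neighborhood_count_features(X, y, X, loo=True), X])

# A lasso-type (L1-penalized) logit model estimates the weights and
# selects relevant features, instead of fixing distance-based weights.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(F, y)
acc = clf.score(F, y)
```

The L1 penalty shrinks the coefficients of uninformative neighborhood sizes or predictors to exactly zero, which is the "automatic selection of relevant nearest neighbors" the abstract refers to; the paper additionally builds counts from distances in sub dimensions of the predictor space, which this sketch omits.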



Author information


Corresponding author

Correspondence to Gerhard Tutz.


About this article


Cite this article

Tutz, G., Koch, D. Improved nearest neighbor classifiers by weighting and selection of predictors. Stat Comput 26, 1039–1057 (2016). https://doi.org/10.1007/s11222-015-9588-z

