
GS4: Generating Synthetic Samples for Semi-Supervised Nearest Neighbor Classification

  • Conference paper
Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)


Abstract

In this paper, we propose a method to improve nearest neighbor classification accuracy under a semi-supervised setting. We call our approach GS4 (i.e., Generating Synthetic Samples Semi-Supervised). Existing self-training approaches classify unlabeled samples by exploiting local information. These samples are then incorporated into the training set of labeled data. However, errors are propagated and misclassifications at an early stage severely degrade the classification accuracy. To address this problem, the proposed method exploits the unlabeled data by using weights proportional to the classification confidence to generate synthetic samples. Specifically, our scheme is inspired by the Synthetic Minority Over-Sampling Technique. That is, each unlabeled sample is used to generate as many labeled samples as the number of classes represented by its \(k\)-nearest neighbors. In particular, the distance of each synthetic sample from its \(k\)-nearest neighbors of the same class is proportional to the classification confidence. As a result, the robustness to misclassification errors is increased and better accuracy is achieved. Experimental results using publicly available datasets demonstrate that statistically significant improvements are obtained when the proposed approach is employed.
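The abstract's generation rule can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes the classification confidence for a class is the fraction of the unlabeled sample's \(k\)-nearest labeled neighbors belonging to that class, and that each synthetic sample is placed along the segment between a same-class neighbor and the unlabeled point, at a distance from that neighbor proportional to the confidence. The function name `gs4_synthetic_samples` and these choices are assumptions; the exact GS4 weighting scheme is given in the paper itself.

```python
import numpy as np

def gs4_synthetic_samples(X_l, y_l, X_u, k=5):
    """Sketch of SMOTE-style synthetic generation guided by k-NN confidence.

    X_l, y_l : labeled samples and their labels
    X_u      : unlabeled samples
    Returns synthetic samples and the labels assigned to them.
    """
    synth_X, synth_y = [], []
    for x in X_u:
        # Distances from the unlabeled sample to every labeled sample
        d = np.linalg.norm(X_l - x, axis=1)
        nn = np.argsort(d)[:k]          # indices of the k nearest labeled samples
        labels = y_l[nn]
        # One synthetic sample per class represented among the k neighbors
        for c in np.unique(labels):
            same = nn[labels == c]      # same-class neighbors, nearest first
            conf = len(same) / k        # assumed confidence: neighbor-class fraction
            ref = X_l[same[0]]          # nearest same-class neighbor
            # Distance from the same-class neighbor is proportional to conf:
            # high confidence places the synthetic sample close to x itself.
            synth_X.append(ref + conf * (x - ref))
            synth_y.append(c)
    return np.array(synth_X), np.array(synth_y)
```

Under this reading, an unlabeled point whose neighborhood is dominated by one class yields a single high-confidence synthetic sample near the point itself, while a point on a class boundary yields several cautious samples, each pulled back toward its own class.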


References

  1. Brown, M., Forsythe, A.: Robust tests for the equality of variances. J. Am. Stat. Assoc. 69(346), 364–367 (1974)

  2. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised Learning, vol. 2. MIT Press, Cambridge (2006)

  3. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

  4. Cohen, I., Cozman, F., Sebe, N., Cirelo, M., Huang, T.: Semisupervised learning of classifiers: theory, algorithms, and their application to human-computer interaction. IEEE Trans. Pattern Anal. Mach. Intell. 26(12), 1553–1566 (2004)

  5. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theor. 13(1), 21–27 (1967)

  6. Dean, N., Murphy, T., Downey, G.: Using unlabelled data to update classification rules with applications in food authenticity studies. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 55(1), 1–14 (2006)

  7. Ghosh, A.: A probabilistic approach for semi-supervised nearest neighbor classification. Pattern Recogn. Lett. 33(9), 1127–1133 (2012)

  8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)

  9. Merz, C., Murphy, P., Aha, D.: UCI repository of machine learning databases. Department of Information and Computer Science, University of California (2012)

  10. Wolfe, D., Hollander, M.: Nonparametric Statistical Methods. Wiley Series in Probability and Statistics. Wiley, New York (1973)

  11. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local and global consistency. Adv. Neural Inf. Process. Syst. 16(16), 321–328 (2004)

  12. Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Technical report, CMU-CALD-02-107, Carnegie Mellon University (2002)

  13. Zhu, X., Goldberg, A.: Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn. 3(1), 1–130 (2009)


Acknowledgments

This research was funded in part by the US Army Research Lab (W911NF-13-1-0127) and the UH Hugh Roy and Lillie Cranz Cullen Endowment Fund. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the sponsors.

Author information

Corresponding author

Correspondence to Panagiotis Moutafis.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Moutafis, P., Kakadiaris, I.A. (2014). GS4: Generating Synthetic Samples for Semi-Supervised Nearest Neighbor Classification. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_36

  • DOI: https://doi.org/10.1007/978-3-319-13186-3_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13185-6

  • Online ISBN: 978-3-319-13186-3

  • eBook Packages: Computer Science, Computer Science (R0)
