
Towards enriching the quality of k-nearest neighbor rule for document classification

Original Article
International Journal of Machine Learning and Cybernetics

Abstract

The k-nearest neighbor (kNN) rule is a simple and effective classifier for document classification. In this method, a test document is assigned to the class that has the maximum representation among its k nearest neighbors in the training set. The k nearest neighbors of a test document are determined by ranking the training documents according to their content similarity with the test document. Document classification is challenging because of the large number of attributes in the data set; owing to the sparsity of the data, many attributes provide no information about a particular document. Consequently, assigning a document to a predefined class may be inaccurate for a large value of k when the margin of majority voting is one or when a tie occurs. This article modifies the kNN rule by putting a threshold on the majority voting, and it proposes a discrimination criterion to prune the actual search space of the test document. The proposed classification rule enhances the confidence of the voting process and makes no prior assumption about the number of nearest neighbors. Experimental evaluation on several well known text data sets shows that the accuracy of the proposed method is significantly better than that of the traditional kNN method as well as several other document classification methods.
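
To make the voting modification concrete, below is a minimal Python sketch of kNN voting with a threshold on the winning margin, assuming tf-idf cosine similarity as the content-similarity measure. The function name, the margin parameter, and the abstain-on-weak-margin fallback are illustrative assumptions, not the authors' exact procedure; in particular, the paper's discrimination criterion for pruning the search space is not reproduced here.

    import numpy as np
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def thresholded_knn_predict(train_texts, train_labels, test_text, k=15, margin=2):
        """Hypothetical sketch: label a test document only when the vote is decisive.

        The top class must beat the runner-up by at least `margin` votes among
        the k most similar training documents; otherwise the decision is
        deferred and None is returned.
        """
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform(train_texts)       # training term-weight matrix
        q = vec.transform([test_text])           # test document in the same space
        sims = cosine_similarity(q, X).ravel()   # content similarity to each training doc
        top = np.argsort(sims)[::-1][:k]         # indices of the k nearest neighbors
        votes = Counter(train_labels[i] for i in top)
        ranked = votes.most_common()
        best_label, best_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if best_count - runner_up >= margin:     # accept only a confident majority
            return best_label
        return None                              # tie or one-vote margin: abstain

For instance, calling thresholded_knn_predict(docs, labels, "price of grain rises", k=5, margin=2) would return a class label only when at least two more neighbors support that class than support any other class.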


Notes

  1. http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz.

  2. http://www.textfixer.com/resources/common-english-words.txt.

  3. The test statistic is of the form \(t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{s^2_1/n_1+s^2_2/n_2}},\) where \(\bar{x}_1, \bar{x}_2\) are the sample means, \(s_1, s_2\) are the standard deviations and \(n_1, n_2\) are the numbers of observations of the two samples.
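
For reference, a minimal computational sketch of this statistic, assuming two placeholder samples of accuracy values and that NumPy and SciPy are available; the numbers are illustrative only.

    import numpy as np
    from scipy import stats

    # Placeholder samples of accuracy values for two methods.
    x1 = np.array([0.91, 0.89, 0.93, 0.90, 0.92])
    x2 = np.array([0.86, 0.88, 0.85, 0.87, 0.84])

    # Direct evaluation of t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2).
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    t = (x1.mean() - x2.mean()) / np.sqrt(s1**2 / len(x1) + s2**2 / len(x2))

    # The same statistic via SciPy's unequal-variance (Welch) t-test.
    t_scipy, p_value = stats.ttest_ind(x1, x2, equal_var=False)
    assert np.isclose(t, t_scipy)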


Author information

Correspondence to Tanmay Basu.

About this article

Cite this article

Basu, T., Murthy, C.A. Towards enriching the quality of k-nearest neighbor rule for document classification. Int. J. Mach. Learn. & Cyber. 5, 897–905 (2014). https://doi.org/10.1007/s13042-013-0177-1
