
Towards enriching the quality of k-nearest neighbor rule for document classification

Original Article
International Journal of Machine Learning and Cybernetics

Abstract

The k-nearest neighbor (kNN) rule is a simple and effective classifier for document classification. In this method, a test document is assigned to the class that has the maximum representation among its k nearest neighbors in the training set. The k nearest neighbors of a test document are determined by ranking the training documents according to their content similarity with the test document. Document classification is challenging because of the large number of attributes in the data set; owing to the sparsity of the data, many attributes provide no information about a particular document. Consequently, assigning a document to a predefined class may be inaccurate for a large value of k when the margin of majority voting is one or when a tie occurs. This article modifies the kNN rule by putting a threshold on the majority voting, and it proposes a discrimination criterion to prune the actual search space of the test document. The proposed classification rule enhances the confidence of the voting process and makes no prior assumption about the number of nearest neighbors. Experimental evaluation on several well known text data sets shows that the accuracy of the proposed method is significantly better than that of the traditional kNN method as well as several other document classification methods.
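
To make the voting modification concrete, below is a minimal Python sketch of kNN voting with a threshold on the winning margin, assuming tf-idf cosine similarity as the content-similarity measure. The function name, the margin parameter, and the abstain-on-weak-margin fallback are illustrative assumptions, not the authors' exact procedure; in particular, the paper's discrimination criterion for pruning the search space is not reproduced here.

    import numpy as np
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def thresholded_knn_predict(train_texts, train_labels, test_text, k=15, margin=2):
        """Hypothetical sketch: label a test document only when the vote is decisive.

        The top class must beat the runner-up by at least `margin` votes among
        the k most similar training documents; otherwise the decision is
        deferred and None is returned.
        """
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform(train_texts)       # training term-weight matrix
        q = vec.transform([test_text])           # test document in the same space
        sims = cosine_similarity(q, X).ravel()   # content similarity to each training doc
        top = np.argsort(sims)[::-1][:k]         # indices of the k nearest neighbors
        votes = Counter(train_labels[i] for i in top)
        ranked = votes.most_common()
        best_label, best_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if best_count - runner_up >= margin:     # accept only a confident majority
            return best_label
        return None                              # tie or one-vote margin: abstain

For instance, calling thresholded_knn_predict(docs, labels, "price of grain rises", k=5, margin=2) would return a class label only when at least two more neighbors support that class than support any other class.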


Notes

  1. http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz.

  2. http://www.textfixer.com/resources/common-english-words.txt.

  3. The test statistic is of the form \(t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{s^2_1/n_1+s^2_2/n_2}},\) where \(\bar{x}_1, \bar{x}_2\) are the sample means, \(s_1, s_2\) are the standard deviations and \(n_1, n_2\) are the numbers of observations of the two samples.
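
For reference, a minimal computational sketch of this statistic, assuming two placeholder samples of accuracy values and that NumPy and SciPy are available; the numbers are illustrative only.

    import numpy as np
    from scipy import stats

    # Placeholder samples of accuracy values for two methods.
    x1 = np.array([0.91, 0.89, 0.93, 0.90, 0.92])
    x2 = np.array([0.86, 0.88, 0.85, 0.87, 0.84])

    # Direct evaluation of t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2).
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    t = (x1.mean() - x2.mean()) / np.sqrt(s1**2 / len(x1) + s2**2 / len(x2))

    # The same statistic via SciPy's unequal-variance (Welch) t-test.
    t_scipy, p_value = stats.ttest_ind(x1, x2, equal_var=False)
    assert np.isclose(t, t_scipy)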


Author information

Correspondence to Tanmay Basu.

About this article

Cite this article

Basu, T., Murthy, C.A. Towards enriching the quality of k-nearest neighbor rule for document classification. Int. J. Mach. Learn. & Cyber. 5, 897–905 (2014). https://doi.org/10.1007/s13042-013-0177-1
