ABSTRACT
We consider the relationship between training set size and the parameter k for the k-Nearest Neighbors (kNN) classifier. When few examples are available, we observe that accuracy is sensitive to the choice of k and that the best k tends to increase with training size. We then examine the resulting risk that a k tuned on partitions of the data will be suboptimal after the partitions are aggregated and the classifier is re-trained. This risk is most severe when little data is available. For larger training sizes, accuracy becomes increasingly stable with respect to k and the risk decreases.
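The sensitivity described above can be illustrated with a minimal sketch (not the paper's experimental setup): a pure-numpy kNN classifier whose k is tuned on training sets of different sizes against a fixed validation set. The synthetic Gaussian data, the candidate k values, and all helper names are hypothetical choices for illustration only.

```python
# Hypothetical sketch: tune k for kNN on small vs. large training sets.
import numpy as np

rng = np.random.default_rng(0)

def knn_predict(X_train, y_train, X_test, k):
    """Majority-vote kNN with Euclidean distance."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
        nn = np.argsort(d)[:k]                    # indices of the k nearest
        votes = np.bincount(y_train[nn])
        preds.append(int(np.argmax(votes)))
    return np.array(preds)

def best_k(X_train, y_train, X_val, y_val, ks):
    """Return the k with the highest validation accuracy, plus all accuracies."""
    accs = {k: float((knn_predict(X_train, y_train, X_val, k) == y_val).mean())
            for k in ks}
    return max(accs, key=accs.get), accs

def make_data(n_per_class):
    """Two 2-D Gaussian classes with different means (hypothetical data)."""
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, 2))
    X1 = rng.normal(1.5, 1.0, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_val, y_val = make_data(200)
ks = [1, 3, 5, 9, 15, 25]

# Tuning on a small partition vs. the aggregated (larger) training set:
# the selected k will typically differ, which is the risk discussed above.
for n in (20, 200):
    X_tr, y_tr = make_data(n)
    k_star, accs = best_k(X_tr, y_tr, X_val, y_val, ks)
    print(f"train size={2 * n:4d}  best k={k_star}  val acc={accs[k_star]:.3f}")
```

With small training sets the accuracy-vs-k curve is noisy and the selected k is unstable across runs; as the training set grows, the curve flattens and the choice of k matters less, matching the trend the abstract reports.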