Improved k-nearest neighbor classification
Introduction
The k-nearest neighbor (k-NN) rule [1], [2], [3], [4], [5], [6], [7], [8] is a well-known decision rule widely used in pattern classification applications. The misclassification rate of the k-NN rule approaches the optimal Bayes error rate asymptotically as k increases [3], and the rule is particularly effective when the probability distributions of the feature variables are not known, a situation in which the Bayes decision rule [3] cannot be applied. The computational inefficiency of the k-NN rule stems from the following observation. Each template match has complexity O(n), where n is the dimension of the feature space, and to achieve a high recognition rate both the feature dimension n and the template size M are chosen to be large. For example, consider the GSC recognizer, which uses features based on the gradient, structural, and concavity aspects of a character image [8] together with the k-NN rule to achieve high classification accuracy. It has a feature dimension of 512 and a template size of 32,000 [8], making it quite inefficient to match a test pattern against the entire set of prototypes. In this paper we propose two effective techniques to improve efficiency: template condensing and preprocessing.
Template condensing is a well-studied companion of the nearest neighbor (1-NN) rule. A subset of prototypes is selected from the initial template such that classifying with any proper subset of the selection would degrade recognition accuracy; this greatly decreases the number of prototypes an unknown pattern must be compared against, with little sacrifice of accuracy [9], [10], [11], [12], [13]. In this paper, we develop a novel method of selecting the subset of prototypes for general k-NN classification. The idea is motivated by the observation that, if a large number of prototypes form a homogeneous cluster in feature space, then the number of prototypes in the neighborhood of a test pattern located in this region is usually much larger than k, the number that suffices for the k-NN rule. This observation is reinforced by the fact that k is usually kept quite small in real applications so that the search for the k nearest prototypes remains efficient. Our idea is to “sparsify” dense homogeneous clusters by iteratively eliminating patterns which exhibit high “attractive capacities” (defined in Section 3). This reduces the template size significantly while maintaining the level of classification accuracy, and in this sense the method presented in this paper differs from those described in [9], [10], [11], [12], [13].
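The precise definition of attractive capacity appears in Section 3, which is not reproduced in this preview. As an illustration only, the sketch below takes a prototype's "attractive capacity" to be the count of same-class prototypes among its r nearest neighbors, and repeatedly removes the prototype with the highest count — thinning dense homogeneous clusters while isolated and boundary prototypes survive. The capacity definition, the radius r, and the stopping criterion are all our assumptions, not the paper's.

```python
import math

def condense(template, labels, r=5, target_size=None):
    """Iteratively thin dense homogeneous clusters of prototypes.

    The 'attractive capacity' used here -- the number of same-class
    prototypes among a prototype's r nearest neighbors -- is an
    illustrative stand-in for the paper's definition in Section 3.
    """
    pts = list(zip(template, labels))
    target_size = target_size or max(1, len(pts) // 2)
    while len(pts) > target_size:
        def capacity(i):
            xi, ci = pts[i]
            # Sort all prototypes by distance to pts[i]; near[0] is i itself.
            near = sorted(range(len(pts)),
                          key=lambda j: math.dist(pts[j][0], xi))
            return sum(1 for j in near[1:r + 1] if pts[j][1] == ci)
        # Remove the prototype sitting deepest inside a homogeneous cluster.
        best = max(range(len(pts)), key=capacity)
        if capacity(best) == 0:   # no dense homogeneous region remains
            break
        del pts[best]
    return [p for p, _ in pts], [c for _, c in pts]
```

Boundary prototypes (those with mixed-class neighborhoods) score low and are retained, which is what preserves the decision surface.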
We also describe a preprocessing operation wherein an unknown pattern is matched against a prototype in two sequential stages. In the first stage a quick assessment of the potential for a match is made, motivated by the observation that the norm of a pattern vector is itself a characteristic of the pattern. For a full match to proceed to the second stage, the difference between the norms of the prototype and the test pattern must be less than a predetermined threshold, which is designed for each prototype individually. Prototypes that fail the first stage of matching are not considered any further, so a large portion of the prototypes is dynamically precluded. This preprocessing takes just one step, i.e., its complexity is O(1), independent of the dimensionality of the feature space. Furthermore, when properly applied, such preprocessing does not sacrifice accuracy, for it only rejects prototypes which are not “close” to the test pattern in feature space.
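The no-loss-of-accuracy claim rests on the triangle inequality for any norm-induced distance (our gloss, not reproduced from the paper): for a test pattern x and prototype y,

```latex
\bigl|\,\|x\| - \|y\|\,\bigr| \;\le\; \|x - y\|,
\qquad\text{so}\qquad
\bigl|\,\|x\| - \|y\|\,\bigr| > t \;\Longrightarrow\; \|x - y\| > t .
```

Hence a prototype rejected by the norm test is guaranteed to lie at distance greater than the threshold t from the test pattern; if t is chosen conservatively, no prototype that could enter the k nearest is ever discarded.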
The rest of the paper is organized as follows. In Section 2, we introduce general k-NN classification. In Sections 3 and 4, we present template condensing and preprocessing, respectively. We present experimental results in Section 5, and draw conclusions in Section 6.
Section snippets
Preliminary: k-NN classification
Let p be the number of classes, and let Ω = {ω1, ω2, …, ωp} be the set of class labels. Let T = {x1, x2, …, xM} be a set of labeled patterns referred to as a template. A labeled pattern xi ∈ Rⁿ in the template is referred to as a prototype, where n denotes the pattern dimension and M denotes the size of the template, i.e., the number of prototypes in the template. The class label of a prototype xi is denoted by ω(xi).
Let d(x, y) be the matching measure between patterns x and y, where d is supposed to be a non-negative
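For concreteness, the k-NN decision rule of this section can be sketched in a few lines of Python. This is our illustration, not the paper's implementation: Euclidean distance stands in for the generic matching measure, and ties in the majority vote are broken arbitrarily.

```python
import math
from collections import Counter

def knn_classify(template, labels, x, k=3):
    """Classify pattern x by majority vote among its k nearest prototypes.

    template : list of prototype feature vectors
    labels   : class label of each prototype
    x        : unknown pattern to classify
    """
    # Full template scan: one O(n) distance computation per prototype.
    dists = [(math.dist(p, x), c) for p, c in zip(template, labels)]
    dists.sort(key=lambda t: t[0])
    # Majority vote over the k nearest prototypes.
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]
```

The M distance computations, each O(n), are exactly the cost that the condensing and preprocessing techniques of Sections 3 and 4 attack.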
Template condensing
In k-NN classification the (k+1)st, (k+2)nd, …, nearest prototypes in the template to an unknown pattern x do not affect the classification of x. In fact, k is usually chosen to be a small number; otherwise, selecting the k nearest patterns over a template of size M, after all matching measures are calculated, requires computational complexity O(kM/p) [14]. Often the number of prototypes (all of a single class) which are nearer to x than prototypes of other classes is much larger than k (which gives the
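The selection step mentioned above — extracting the k nearest prototypes once all M matching measures are computed — need not sort the whole list. As a sketch (our illustration; the paper cites a different selection scheme from [14]), a bounded heap does it in O(M log k) rather than the O(M log M) of a full sort:

```python
import heapq

def k_nearest(measures, k):
    """Select the k smallest (distance, label) pairs out of M measures.

    heapq.nsmallest maintains a heap of size k while scanning the M
    entries once, so the cost is O(M log k) instead of O(M log M).
    """
    return heapq.nsmallest(k, measures, key=lambda t: t[0])
```

Because k is small in practice, the selection cost is dominated by the M distance computations themselves, which is why the paper focuses on reducing M.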
Preprocessing
In the previous section we have introduced a method to reduce the template size while maintaining nearly the original accuracy. In this section, we further enhance the efficiency of the k-NN algorithm. Our idea is to reject a large part of the template prototypes dynamically by carrying out computationally efficient preprocessing.
We observe that the norm of a prototype, ||·||, is a special characteristic of that prototype when appropriately defined (usually the ℓ1 or ℓ2 norm). An unknown pattern
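The two-stage matching described in the introduction can be sketched as follows. This is a hedged illustration: the paper designs one threshold per prototype, whereas a single global threshold t and the ℓ2 norm are used here for brevity, and the fallback to a full scan when no prototype survives is our addition.

```python
import math
from collections import Counter

def knn_prefiltered(template, labels, x, k=3, t=1.0):
    """Two-stage match: an O(1) norm comparison gates each O(n) distance.

    By the triangle inequality, |  ||p|| - ||x||  | <= ||p - x||, so no
    prototype within distance t of x is ever rejected in stage one.
    """
    nx = math.hypot(*x)
    norms = [math.hypot(*p) for p in template]   # precomputed once in practice
    # Stage 1: reject prototypes whose norm differs from ||x|| by more than t.
    survivors = [(p, c) for p, c, pn in zip(template, labels, norms)
                 if abs(pn - nx) <= t]
    if not survivors:                            # degenerate case: full scan
        survivors = list(zip(template, labels))
    # Stage 2: full O(n) matching only on the surviving prototypes.
    dists = sorted((math.dist(p, x), c) for p, c in survivors)
    return Counter(c for _, c in dists[:k]).most_common(1)[0][0]
```

Since the prototype norms are computed once offline, the per-query cost of stage one is a single subtraction and comparison per prototype, independent of the feature dimension n.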
Experimental results
In this section we describe the application of the two techniques of Sections 3 and 4 to handwritten numeral recognition, where the number of classes is 10 (p=10). The training set of 126,000 patterns has an equal number of patterns in each class; the testing set has 25,300 patterns, again with an equal number in each class. The experimental platform is a SPARC computer.
In our first case study, the developed techniques are applied to the “Gradient”
Conclusions and future studies
In this paper we have shown how to improve the efficiency of the k-NN classification by incorporating two novel ideas. The first idea is the reduction of the template size using the concept of attractive capacity. The second idea is a preprocessing method to preclude participation of a large portion of prototype patterns which are unlikely to match the test pattern. This work notably speeds up the classification without compromising accuracy.
The proposed template reduction technique is distinct
Acknowledgements
The authors would like to thank the anonymous referees for their numerous comments, which greatly improved and clarified the presentation.
References (14)
- et al., A new nearest-neighbor rule in the pattern classification problem, Pattern Recognition (1999)
- et al., Pattern classification using an efficient KNNR, Pattern Recognition (1992)
- et al., Nearest neighbor pattern classification, IEEE Trans. Inform. Theory (1967)
- et al., Pattern Classification and Scene Analysis (1973)
- et al., k-nearest-neighbor Bayes-risk estimation, IEEE Trans. Inform. Theory (1975)
- et al., A fuzzy k-nearest neighbor algorithm, IEEE Trans. Systems Man Cybernet. (1985)
- S.A. Dudani, The Distance-Weighted k-Nearest Neighbor Rule, in: Nearest Neighbor (NN) Norms: NN Pattern Classification...
About the Author—YINGQUAN WU received the B.S. and M.S. degrees in Mathematics from the Harbin Institute of Technology, Harbin, P. R. China, in 1995 and 1997, respectively. He received the M.S. degree in the Department of Electrical Engineering, State University of New York at Buffalo, USA, in 2000. Since 2000, he has been pursuing a Ph.D. in the Department of Electrical & Computer Engineering at the University of Illinois at Urbana-Champaign, USA.
About the Author—KRASSIMIR IANAKIEV received a Master's (Hons.) degree from Sofia University in 1989, and Ph.D. degrees in Mathematics and in Computer Science and Engineering in 1998 and 2000, respectively. His research interests include pattern recognition and fuzzy systems.
About the Author—VENU GOVINDARAJU received his Ph.D. in Computer Science from the State University of New York at Buffalo in 1992 and a Bachelor of Technology from the Indian Institute of Technology, Kharagpur, in 1986. Venu has co-authored over 115 technical papers (26 in journals) and holds one US patent on cursive script recognition. His main areas of interest are human-computer interaction and pattern recognition. He is currently the associate director of the Center of Excellence for Document Analysis and Recognition (CEDAR) and concurrently holds an associate professorship in the Department of Computer Science and Engineering, University at Buffalo. He is an associate editor of the Journal of Pattern Recognition and the IEEE Transactions on Pattern Analysis and Machine Intelligence. Venu Govindaraju is the Program Co-chair of the upcoming Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR) in 2002. He is a senior member of the IEEE.