Abstract
With the arrival of the big-data era, data mining algorithms have become increasingly important. The k-nearest neighbor (KNN) algorithm is a representative data classification algorithm: a simple classification method that is widely used in many fields. However, several unreasonable parameter settings limit KNN. Some limit its scope of application, e.g., sample feature values must be numeric; some limit its classification efficiency, e.g., too many training samples or too high a feature dimension; and some limit its classification accuracy, e.g., an unreasonable choice of K, an unreasonable distance metric, or an unreasonable class-voting method. This paper proposes methods to rationalize these parameters: feature value quantification, dimension reduction, weighted distance, and a weighted voting function. Experimental results on benchmark data demonstrate their effect.
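The weighted-distance and weighted-voting ideas mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' exact formulation: the inverse-distance weighting function, the tie-breaking behavior, and all names are assumptions for the sketch.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Classify `query` by distance-weighted voting among its k nearest
    training samples. Each neighbor votes with weight 1/(distance + eps),
    so closer neighbors influence the result more than distant ones.
    (Inverse-distance weighting is one common choice, not necessarily the
    paper's exact weighting function.)"""
    # Euclidean distance from the query to every training sample
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    dists.sort(key=lambda t: t[0])
    # Accumulate distance-weighted votes from the k nearest neighbors
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)

# Toy example: two well-separated clusters in 2-D
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
y = ["a", "a", "a", "b", "b"]
print(weighted_knn_predict(X, y, (0.15, 0.15), k=3))  # → a
```

Compared with unweighted majority voting, the weighted scheme lets a single very close neighbor outvote several distant ones, which is the kind of rationalization of the voting parameter the paper argues for.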
Acknowledgement
This work was supported by the National Natural Science Foundation of China (Grant No. 61272513) and Beijing Municipal Science and Technology Project (Grant No. D151100004215003).
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Liu, J., Zhao, G., Zheng, Y. (2015). Rationalizing the Parameters of K-Nearest Neighbor Classification Algorithm. In: Qiang, W., Zheng, X., Hsu, CH. (eds) Cloud Computing and Big Data. CloudCom-Asia 2015. Lecture Notes in Computer Science(), vol 9106. Springer, Cham. https://doi.org/10.1007/978-3-319-28430-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28429-3
Online ISBN: 978-3-319-28430-9
eBook Packages: Computer Science (R0)