Abstract
Most data of interest in today's data-mining applications is complex and is usually represented by many different features. Such high-dimensional data is, by its very nature, often difficult for conventional machine-learning algorithms to handle, which is considered an aspect of the well-known curse of dimensionality. High-dimensional data therefore needs to be processed with care, and the design of machine-learning algorithms needs to take these factors into account. Furthermore, it has been observed that some of the properties that arise in high dimensions can in fact be exploited to improve overall algorithm design. One such phenomenon, related to nearest-neighbor learning methods, is known as hubness and refers to the emergence of very influential nodes (hubs) in k-nearest neighbor graphs. A crisp weighted voting scheme for the k-nearest neighbor classifier that exploits this notion has recently been proposed. In this paper we go a step further by embracing the soft approach and propose several fuzzy measures for k-nearest neighbor classification, all based on hubness, which express the fuzziness of elements appearing in the k-neighborhoods of other points. Experimental evaluation on real data from the UCI repository and the image domain suggests that the fuzzy approach provides a useful measure of confidence in the predicted labels, resulting in improvement over the crisp weighted method as well as the standard kNN classifier.
Notes
Skewness, the standardized 3rd moment of a probability distribution, is 0 if the distribution is symmetrical, while positive (negative) values indicate skew to the right (left).
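As a minimal sketch of how this note connects to hubness, the snippet below counts how often each point occurs in the k-nearest-neighbor lists of the other points and then computes the skewness of those counts as the standardized third moment. The function names `k_occurrences` and `skewness`, the Euclidean metric, and the random test data are illustrative assumptions, not taken from the paper. A clearly positive value would indicate a right-skewed occurrence distribution, that is, a few points (hubs) appearing in many neighbor lists.

```python
import numpy as np

def k_occurrences(X, k):
    """Count how often each row of X appears among the k nearest neighbors of the other rows."""
    # Pairwise squared Euclidean distances (assumption: Euclidean metric).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest neighbors of each point
    return np.bincount(nn.ravel(), minlength=X.shape[0])

def skewness(values):
    """Standardized third moment, E[(x - mu)^3] / sigma^3 (population form)."""
    v = np.asarray(values, dtype=float)
    mu = v.mean()
    sigma = v.std()                       # assumes the values are not all identical
    return np.mean((v - mu) ** 3) / sigma ** 3

# Hypothetical usage on synthetic data: 200 points in 50 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
counts = k_occurrences(X, k=5)
print(counts.sum())       # every point contributes exactly k neighbors, so this is 200 * 5
print(skewness(counts))   # sign shows the direction of skew of the occurrence counts
```

The brute-force distance matrix keeps the sketch short; for large data sets an approximate kNN graph construction would be the more practical choice.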
References
Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings of the 8th international conference on database theory (ICDT), Lecture notes in computer science, vol 1973. Springer, pp 420–434
Aucouturier JJ (2006) Ten experiments on the modelling of polyphonic timbre. Ph.D. thesis, University of Paris 6
Aucouturier JJ, Pachet F (2004) Improving timbre similarity: how high is the sky? J Negat Results Speech Audio Sci 1. http://jjtok.io/papers/JNRSAS-2004.pdf
Babu VS, Viswanath P (2009) Rough-fuzzy weighted k-nearest leader classifier for large data sets. Pattern Recogn 42(9):1719–1731
Buza K, Nanopoulos A, Schmidt-Thieme L (2011) INSIGHT: Efficient and effective instance selection for time-series classification. In: Proceedings of the 15th pacific-asia conference on knowledge discovery and data mining (PAKDD), Part II, Lecture Notes in Artificial Intelligence, vol 6635. Springer, pp 149–160
Cabello D, Barro S, Salceda JM, Ruiz R, Mira J (1991) Fuzzy k-nearest neighbor classifiers for ventricular arrhythmia detection. Int J Biomed Comput 27(2):77–93
Chen J, Fang H, Saad Y (2009) Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. J Mach Learn Res 10:1989–2012
Cintra ME, Camargo HA, Monard MC (2008) A study on techniques for the automatic generation of membership functions for pattern recognition. In: Congresso da Academia Trinacional de Ciências (C3N), vol 1, pp 1–10
Durrant RJ, Kabán A (2009) When is ‘nearest neighbour’ meaningful: a converse theorem and implications. J Complex 25(4):385–397
François D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886
Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A (2010) Can shared-neighbor distances defeat the curse of dimensionality? In: Proceedings of the 22nd international conference on scientific and statistical database management (SSDBM), Lecture Notes in Computer Science, vol 6187. Springer, pp 482–500
Huang WL, Chen HM, Hwang SF, Ho SY (2007) Accurate prediction of enzyme subfamily class using an adaptive fuzzy k-nearest neighbor method. Biosystems 90(2):405–413
Keller JM, Gray MR, Givens JA (1985) A fuzzy k-nearest neighbor algorithm. IEEE Trans Syst Man Cybern 15(4):580–585
Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52(3):239–281
Pham TD (2005) An optimally weighted fuzzy k-NN algorithm. In: Proceedings of the 3rd international conference on advances in pattern recognition (ICAPR), Part I, Lecture Notes in Computer Science, vol 3686. Springer, pp 239–247
Radovanović M, Nanopoulos A, Ivanović M (2009) Nearest neighbors in high-dimensional data: the emergence and influence of hubs. In: Proceedings of the 26th international conference on machine learning (ICML), pp 865–872
Radovanović M, Nanopoulos A, Ivanović M (2010) Hubs in space: Popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531
Radovanović M, Nanopoulos A, Ivanović M (2010) On the existence of obstinate results in vector space models. In: Proceedings of the 33rd annual international ACM SIGIR conference on research and development in information retrieval, pp 186–193
Radovanović M, Nanopoulos A, Ivanović M (2010) Time-series classification in many intrinsic dimensions. In: Proceedings of the 10th SIAM international conference on data mining (SDM), pp 677–688
Shen HB, Yang J, Chou KC (2006) Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition. J Theor Biol 240(1):9–13
Sim J, Kim SY, Lee J (2005) Prediction of protein solvent accessibility using fuzzy k-nearest neighbor method. Bioinformatics 21(12):2844–2849
Singpurwalla N, Booker JM (2004) Membership functions and probability measures of fuzzy sets. J Am Stat Assoc 99:867–877
Tomašev N, Brehar R, Mladenić D, Nedevschi S (2011) The influence of hubness on nearest-neighbor methods in object recognition. In: Proceedings of the 7th IEEE international conference on intelligent computer communication and processing (ICCP), pp 367–374
Tomašev N, Mladenić D (2011) Exploring the hubness-related properties of oceanographic sensor data. In: Proceedings of the 14th international multiconference on information society (IS), A:149–152
Tomašev N, Mladenić D (2011) The influence of weighting the k-occurrences on hubness-aware classification methods. In: Proceedings of the 14th international multiconference on information society (IS)
Tomašev N, Mladenić D (2012) Nearest neighbor voting in high dimensional data: learning from past occurrences. Comput Sci Inf Syst 9(2):691–712
Tomašev N, Radovanović M, Mladenić D, Ivanović M (2011) Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification. In: Proceedings of the 7th international conference on machine learning and data mining (MLDM), Lecture Notes in Artificial Intelligence, vol 6871. Springer, pp 16–30
Tomašev N, Radovanović M, Mladenić D, Ivanović M (2011) The role of hubness in clustering high-dimensional data. In: Proceedings of the 15th pacific-asia conference on knowledge discovery and data mining (PAKDD), Part I, Lecture Notes in Artificial Intelligence, vol 6634. Springer, pp 183–195
Wang XZ, He YL, Dong LC, Zhao HY (2011) Particle swarm optimization for determining fuzzy measures from data. Inf Sci 181(19):4230–4252
Yu S, Backer SD, Scheunders P (2002) Genetic feature selection combined with composite fuzzy nearest neighbor classifiers for hyperspectral satellite imagery. Pattern Recogn Lett 23(1–3):183–190
Zadeh LA (1965) Fuzzy sets. Inf Control 8(3):338–353
Zhang Z, Zhang R (2009) Multimedia data mining. Chapman and Hall, London
Zheng K, Fung PC, Zhou X (2010) K-nearest neighbor search for fuzzy objects. In: Proceedings of the 36th ACM SIGMOD international conference on management of data, pp 699–710
Zuo W, Zhang D, Wang K (2008) On kernel difference-weighted k-nearest neighbor classification. Pattern Anal Appl 11:247–257
Acknowledgments
This work was supported by the bilateral project between Slovenia and Serbia “Correlating images and words: Enhancing image analysis through machine learning and semantic technologies,” the Slovenian Research Agency, the Serbian Ministry of Education and Science through project no. OI174023, “Intelligent techniques and their integration into wide-spectrum decision support,” and the ICT Programme of the EC under PASCAL2 (ICT-NoE-216886) and PlanetData (ICT-NoE-257641).
Additional information
This is an extended version of the paper Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification, which was presented at the MLDM 2011 conference [27].
About this article
Cite this article
Tomašev, N., Radovanović, M., Mladenić, D. et al. Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification. Int. J. Mach. Learn. & Cyber. 5, 445–458 (2014). https://doi.org/10.1007/s13042-012-0137-1
DOI: https://doi.org/10.1007/s13042-012-0137-1