Hubness-Aware Shared Neighbor Distances for High-Dimensional k-Nearest Neighbor Classification

Tomašev, Nenad; Mladenić, Dunja

doi:10.1007/978-3-642-28931-6_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7209)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems

Abstract

Learning from high-dimensional data is usually a challenging task, as captured by the well-known phrase "curse of dimensionality". Most distance-based methods become impaired due to the distance concentration of many widely used metrics in high-dimensional spaces. One recently proposed approach suggests that using secondary distances based on the number of shared k-nearest neighbors between different points might partly resolve the concentration issue, thereby improving overall performance. Nevertheless, the curse of dimensionality also affects k-nearest neighbor inference in severely negative ways, one such consequence being known as hubness. The impact of hubness on forming shared neighbor distances has not been discussed before, and it is what we focus on in this paper. Furthermore, we propose a new method for calculating the secondary distances which is aware of the underlying neighbor occurrence distribution. Our experiments suggest that this new approach achieves consistently superior performance on all considered high-dimensional data sets. An additional benefit is that it requires essentially no extra computation compared to the original methods.
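To make the two quantities named in the abstract concrete, the following is a minimal sketch, not the authors' method. It assumes Euclidean primary distances, defines the plain shared-neighbor distance between two points as one minus the fraction of their k-nearest neighbors that coincide, and counts how often each point occurs in other points' neighbor lists (the neighbor occurrence distribution whose skew is referred to as hubness). The function names `knn_indices`, `shared_neighbor_distance`, and `neighbor_occurrences` are hypothetical, and the hubness-aware weighting proposed in the paper is not reproduced here.

```python
import numpy as np


def knn_indices(X, k):
    """Return, for each row of X, the indices of its k nearest neighbors.

    Assumption: Euclidean primary distance; a point is never its own neighbor.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self-matches
    return np.argsort(dist, axis=1, kind="stable")[:, :k]


def shared_neighbor_distance(neighbors, n_points):
    """Secondary distance: 1 - (number of shared k-nearest neighbors) / k."""
    k = neighbors.shape[1]
    # member[i, j] == 1 when point j is among the k nearest neighbors of i
    member = np.zeros((neighbors.shape[0], n_points), dtype=np.int64)
    np.put_along_axis(member, neighbors, 1, axis=1)
    shared = member @ member.T  # pairwise counts of shared neighbors
    return 1.0 - shared / k


def neighbor_occurrences(neighbors, n_points):
    """Count how many k-neighbor lists each point appears in."""
    return np.bincount(neighbors.ravel(), minlength=n_points)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 100))  # synthetic high-dimensional data
    nn = knn_indices(X, k=10)
    d_snn = shared_neighbor_distance(nn, len(X))
    counts = neighbor_occurrences(nn, len(X))
    print("largest occurrence count:", counts.max())
    print("points that are nobody's neighbor:", int((counts == 0).sum()))
```

A strongly skewed `counts` array, with a few points appearing in many neighbor lists and many points appearing in none, is the hubness pattern the abstract describes. Because the shared-neighbor distance is built from those same lists, hub points can contribute to many of the pairwise overlaps.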

Author information

Authors and Affiliations

Artificial Intelligence Laboratory, Institute Jožef Stefan, Jamova 39, 1000, Ljubljana, Slovenia
Nenad Tomašev & Dunja Mladenić

Editor information

Editors and Affiliations

Universidad de Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Emilio Corchado
VŠB-TU Ostrava, 17. listopadu 15, 70833, Ostrava, Czech Republic
Václav Snášel
Machine Intelligence Research Labs (MIR Labs), Scientific Network for Innovation and Research Excellence, P.O. Box 2259, 98071, Auburn, Washington, USA
Ajith Abraham
Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland
Michał Woźniak
University of the Basque Country, Pº Manuel Lardizabal 1, 20018, San Sebastian, Spain
Manuel Graña
Yonsei University, 134 Shinchon-dong, 120-749, Sudaemoon-ku, Seoul, Korea
Sung-Bae Cho

Cite this paper

Tomašev, N., Mladenić, D. (2012). Hubness-Aware Shared Neighbor Distances for High-Dimensional k-Nearest Neighbor Classification. In: Corchado, E., Snášel, V., Abraham, A., Woźniak, M., Graña, M., Cho, SB. (eds) Hybrid Artificial Intelligent Systems. HAIS 2012. Lecture Notes in Computer Science, vol 7209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28931-6_12

DOI: https://doi.org/10.1007/978-3-642-28931-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28930-9
Online ISBN: 978-3-642-28931-6
eBook Packages: Computer Science, Computer Science (R0)
