Abstract
High-dimensional data arise naturally in many domains and have long posed a challenge for traditional data-mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the growing difficulty of distinguishing between distances to different data points. In this paper we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by restricting attention to a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in the k-nearest-neighbor lists of other points, can be successfully exploited in clustering. We validate this hypothesis by proposing several hubness-based clustering algorithms and testing them on high-dimensional data. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise.
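The hubness phenomenon the abstract refers to is typically quantified by the k-occurrence N_k(x): the number of other points whose k-nearest-neighbor lists contain x. The following is a minimal sketch of that measurement (the function name `k_occurrence_counts` is illustrative, not from the paper), using a brute-force Euclidean k-NN search:

```python
import math
from collections import Counter

def k_occurrence_counts(points, k):
    """Return N_k(x) for each point x: how many other points'
    k-nearest-neighbor lists x appears in."""
    n = len(points)
    counts = Counter()
    for i in range(n):
        # Distances from point i to every other point.
        dists = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(n) if j != i
        )
        # Credit each of i's k nearest neighbors with one occurrence.
        for _, j in dists[:k]:
            counts[j] += 1
    return [counts[j] for j in range(n)]

# In high dimensions the distribution of N_k becomes skewed: a few
# points (hubs) accumulate very large counts, while many points
# (antihubs) appear in almost no k-NN lists.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(k_occurrence_counts(pts, 1))
```

Note that the counts always sum to n·k, so a skewed distribution necessarily pairs hubs with antihubs; hubness-based clustering exploits the hubs as local density indicators.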
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Tomašev, N., Radovanović, M., Mladenić, D., Ivanović, M. (2011). The Role of Hubness in Clustering High-Dimensional Data. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol. 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_16
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6