The Role of Hubness in Clustering High-Dimensional Data

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6634)

Abstract

High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data-mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. In this paper we take a novel perspective on the problem of clustering high-dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower-dimensional feature subspace, we embrace dimensionality by taking advantage of some inherently high-dimensional phenomena. More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs) that frequently occur in k-nearest neighbor lists of other points, can be successfully exploited in clustering. We validate our hypothesis by proposing several hubness-based clustering algorithms and testing them on high-dimensional data. Experimental results demonstrate good performance of our algorithms in multiple settings, particularly in the presence of large quantities of noise.
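
To make the notion of hubness in the abstract concrete, the following is a minimal sketch of how the k-occurrence count of each point (the number of k-nearest neighbor lists it appears in) could be computed. It is not the authors' implementation: the function name `k_occurrences`, the use of Euclidean distance, the brute-force neighbor search, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import numpy as np


def k_occurrences(X, k):
    """Count how often each point appears in the k-NN lists of the other points.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances (brute force, O(n^2) memory).
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # A point is never counted as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    # Indices of the k nearest neighbors of every point.
    knn = np.argsort(d2, axis=1)[:, :k]
    # Number of k-NN lists that contain each point.
    return np.bincount(knn.ravel(), minlength=n)


# Synthetic data (an assumption, not from the paper): same sample size,
# low versus high dimensionality.
rng = np.random.default_rng(0)
low = k_occurrences(rng.standard_normal((500, 3)), k=10)
high = k_occurrences(rng.standard_normal((500, 100)), k=10)
print("max k-occurrence, 3 dimensions:  ", low.max())
print("max k-occurrence, 100 dimensions:", high.max())
```

Points with unusually large counts are the hubs the abstract refers to. Comparing the two printed maxima gives a rough feel for how strongly a few points can dominate neighbor lists as dimensionality grows, although the exact values depend on the data and the random seed.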

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tomašev, N., Radovanović, M., Mladenić, D., Ivanović, M. (2011). The Role of Hubness in Clustering High-Dimensional Data. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_16

  • DOI: https://doi.org/10.1007/978-3-642-20841-6_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20840-9

  • Online ISBN: 978-3-642-20841-6

  • eBook Packages: Computer Science, Computer Science (R0)
