
Enhancing Clustering Quality through Landmark-Based Dimensionality Reduction

Published: 01 February 2011

Abstract

Scaling up data mining algorithms to data of both high dimensionality and high cardinality has lately been recognized as one of the most challenging problems in data mining research. The reason is that typical data mining tasks, such as clustering, cannot produce high-quality results when applied to high-dimensional and/or large (in terms of cardinality) datasets. Data preprocessing, and in particular dimensionality reduction, constitutes a promising tool to deal with this problem. However, most existing dimensionality reduction algorithms share the same disadvantages as data mining algorithms when applied to large datasets of high dimensionality. In this article, we propose a fast and efficient dimensionality reduction algorithm (FEDRA), which is particularly scalable and therefore suitable for such challenging datasets. FEDRA follows the landmark-based paradigm for embedding data objects in a low-dimensional projection space. By means of a theoretical analysis, we prove that FEDRA is efficient, while we demonstrate the quality of its results through experiments on datasets of higher cardinality and dimensionality than those employed in the evaluation of competing algorithms. The results show that FEDRA retains or even improves clustering quality while projecting to less than 10% of the initial dimensionality. Moreover, our algorithm produces embeddings that enable faster convergence of clustering algorithms. Therefore, FEDRA emerges as a powerful and generic tool for data preprocessing that can be integrated with other data mining algorithms, thus enhancing their performance.
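To make the landmark-based paradigm concrete, the sketch below (Python/NumPy) illustrates the generic idea the abstract refers to: select a small set of landmark objects, embed the landmarks in the target space, and then place every other object using only its distances to the landmarks. This is a minimal Landmark-MDS-style illustration, not the FEDRA procedure itself; the function and variable names are assumptions made here for the example.

import numpy as np

def landmark_embedding(X, n_landmarks=10, n_dims=2, seed=0):
    # Illustrative landmark-based embedding (Landmark-MDS-style sketch),
    # NOT the FEDRA algorithm described in the article.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=n_landmarks, replace=False)
    L = X[idx]

    # Squared Euclidean distances: among landmarks, and points-to-landmarks.
    D2_ll = ((L[:, None, :] - L[None, :, :]) ** 2).sum(axis=-1)   # (k, k)
    D2_xl = ((X[:, None, :] - L[None, :, :]) ** 2).sum(axis=-1)   # (n, k)

    # Step 1: classical MDS on the landmarks (double-center, eigendecompose).
    k = n_landmarks
    J = np.eye(k) - np.ones((k, k)) / k
    B = -0.5 * J @ D2_ll @ J
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:n_dims]
    Lk = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))     # (k, d)

    # Step 2: place every point from its distances to the landmarks
    # (distance-based triangulation via the pseudo-inverse of Lk).
    mean_d2 = D2_ll.mean(axis=0)                                  # (k,)
    Y = -0.5 * (np.linalg.pinv(Lk) @ (D2_xl - mean_d2).T).T       # (n, d)
    return Y, idx

# Example: reduce 1000 points from 100 to 5 dimensions.
if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(1000, 100))
    Y, landmarks = landmark_embedding(X, n_landmarks=20, n_dims=5)
    print(Y.shape)  # (1000, 5)

The low-dimensional coordinates Y can then be fed to a clustering algorithm such as k-means, which is the usage scenario the article evaluates: only distances to the landmarks are needed per point, which is what makes landmark-based methods attractive for datasets of high cardinality and dimensionality.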



        • Published in

          ACM Transactions on Knowledge Discovery from Data, Volume 5, Issue 2 (February 2011), 192 pages
          ISSN: 1556-4681
          EISSN: 1556-472X
          DOI: 10.1145/1921632

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 February 2011
          • Accepted: 1 June 2010
          • Revised: 1 May 2010
          • Received: 1 December 2009
