
Enhancing Clustering Quality through Landmark-Based Dimensionality Reduction

Published: 01 February 2011

Abstract

Scaling up data mining algorithms to data of both high dimensionality and high cardinality has lately been recognized as one of the most challenging problems in data mining research. The reason is that typical data mining tasks, such as clustering, cannot produce high-quality results when applied to high-dimensional and/or large (in terms of cardinality) datasets. Data preprocessing, and in particular dimensionality reduction, constitutes a promising tool to deal with this problem. However, most existing dimensionality reduction algorithms share the same disadvantages as data mining algorithms when applied to large datasets of high dimensionality. In this article, we propose a fast and efficient dimensionality reduction algorithm (FEDRA), which is particularly scalable and therefore suitable for such challenging datasets. FEDRA follows the landmark-based paradigm for embedding data objects in a low-dimensional projection space. By means of a theoretical analysis, we prove that FEDRA is efficient, while we demonstrate the quality of its results through experiments on datasets of higher cardinality and dimensionality than those employed in the evaluation of competing algorithms. The results show that FEDRA retains or even improves clustering quality while projecting to less than 10% of the initial dimensionality. Moreover, our algorithm produces embeddings that enable faster convergence of clustering algorithms. Therefore, FEDRA emerges as a powerful and generic tool for data preprocessing that can be integrated with other data mining algorithms, thus enhancing their performance.
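To make the landmark-based paradigm concrete, the sketch below (Python/NumPy) illustrates the generic idea the abstract refers to: select a small set of landmark objects, embed the landmarks in the target space, and then place every other object using only its distances to the landmarks. This is a minimal Landmark-MDS-style illustration, not the FEDRA procedure itself; the function and variable names are assumptions made here for the example.

import numpy as np

def landmark_embedding(X, n_landmarks=10, n_dims=2, seed=0):
    # Illustrative landmark-based embedding (Landmark-MDS-style sketch),
    # NOT the FEDRA algorithm described in the article.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=n_landmarks, replace=False)
    L = X[idx]

    # Squared Euclidean distances: among landmarks, and points-to-landmarks.
    D2_ll = ((L[:, None, :] - L[None, :, :]) ** 2).sum(axis=-1)   # (k, k)
    D2_xl = ((X[:, None, :] - L[None, :, :]) ** 2).sum(axis=-1)   # (n, k)

    # Step 1: classical MDS on the landmarks (double-center, eigendecompose).
    k = n_landmarks
    J = np.eye(k) - np.ones((k, k)) / k
    B = -0.5 * J @ D2_ll @ J
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:n_dims]
    Lk = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))     # (k, d)

    # Step 2: place every point from its distances to the landmarks
    # (distance-based triangulation via the pseudo-inverse of Lk).
    mean_d2 = D2_ll.mean(axis=0)                                  # (k,)
    Y = -0.5 * (np.linalg.pinv(Lk) @ (D2_xl - mean_d2).T).T       # (n, d)
    return Y, idx

# Example: reduce 1000 points from 100 to 5 dimensions.
if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(1000, 100))
    Y, landmarks = landmark_embedding(X, n_landmarks=20, n_dims=5)
    print(Y.shape)  # (1000, 5)

The low-dimensional coordinates Y can then be fed to a clustering algorithm such as k-means, which is the usage scenario the article evaluates: only distances to the landmarks are needed per point, which is what makes landmark-based methods attractive for datasets of high cardinality and dimensionality.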



        • Published in

          ACM Transactions on Knowledge Discovery from Data, Volume 5, Issue 2 (February 2011), 192 pages
          ISSN: 1556-4681
          EISSN: 1556-472X
          DOI: 10.1145/1921632

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 February 2011
          • Accepted: 1 June 2010
          • Revised: 1 May 2010
          • Received: 1 December 2009
