Abstract
Scaling data mining algorithms to data of both high dimensionality and high cardinality has recently been recognized as one of the most challenging problems in data mining research. The reason is that typical data mining tasks, such as clustering, cannot produce high-quality results when applied to high-dimensional and/or large (in terms of cardinality) datasets. Data preprocessing, and in particular dimensionality reduction, is a promising tool for addressing this problem. However, most existing dimensionality reduction algorithms share the same disadvantages as data mining algorithms when applied to large datasets of high dimensionality. In this article, we propose a fast and efficient dimensionality reduction algorithm (FEDRA), which is particularly scalable and therefore well suited to such challenging datasets. FEDRA follows the landmark-based paradigm for embedding data objects in a low-dimensional projection space. By means of a theoretical analysis, we prove that FEDRA is efficient, and we demonstrate the quality of its results through experiments on datasets of higher cardinality and dimensionality than those employed in the evaluation of competing algorithms. The results show that FEDRA retains or even improves clustering quality while projecting to less than 10% of the initial dimensionality. Moreover, our algorithm produces embeddings that enable faster convergence of clustering algorithms. FEDRA therefore emerges as a powerful and generic tool for data preprocessing, which can be integrated into other data mining algorithms to enhance their performance.
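To illustrate the landmark-based paradigm the abstract refers to, the following is a minimal, generic sketch and not the actual FEDRA embedding rule (which is defined in the article): each point's new coordinates are simply its distances to a small set of randomly chosen landmark points, so the cost is O(n·k) distance computations rather than the O(n²) pairwise distances required by classical MDS. The function name and the random landmark selection are illustrative assumptions.

```python
import numpy as np

def landmark_embedding(X, k, seed=None):
    """Embed n points into k dimensions via distances to k random landmarks.

    A generic baseline for the landmark-based paradigm (NOT FEDRA itself):
    pick k landmark points, then use each point's Euclidean distances to
    the landmarks as its coordinates in the projection space.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # choose k distinct landmark points from the dataset
    landmarks = X[rng.choice(n, size=k, replace=False)]
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d); norm over d gives (n, k)
    return np.linalg.norm(X[:, None, :] - landmarks[None, :, :], axis=2)

# usage: embed 1000 points from 100-D into 8-D (< 10% of the initial dimensionality)
X = np.random.default_rng(0).normal(size=(1000, 100))
Y = landmark_embedding(X, k=8, seed=0)
```

Because only n·k distances are computed, the landmark paradigm scales linearly in the dataset cardinality for a fixed target dimensionality, which is the property that makes such methods attractive for the large, high-dimensional datasets discussed above.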
Enhancing Clustering Quality through Landmark-Based Dimensionality Reduction