ABSTRACT
Finding latent patterns in high dimensional data is an important research problem with numerous applications. The best-known approaches to high dimensional data analysis are feature selection and dimensionality reduction. Although widely used, these methods aim to capture global patterns and are typically applied in the full feature space. In many emerging applications, however, scientists are interested in the local latent patterns held by feature subspaces, which may be invisible under any global transformation.
In this paper, we investigate the problem of finding strong linear and nonlinear correlations hidden in the feature subspaces of high dimensional data. We formalize this problem as identifying reducible subspaces in the full dimensional space. Intuitively, a reducible subspace is a feature subspace whose intrinsic dimensionality is smaller than its number of features. We present an effective algorithm, REDUS, for finding reducible subspaces. Its two key components are finding the overall reducible subspace and uncovering the individual reducible subspaces within it. A broad experimental evaluation demonstrates the effectiveness of our algorithm.
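To make the notion of a reducible subspace concrete, here is a minimal sketch that estimates the intrinsic dimensionality of a candidate feature subspace with a standard correlation-dimension (fractal) estimator and flags the subspace as reducible when the estimate falls clearly below the number of features. The quantile grid, the `is_reducible` threshold, and the function names are illustrative assumptions for this sketch, not the actual REDUS procedure.

```python
import numpy as np

def correlation_dimension(X, quantiles=(0.05, 0.10, 0.15, 0.20, 0.25)):
    """Estimate the intrinsic dimensionality of the points (rows) of X as
    the correlation dimension: the slope of log C(r) versus log r, where
    C(r) is the fraction of point pairs within distance r."""
    # All pairwise Euclidean distances (upper triangle, no self-pairs).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    pair_dists = dists[np.triu_indices(len(X), k=1)]
    # Choose radii as quantiles of the pair distances so that
    # C(r_q) = q by construction, then fit the log-log slope.
    radii = np.quantile(pair_dists, quantiles)
    return np.polyfit(np.log(radii), np.log(quantiles), 1)[0]

def is_reducible(X):
    """Illustrative criterion (not the test REDUS itself uses): call the
    subspace spanned by X's columns reducible when its estimated intrinsic
    dimensionality is at least one full dimension below the feature count."""
    return correlation_dimension(X) <= X.shape[1] - 1
```

For instance, three features tracing a helix form a reducible subspace (a one-dimensional curve embedded in three features), whereas three independent noise features do not:

```python
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 4.0 * np.pi, 400)
helix = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])  # intrinsic dim ~ 1
noise = rng.normal(size=(400, 3))                         # intrinsic dim ~ 3
print(is_reducible(helix))  # expected: True
print(is_reducible(noise))  # expected: False
```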