Abstract
With the availability of large-scale computing platforms and instrumentation for data gathering, increased emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional datasets. In this paper, we present a novel algebraic technique based on a variant of semi-discrete matrix decomposition (SDD), which is capable of compressing large discrete-valued datasets in an error-bounded fashion. We show that this process of compression can be thought of as identifying dominant patterns in the underlying data. We derive efficient algorithms for computing dominant patterns, quantify their performance analytically as well as experimentally, and identify applications of these algorithms in problems ranging from clustering to vector quantization. We demonstrate the superior characteristics of our algorithm in terms of (i) scalability to extremely high dimensions; (ii) bounded error; and (iii) hierarchical nature, which enables multiresolution analysis. Detailed experimental results are provided to support these claims.
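To make the idea of a "dominant pattern" concrete, the following is a minimal sketch, not the algorithm of the paper. It assumes a binary matrix stored as a list of rows and approximates it by an outer product of two binary vectors: a presence vector `x` over rows and a pattern vector `y` over columns. The alternating update rule, the densest-row initialization, and the names `rank_one_binary` and `hamming_error` are all illustrative assumptions.

```python
def rank_one_binary(A, max_iters=20):
    """Alternating heuristic for a binary rank-one approximation A ~ x * y^T.

    Assumption: A is a non-empty list of equal-length rows of 0/1 values.
    This is an illustrative sketch, not the method derived in the paper.
    """
    m, n = len(A), len(A[0])
    # Hypothetical initialization: start from the densest row as the pattern.
    y = list(max(A, key=sum))
    x = [0] * m
    for _ in range(max_iters):
        size_y = sum(y)
        # Row i joins the pattern only if that strictly lowers its mismatch
        # count, i.e. 2 * (row . y) > |y|.
        new_x = [
            1 if 2 * sum(a * b for a, b in zip(row, y)) > size_y else 0
            for row in A
        ]
        size_x = sum(new_x)
        # Symmetric update: column j joins if 2 * (column_j . x) > |x|.
        new_y = [
            1 if 2 * sum(A[i][j] * new_x[i] for i in range(m)) > size_x else 0
            for j in range(n)
        ]
        if new_x == x and new_y == y:
            break  # fixed point reached
        x, y = new_x, new_y
    return x, y


def hamming_error(A, x, y):
    """Number of entries where A differs from the outer product x * y^T."""
    return sum(
        A[i][j] != x[i] * y[j]
        for i in range(len(A))
        for j in range(len(A[0]))
    )


# Three rows share the pattern {column 0, column 1}; the last row does not.
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
x, y = rank_one_binary(A)
print(x, y, hamming_error(A, x, y))  # [1, 1, 1, 0] [1, 1, 0, 0] 2
```

Here the 16 matrix entries are summarized by two short binary vectors at the cost of two mismatched entries, which illustrates the sense in which compression and pattern discovery coincide. A heuristic like this only reaches a local optimum, and the result depends on the initialization.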
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Koyutürk, M., Grama, A., Ramakrishnan, N. (2002). Algebraic Techniques for Analysis of Large Discrete-Valued Datasets. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_26
DOI: https://doi.org/10.1007/3-540-45681-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive