Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Koyutürk, Mehmet; Grama, Ananth; Ramakrishnan, Naren

doi:10.1007/3-540-45681-3_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2431)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

With the availability of large-scale computing platforms and instrumentation for data gathering, increased emphasis is being placed on efficient techniques for analyzing large and extremely high-dimensional datasets. In this paper, we present a novel algebraic technique based on a variant of semi-discrete matrix decomposition (SDD), which is capable of compressing large discrete-valued datasets in an error-bounded fashion. We show that this process of compression can be thought of as identifying dominant patterns in the underlying data. We derive efficient algorithms for computing dominant patterns, quantify their performance analytically as well as experimentally, and identify applications of these algorithms in problems ranging from clustering to vector quantization. We demonstrate the superior characteristics of our algorithm in terms of (i) scalability to extremely high dimensions; (ii) bounded error; and (iii) hierarchical nature, which enables multiresolution analysis. Detailed experimental results are provided to support these claims.
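
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what "identifying a dominant pattern" in a binary dataset could look like. It assumes a rank-one approximation of a 0/1 matrix by an outer product of two 0/1 vectors, computed with a simple alternating heuristic. The function names `rank_one_binary` and `hamming_error`, the initialization from the densest row, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
def rank_one_binary(rows, max_iter=20):
    """Approximate a 0/1 matrix by the outer product of 0/1 vectors x and y.

    y is the candidate pattern (one entry per column); x marks the rows
    that are represented by that pattern. Illustrative sketch only.
    """
    m, n = len(rows), len(rows[0])
    # Assumption: start from the densest row as the initial pattern.
    y = list(max(rows, key=sum))
    x = [0] * m
    for _ in range(max_iter):
        # Fix y, update x: setting x_i = 1 reduces the mismatch count of
        # row i exactly when 2 * (row_i . y) >= number of ones in y.
        ones_y = sum(y)
        new_x = [
            1 if ones_y > 0 and 2 * sum(a * b for a, b in zip(row, y)) >= ones_y else 0
            for row in rows
        ]
        # Fix x, update y: column j joins the pattern when at least half
        # of the selected rows have a 1 in that column.
        ones_x = sum(new_x)
        new_y = [
            1 if ones_x > 0
            and 2 * sum(rows[i][j] for i in range(m) if new_x[i]) >= ones_x
            else 0
            for j in range(n)
        ]
        if new_x == x and new_y == y:
            break  # neither vector changed, so the iteration has converged
        x, y = new_x, new_y
    return x, y


def hamming_error(rows, x, y):
    """Count the entries where the matrix and the outer product x y^T differ."""
    return sum(
        abs(rows[i][j] - x[i] * y[j])
        for i in range(len(rows))
        for j in range(len(rows[0]))
    )


# Toy dataset (hypothetical): three similar rows and one unrelated row.
rows = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
x, y = rank_one_binary(rows)
print(x, y, hamming_error(rows, x, y))
```

In this sketch, `y` plays the role of a dominant pattern and `x` selects the rows it summarizes. Applying the same step separately to the rows with `x[i] == 1` and those with `x[i] == 0` would give one possible hierarchical refinement, in the spirit of the multiresolution analysis the abstract mentions.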

Author information

Authors and Affiliations

Dept. of Computer Sciences, Purdue University, 47907, W. Lafayette, IN, USA
Mehmet Koyutürk & Ananth Grama
Dept. of Computer Science, Virginia Tech, 24061, Blacksburg, VA, USA
Naren Ramakrishnan

Editor information

Editors and Affiliations

Department of Computer Science, University of Helsinki, P.O. Box 26, 00014, Helsinki, Finland
Tapio Elomaa, Heikki Mannila & Hannu Toivonen

Cite this paper

Koyutürk, M., Grama, A., Ramakrishnan, N. (2002). Algebraic Techniques for Analysis of Large Discrete-Valued Datasets. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_26

DOI: https://doi.org/10.1007/3-540-45681-3_26
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive
