Using Loglinear Models to Compress Datacubes

Barbará, Daniel; Wu, Xintao

doi:10.1007/3-540-45151-X_30

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1846)

Abstract

A data cube is a popular organization for summary data. A cube is simply a multidimensional structure that contains in each cell an aggregate value, i.e., the result of applying an aggregate function to an underlying relation. In practical situations, cubes can require a large amount of storage, so compressing them is of practical importance. In this paper, we propose an approximation technique that reduces the storage cost of the cube at the price of getting approximate answers to the queries posed against it. The idea is to characterize regions of the cube with statistical models whose descriptions take less space than the data itself. The model parameters can then be used to estimate the cube cells with a certain level of accuracy. To increase the accuracy, and to guarantee the level of error in the query answers, some of the “outliers” (i.e., the cells that incur the largest errors when estimated) are retained. The model parameters and the retained cells should, of course, take only a fraction of the space of the full cube, and the estimation procedure should be faster than computing the data from the underlying relations. We use loglinear models to model the cube regions. Experiments show that the errors introduced in typical queries are small even when the description is substantially smaller than the full cube. The models also offer information about the underlying structure of the data they model. Moreover, these models are relatively easy to update dynamically as data is added to the warehouse.
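The idea in the abstract can be illustrated with a minimal sketch, under assumptions that are not taken from the paper: a two-dimensional region of counts, the simplest loglinear model (independence, i.e. main effects only, whose fitted value for a cell is row total × column total / grand total), and a fixed number of retained outliers. The function names `compress` and `estimate` and the parameter `keep` are hypothetical; the paper's own model selection and error guarantees may differ.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Model = Tuple[List[float], List[float], float, Dict[Cell, float]]


def compress(region: List[List[float]], keep: int) -> Model:
    """Fit an independence loglinear model to a 2-D region and retain
    the `keep` cells with the largest absolute estimation error.

    Assumption: the model is not refitted after outliers are set aside.
    """
    row_totals = [sum(row) for row in region]
    col_totals = [sum(col) for col in zip(*region)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("region must contain at least one nonzero cell")

    # Under independence, log m_ij = u + u_i + u_j, and the fitted
    # value is m_ij = row_total_i * col_total_j / total.
    errors = []
    for i, row in enumerate(region):
        for j, value in enumerate(row):
            fitted = row_totals[i] * col_totals[j] / total
            errors.append((abs(value - fitted), (i, j)))

    # Keep the exact values of the worst-estimated cells.
    errors.sort(reverse=True)
    outliers = {cell: region[cell[0]][cell[1]] for _, cell in errors[:keep]}
    return row_totals, col_totals, total, outliers


def estimate(model: Model, i: int, j: int) -> float:
    """Answer a cell query from the compressed description."""
    row_totals, col_totals, total, outliers = model
    if (i, j) in outliers:
        return outliers[(i, j)]
    return row_totals[i] * col_totals[j] / total


if __name__ == "__main__":
    region = [[10, 20, 30], [20, 40, 60], [30, 60, 500]]
    model = compress(region, keep=1)
    print(model[3])               # the retained outlier cell(s)
    print(estimate(model, 0, 0))  # approximate answer for cell (0, 0)
```

In this sketch an r × c region is described by r + c + 1 marginal values plus the retained cells instead of r × c cell values, which is where the space saving would come from when `keep` is small.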

This work has been supported by NSF grant IIS-9732113

Author information

Authors and Affiliations

ISE Dept. MSN 4A4, George Mason University, Fairfax, VA, 22030, USA
Daniel Barbará & Xintao Wu

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Hongjun Lu
Department of Computer Science, Fudan University, 220 Handan Road, Shanghai, China
Aoying Zhou

About this paper

Cite this paper

Barbará, D., Wu, X. (2000). Using Loglinear Models to Compress Datacubes. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_30

DOI: https://doi.org/10.1007/3-540-45151-X_30
Published: 07 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67627-0
Online ISBN: 978-3-540-45151-8
eBook Packages: Springer Book Archive
