Loglinear-Based Quasi Cubes

Barbará, Daniel; Wu, Xintao

doi:10.1023/A:1011224019249

Abstract

A data cube is a popular organization for summary data. A cube is simply a multidimensional structure that contains in each cell an aggregate value, i.e., the result of applying an aggregate function to an underlying relation. In practical situations, cubes can require a large amount of storage, so compressing them is of practical importance. In this paper, we propose an approximation technique that reduces the storage cost of the cube at the price of getting approximate answers for the queries posed against the cube. The idea is to characterize regions of the cube by using statistical models whose description takes less space than the data itself. The model parameters can then be used to estimate the cube cells with a certain level of accuracy. To increase the accuracy, and to guarantee the level of error in the query answers, some of the “outliers” (i.e., cells that incur the largest errors when estimated) are retained. The model parameters and the retained cells should, of course, take a fraction of the space of the full cube, and the estimation procedure should be faster than computing the data from the underlying relations. We use loglinear models to model the cube regions. Experiments show that the errors introduced in typical queries are small even when the description is substantially smaller than the full cube. Since cubes are used to support data analysis, and analysts are rarely interested in the precise values of the aggregates (but rather in trends), providing approximate answers is, in most cases, a satisfactory compromise. Although other techniques have been used to compress data cubes, ours combines parametric (loglinear) models with the retention of outliers, which enables the system to give data-independent error guarantees for every query posed on the data cube. The models also offer information about the underlying structure of the data they model. Moreover, these models are relatively easy to update dynamically as data is added to the warehouse.
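The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a two-dimensional region of non-negative counts with a positive total, uses the simplest loglinear model (the independence model, log m_ij = u + u_i + u_j, whose fitted values are row total times column total divided by the grand total), and keeps as outliers the cells whose relative error exceeds a tolerance. The function names `compress_region` and `estimate_cell` and the parameter `rel_tol` are hypothetical.

```python
import numpy as np


def compress_region(cells, rel_tol=0.1):
    """Summarize a 2-D cube region by an independence loglinear model.

    Assumption: `cells` holds non-negative counts with a positive total.
    Returns the marginals (the model description) plus the retained outliers.
    """
    cells = np.asarray(cells, dtype=float)
    row = cells.sum(axis=1)   # one number per row
    col = cells.sum(axis=0)   # one number per column
    total = cells.sum()

    # Fitted values of the independence model: m_ij = row_i * col_j / total.
    fitted = np.outer(row, col) / total

    # Relative error per cell; empty cells fall back to absolute error.
    denom = np.where(cells == 0, 1.0, cells)
    rel_err = np.abs(fitted - cells) / denom

    # Retain exact values for cells the model estimates too poorly.
    outliers = {
        (int(i), int(j)): float(cells[i, j])
        for i, j in zip(*np.where(rel_err > rel_tol))
    }
    return row, col, total, outliers


def estimate_cell(model, i, j):
    """Answer a cell query from the compressed description."""
    row, col, total, outliers = model
    if (i, j) in outliers:
        return outliers[(i, j)]          # retained cell: exact value
    return row[i] * col[j] / total       # otherwise: model estimate


region = [[10, 20], [30, 100]]
model = compress_region(region, rel_tol=0.1)
estimates = [[estimate_cell(model, i, j) for j in range(2)] for i in range(2)]
```

In this sketch the stored description is the row and column totals, the grand total, and the outlier cells, so it pays off only when the region is large and few cells are retained. Because every cell whose estimate misses the tolerance is stored exactly, each cell answer is within `rel_tol` of the true value regardless of the data, which is the kind of guarantee the abstract describes.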

Author information

Authors and Affiliations

ISE Department, George Mason University, MSN 4A4, Fairfax, VA, 22030, USA
Daniel Barbará & Xintao Wu

About this article

Cite this article

Barbará, D., Wu, X. Loglinear-Based Quasi Cubes. Journal of Intelligent Information Systems 16, 255–276 (2001). https://doi.org/10.1023/A:1011224019249

Issue Date: August 2001
