Querying Compressed Data in Data Warehouses

Information Technology and Management

Abstract

The large size of most data warehouses (typically hundreds of gigabytes to terabytes) results in non-trivial storage costs and makes compression techniques attractive. For the most part, page-level compression (as opposed to attribute-level or record-level schemes) has been shown to achieve the greatest reductions in storage size for databases. A key issue with such schemes is how to access the data quickly to answer queries, since individual tuple boundaries are lost. In this paper we introduce an approach that aims to maintain the benefits of page-level compression (i.e., large reductions in storage size) while improving query performance through an efficient signature file indexing scheme. The approach uses an attribute-level signature generation method that exploits the value distribution of each attribute in a data warehouse. We provide an extensive theoretical analysis in which we compare our approach with a recently proposed indexing technique, encoded bitmap indexing, along a number of important metrics, including query processing, insertion, and storage costs. Results show that our approach is preferred in many situations that are likely to occur in practice. We have also implemented a prototype system, which validates our analytical findings.
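
To make the idea of a signature file over compressed pages concrete, the following is a minimal sketch of generic superimposed coding, not the attribute-level, distribution-aware method the paper proposes. The signature width, the number of bits set per value, the hash choice, and all function names are assumptions made for illustration. Each page gets one bit-vector signature; a query value is hashed the same way, and only pages whose signature contains all of the query's bits need to be decompressed. Such a filter can return false positives but never misses a page that holds the value.

```python
import hashlib

SIG_BITS = 64        # assumed signature width in bits
BITS_PER_VALUE = 3   # assumed number of bits set per attribute value


def value_signature(attr: str, value: str) -> int:
    """Hash one (attribute, value) pair to a bit vector with a few bits set."""
    sig = 0
    for i in range(BITS_PER_VALUE):
        digest = hashlib.sha256(f"{attr}={value}#{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig


def page_signature(rows) -> int:
    """Superimpose (bitwise OR) the signatures of every value stored on a page."""
    sig = 0
    for row in rows:
        for attr, value in row.items():
            sig |= value_signature(attr, str(value))
    return sig


def candidate_pages(page_sigs, attr: str, value) -> list:
    """Return indexes of pages that may contain attr == value.

    A page qualifies when all query bits are set in its signature.
    False positives are possible; false negatives are not.
    """
    q = value_signature(attr, str(value))
    return [i for i, s in enumerate(page_sigs) if s & q == q]


# Hypothetical data: two "pages" of fact-table rows, indexed before compression.
pages = [
    [{"region": "EU", "year": 1999}, {"region": "EU", "year": 2000}],
    [{"region": "US", "year": 1999}],
]
page_sigs = [page_signature(p) for p in pages]
to_decompress = candidate_pages(page_sigs, "region", "EU")
```

Only the pages listed in `to_decompress` would have to be decompressed and scanned; the remaining pages are skipped without touching their compressed contents.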

Cite this article

Datta, A., Thomas, H. Querying Compressed Data in Data Warehouses. Information Technology and Management 3, 353–386 (2002). https://doi.org/10.1023/A:1019772807859


  • DOI: https://doi.org/10.1023/A:1019772807859
