Compression and aggregation of Bayesian estimates for data intensive computing

  • Regular Paper
Knowledge and Information Systems

Abstract

Bayesian estimation is a major and robust estimation method for many advanced statistical models. Because they can incorporate prior knowledge into statistical inference, Bayesian methods have been applied successfully in many fields, including business, computer science, economics, epidemiology, genetics, imaging, and political science. However, because of its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation (C&A) scheme that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming a large dataset is partitioned, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves substantial computing time because it processes each partition only once, which enables fast incremental updates and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme makes OLAP of Bayesian estimates in a data cube feasible. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically reduce time and space costs with almost no degradation of modeling accuracy.
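The compress-then-aggregate idea can be illustrated with a minimal sketch. It is not the scheme proposed in the paper, whose synopsis is not specified in this abstract. It assumes the simplest possible setting: Bernoulli observations with a Beta prior, where per-partition counts are sufficient statistics, so aggregation is exactly lossless rather than asymptotically lossless. The names `Synopsis`, `compress`, `aggregate`, and `posterior_mean` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Synopsis:
    """Hypothetical per-partition summary: counts of ones and zeros."""
    successes: int
    failures: int


def compress(partition: Sequence[int]) -> Synopsis:
    """Scan one partition of 0/1 observations once and keep only the counts."""
    ones = sum(partition)
    return Synopsis(ones, len(partition) - ones)


def aggregate(synopses: Iterable[Synopsis]) -> Synopsis:
    """Combine synopses without touching the raw data."""
    successes = failures = 0
    for syn in synopses:
        successes += syn.successes
        failures += syn.failures
    return Synopsis(successes, failures)


def posterior_mean(syn: Synopsis, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of the Bernoulli parameter under an assumed Beta(alpha, beta) prior."""
    return (alpha + syn.successes) / (alpha + beta + syn.successes + syn.failures)


# Toy data, invented for illustration: three partitions of one dataset.
partitions = [[1, 0, 1, 1], [0, 0, 1], [1, 1, 1, 0, 1]]

# Each partition is compressed independently (this step could run in parallel).
synopses = [compress(p) for p in partitions]

# Estimate from the aggregated synopses versus from the pooled raw data.
estimate = posterior_mean(aggregate(synopses))
full = posterior_mean(compress([x for p in partitions for x in p]))
```

In this conjugate case `estimate` and `full` coincide, and a newly arrived partition can be folded in by aggregating its synopsis with the existing one. For general models no such finite sufficient statistic exists, which is where an approximate synopsis and the asymptotic error bound described in the abstract become relevant.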

Author information

Corresponding author

Correspondence to Yixin Chen.

About this article

Cite this article

Xi, R., Lin, N., Chen, Y. et al. Compression and aggregation of Bayesian estimates for data intensive computing. Knowl Inf Syst 33, 191–212 (2012). https://doi.org/10.1007/s10115-011-0459-4

  • DOI: https://doi.org/10.1007/s10115-011-0459-4
