Compression and aggregation of Bayesian estimates for data intensive computing

Xi, Ruibin; Lin, Nan; Chen, Yixin; Kim, Youngjin

doi:10.1007/s10115-011-0459-4

Regular Paper
Published: 29 November 2011

Volume 33, pages 191–212, (2012)

Knowledge and Information Systems

Ruibin Xi¹, Nan Lin², Yixin Chen³ & Youngjin Kim⁴


Abstract

Bayesian estimation is a major and robust estimation approach for many advanced statistical models. Because they can incorporate prior knowledge into statistical inference, Bayesian methods have been successfully applied in many different fields, such as business, computer science, economics, epidemiology, genetics, imaging, and political science. However, due to its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation (C&A) scheme that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming a large dataset has been partitioned, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves tremendous computing time because it processes each partition only once, which enables fast incremental updates and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme makes OLAP of Bayesian estimates in a data cube feasible. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically reduce time and space costs with almost no degradation of the modeling accuracy.
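
The compress-then-aggregate idea in the abstract can be illustrated with a minimal sketch. This is not the paper's C&A scheme, whose details are not given on this page; it is the simplest conjugate case (a Bernoulli likelihood with a Beta prior), chosen as an assumption because there the synopsis is a sufficient statistic and aggregation is exactly lossless. The names `Synopsis`, `compress`, `aggregate`, and `posterior_mean` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Synopsis:
    """Hypothetical per-partition summary: all that is kept after compression."""
    n: int          # number of observations in the partition
    successes: int  # number of 1s in the partition


def compress(partition: Sequence[int]) -> Synopsis:
    """Scan one partition of 0/1 observations once and keep only its counts."""
    return Synopsis(n=len(partition), successes=sum(partition))


def aggregate(synopses: Iterable[Synopsis]) -> Synopsis:
    """Combine synopses without touching the raw data (counts simply add)."""
    n = 0
    successes = 0
    for syn in synopses:
        n += syn.n
        successes += syn.successes
    return Synopsis(n=n, successes=successes)


def posterior_mean(syn: Synopsis, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of the success probability under a Beta(alpha, beta) prior."""
    return (alpha + syn.successes) / (alpha + beta + syn.n)


# Illustrative data only: three partitions of a larger 0/1 dataset.
partitions = [[1, 0, 1], [1, 1, 0, 0], [0]]
synopses = [compress(p) for p in partitions]   # could run in parallel
overall = aggregate(synopses)
print(posterior_mean(overall))                 # 0.5

# Incremental update: a new partition arrives; old raw data is not needed.
updated = aggregate([overall, compress([1, 1])])
print(updated)
```

In this conjugate case the aggregated estimate equals the estimate computed from the full dataset. For models without such sufficient statistics, the abstract's claim is weaker and asymptotic: the aggregated estimator's error is bounded and approaches zero as the data size increases.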

Author information

Authors and Affiliations

Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Ruibin Xi
Department of Mathematics, Washington University, St. Louis, MO, USA
Nan Lin
Department of Computer Science, Washington University, St. Louis, MO, USA
Yixin Chen
Google Inc., Mountain View, CA, USA
Youngjin Kim

Corresponding author

Correspondence to Yixin Chen.

About this article

Cite this article

Xi, R., Lin, N., Chen, Y. et al. Compression and aggregation of Bayesian estimates for data intensive computing. Knowl Inf Syst 33, 191–212 (2012). https://doi.org/10.1007/s10115-011-0459-4

Received: 06 April 2010
Revised: 03 October 2011
Accepted: 15 November 2011
Published: 29 November 2011
Issue Date: October 2012
DOI: https://doi.org/10.1007/s10115-011-0459-4