Abstract
Bayesian estimation is a widely used and robust estimation method for many advanced statistical models. Because they can incorporate prior knowledge into statistical inference, Bayesian methods have been successfully applied in many different fields such as business, computer science, economics, epidemiology, genetics, imaging, and political science. However, due to its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation (C&A) scheme that enables distributed, parallel, or incremental computation of Bayesian estimates. Given a large dataset that has been partitioned, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves substantial computing time because it processes each partition only once, which enables fast incremental updates and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme makes OLAP of Bayesian estimates in a data cube feasible. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically reduce time and space costs with almost no degradation of the modeling accuracy.
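The compress-then-aggregate idea can be illustrated with a toy sketch. This is not the paper's C&A scheme, which targets general models and is only asymptotically lossless. The sketch assumes a Bernoulli likelihood with a conjugate Beta prior, a case where two counts per partition are sufficient statistics, so aggregation happens to be exact. The names `Synopsis`, `compress`, `aggregate`, and `posterior_mean`, as well as the sample data, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Synopsis:
    """Compressed summary of one partition (hypothetical name)."""
    successes: int
    trials: int


def compress(partition: Sequence[int]) -> Synopsis:
    """Scan a partition of 0/1 observations once and keep only two counts."""
    return Synopsis(successes=sum(partition), trials=len(partition))


def aggregate(synopses: Iterable[Synopsis]) -> Synopsis:
    """Combine synopses without touching the raw data."""
    total_successes = 0
    total_trials = 0
    for s in synopses:
        total_successes += s.successes
        total_trials += s.trials
    return Synopsis(successes=total_successes, trials=total_trials)


def posterior_mean(s: Synopsis, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of the success probability under a Beta(alpha, beta) prior."""
    return (alpha + s.successes) / (alpha + beta + s.trials)


# Assumed toy data: three partitions of binary observations.
partitions = [[1, 0, 1, 1], [0, 0, 1], [1, 1, 1, 0, 1]]

# Each partition is compressed independently, so this step could run in parallel.
synopses = [compress(p) for p in partitions]
combined = aggregate(synopses)
estimate_from_synopses = posterior_mean(combined)

# Reference value computed from the pooled raw data, for comparison only.
pooled = [x for p in partitions for x in p]
estimate_from_raw = posterior_mean(compress(pooled))
```

Because `aggregate` only adds counts, a newly arrived partition can be folded in by compressing it and aggregating it with the existing combined synopsis, which mirrors the incremental-update property described above. For non-conjugate models no such exact summary generally exists, which is the setting the abstract's asymptotic guarantee addresses.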
Cite this article
Xi, R., Lin, N., Chen, Y. et al. Compression and aggregation of Bayesian estimates for data intensive computing. Knowl Inf Syst 33, 191–212 (2012). https://doi.org/10.1007/s10115-011-0459-4
DOI: https://doi.org/10.1007/s10115-011-0459-4