In-database batch and query-time inference over probabilistic graphical models using UDA–GIST

Li, Kun; Zhou, Xiaofeng; Wang, Daisy Zhe; Grant, Christan; Dobra, Alin; Dudley, Christopher

doi:10.1007/s00778-016-0446-1

Regular Paper
Published: 02 November 2016

The VLDB Journal, Volume 26, pages 177–201 (2017)

Kun Li¹,
Xiaofeng Zhou ORCID: orcid.org/0000-0002-5908-5710¹,
Daisy Zhe Wang¹,
Christan Grant²,
Alin Dobra¹ &
Christopher Dudley¹

Abstract

To meet customers’ pressing demands, enterprise database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes use user-defined aggregates (UDAs), a data-driven operator, to implement analytical techniques in parallel. However, UDAs alone are not sufficient to implement statistical algorithms in which most of the work is performed by iterative transitions over a large state that cannot be naively partitioned because of data dependencies. Typically, this type of statistical algorithm requires pre-processing to set up the large state and post-processing after the statistical inference. This paper presents general iterative state transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA–GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile batch applications, cross-document coreference and image denoising, and one query-time inference application, marginal inference queries over probabilistic knowledge graphs. All three applications use probabilistic graphical models, which encode complex relationships among variables and are powerful for a wide range of problems. We show that the in-database framework allows us to tackle a 27 times larger problem than a scalable distributed solution for the first application and achieves a 43 times speedup over the state of the art for the second application. For the third application, we implement query-time inference using the UDA–GIST framework and apply it to a probabilistic knowledge graph, achieving a 10 times speedup over sequential inference. To the best of our knowledge, this is the first in-database query-time inference engine over a large probabilistic knowledge base. We show that the UDA–GIST framework for data- and graph-parallel computations can support both batch and query-time inference efficiently in databases.
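
The three-stage flow described in the abstract (a UDA builds the state, GIST iterates transitions until convergence, a final UDA extracts results) can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the class and function names (`BuildGraphUDA`, `gist`, `extract_uda`, `run`) are hypothetical and are not the paper's API, the transition is a simple minimum-label propagation rather than probabilistic inference, and the transition loop runs sequentially where the paper's operator runs in parallel.

```python
from collections import defaultdict


class BuildGraphUDA:
    """Hypothetical UDA: folds edge tuples into one adjacency-list state."""

    def __init__(self):
        self.adj = defaultdict(set)

    def accumulate(self, row):
        # Called once per input tuple (u, v); edges are treated as undirected.
        u, v = row
        self.adj[u].add(v)
        self.adj[v].add(u)

    def merge(self, other):
        # Combines partial states built independently on different partitions.
        for node, nbrs in other.adj.items():
            self.adj[node] |= nbrs

    def finalize(self):
        return {node: set(nbrs) for node, nbrs in self.adj.items()}


def gist(adj, max_rounds=100):
    """Hypothetical GIST stand-in: rounds of transitions until nothing changes.

    Each transition sets a node's label to the smallest label among itself
    and its neighbours. This is a sequential toy, not the paper's operator.
    """
    labels = {node: node for node in adj}
    for rounds in range(1, max_rounds + 1):
        changed = False
        for node in sorted(adj):
            best = min([labels[node]] + [labels[n] for n in adj[node]])
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:  # converged: a full round made no update
            return labels, rounds
    return labels, max_rounds


def extract_uda(labels):
    """Hypothetical final UDA: groups nodes by their converged label."""
    clusters = defaultdict(list)
    for node, label in sorted(labels.items()):
        clusters[label].append(node)
    return sorted(clusters.values())


def run(partitions):
    # Data-parallel phase: one partial UDA per partition, then merge.
    state = BuildGraphUDA()
    for part in partitions:
        partial = BuildGraphUDA()
        for row in part:
            partial.accumulate(row)
        state.merge(partial)
    # State-parallel phase (sequential here), then result extraction.
    labels, rounds = gist(state.finalize())
    return extract_uda(labels), rounds
```

For example, `run([[(1, 2), (2, 3)], [(4, 5)], [(3, 6)]])` returns the clusters `[[1, 2, 3, 6], [4, 5]]` together with the number of rounds executed. The point of the sketch is the division of labour: the first stage is driven by input tuples and can be split across partitions, while the second stage is driven by a shared state whose updates depend on one another.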

Acknowledgments

This work was partially supported by NSF IIS Award No. 1526753, DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM), and a generous gift from Google.

Author information

Authors and Affiliations

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, 32611, USA
Kun Li, Xiaofeng Zhou, Daisy Zhe Wang, Alin Dobra & Christopher Dudley
School of Computer Science, University of Oklahoma, Norman, OK, 73019, USA
Christan Grant

Corresponding author

Correspondence to Xiaofeng Zhou.

Additional information

Kun Li and Xiaofeng Zhou contributed equally to this paper.

About this article

Cite this article

Li, K., Zhou, X., Wang, D.Z. et al. In-database batch and query-time inference over probabilistic graphical models using UDA–GIST. The VLDB Journal 26, 177–201 (2017). https://doi.org/10.1007/s00778-016-0446-1

Received: 15 December 2015
Revised: 22 September 2016
Accepted: 20 October 2016
Published: 02 November 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s00778-016-0446-1
