Abstract
Prominent sources of Big Data include technological and social trends such as mobile computing, blogging, and social networking. The means to analyse such data are becoming more accessible with the development of business models like cloud computing, open source, and crowdsourcing. But such data have characteristics that pose challenges to traditional database systems. Because the data are produced in an uncontrolled manner, much of the content is free text, often in informal natural language, leading to computing environments with high levels of uncertainty and error. In this talk I will offer a vision of a database system that aims to facilitate the development of modern data-centric applications by naturally unifying key functionalities of databases, text analytics, machine learning, and artificial intelligence. I will also describe my past research towards this vision through extensions of Datalog, a well-studied rule-based programming paradigm that features an inherent integration with the database and has robust declarative semantics. These extensions allow for incorporating information extraction from text, and for specifying statistical models by probabilistic programming.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kimelfeld, B. (2015). Extending Datalog Intelligence. In: ten Cate, B., Mileo, A. (eds) Web Reasoning and Rule Systems. RR 2015. Lecture Notes in Computer Science, vol 9209. Springer, Cham. https://doi.org/10.1007/978-3-319-22002-4_1
DOI: https://doi.org/10.1007/978-3-319-22002-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22001-7
Online ISBN: 978-3-319-22002-4
eBook Packages: Computer Science (R0)