ABSTRACT
Data in enterprises resides largely under relational schemas. Annotating such schemas with a knowledge graph (KG) that represents knowledge of the domain is useful for semantic understanding of the data as well as for downstream processing by machines and humans. Existing approaches annotate only individual tables using small and simple KGs, and fail to generalize to KG entities and relationships unseen at test time. We propose a probabilistic model that generates complex relational schemas (tables, the grouping of tables into neighborhoods, foreign key connections between tables, and the fields associated with tables) by traversing paths in a knowledge graph. An efficient two-pass inference algorithm based on inverting this model jointly annotates schema elements such as fields, tables and neighborhoods with entities, and the associations between schema elements with relational paths in the KG. The algorithm also generalizes to unseen paths at test time. Experiments on a real-world schema and domain knowledge graph, in addition to benchmark datasets, show that the proposed approach significantly outperforms existing approaches while demonstrating better scalability.
Index Terms
- Semantic Annotation of Relational Schemas Using a Probabilistic Generative Model