DOI: 10.1145/3632410.3632414
research-article

Semantic Annotation of Relational Schemas Using a Probabilistic Generative Model

Published: 04 January 2024

ABSTRACT

Data in enterprises resides largely under relational schemas. Annotating such schemas with a knowledge graph (KG) that represents knowledge of the domain is useful for semantic understanding of the data as well as for downstream processing by machines and humans. Existing approaches annotate only individual tables using small and simple KGs, and fail to generalize to KG entities and relationships that are unseen at test time. We propose a probabilistic model that generates complex relational schemas (tables, groupings of tables into neighborhoods, foreign key connections between tables, and fields associated with tables) by traversing paths in a knowledge graph. An efficient two-pass inference algorithm based on inverting this model jointly annotates schema elements such as fields, tables and neighborhoods with entities, and annotates the associations between schema elements with relational paths in the KG. The algorithm also generalizes to unseen paths at test time. Experiments on a real-world schema and domain knowledge graph, in addition to benchmark datasets, show that the proposed approach significantly outperforms existing approaches while demonstrating better scalability.
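To make the idea of annotating an association between schema elements with a relational path in the KG more concrete, here is a minimal sketch. It is not the paper's probabilistic model or its two-pass inference algorithm; it only enumerates candidate relation paths between two entities in a toy KG, the kind of candidates that an annotator for a foreign key connection might then score. The triples, the entity and relation names, the function `relational_paths`, and the `max_hops` limit are all hypothetical and chosen for illustration.

```python
from collections import deque

# Hypothetical toy knowledge graph: (head entity, relation, tail entity) triples.
KG_TRIPLES = [
    ("Customer", "places", "Order"),
    ("Order", "contains", "Product"),
    ("Customer", "livesAt", "Address"),
    ("Order", "shipsTo", "Address"),
]


def relational_paths(triples, source, target, max_hops=3):
    """Enumerate simple directed relation paths from source to target.

    Uses breadth-first search, so shorter paths are returned first.
    Each path is a list of relation names; no entity is visited twice.
    """
    out_edges = {}
    for head, rel, tail in triples:
        out_edges.setdefault(head, []).append((rel, tail))

    paths = []
    queue = deque([(source, [], {source})])
    while queue:
        node, rels, seen = queue.popleft()
        if node == target and rels:
            paths.append(rels)
            continue
        if len(rels) == max_hops:
            continue
        for rel, nxt in out_edges.get(node, []):
            if nxt not in seen:
                queue.append((nxt, rels + [rel], seen | {nxt}))
    return paths


# Candidate annotations for a hypothetical foreign key linking a table
# annotated with "Customer" to a table annotated with "Address".
candidate_paths = relational_paths(KG_TRIPLES, "Customer", "Address")
```

In this toy setting the candidates are a direct one-hop path and a two-hop path through `Order`. Choosing among such candidates, and handling paths never seen during training, is what the proposed model is described as doing; the sketch leaves that part out entirely.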

    • Published in: CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD), January 2024, 627 pages

      Copyright © 2024 ACM

      Publisher: Association for Computing Machinery, New York, NY, United States

      Qualifiers: research-article, Research, Refereed limited