Abstract
We introduce a novel approach for linking table columns to types in an ontology unseen during training. As the target ontology is unknown to the model during training, this may be considered a zero-shot linking task at the ontological level. This task is often a requirement for businesses that wish to semantically enrich their tabular data with types from their custom or industry-specific ontologies without the benefit of initial supervision. In this paper, we describe specific approaches and provide datasets for this new task: training models on open domain tables using a broad source ontology and evaluating them on increasingly difficult tables with target ontologies having different levels of type granularity. We use pre-trained Transformer encoder models and a range of encoding strategies to explore methods of encoding increasing amounts of ontological knowledge, such as type glossaries and taxonomies, to obtain better zero-shot performance. We demonstrate these results empirically through extensive experiments on three new public benchmark datasets.
References
Abdelmageed, N., Schindler, S.: Jentab meets semtab 2021’s new challenges. In: SemTab@ISWC, pp. 42–53 (2021)
Abdelmageed, N., Schindler, S., König-Ries, B.: BiodivTab: a tabular benchmark based on biodiversity research data. In: SemTab@ISWC, submitted (2021)
Baazouzi, W., Kachroudi, M., Faiz, S.: Kepler-asi at semtab 2021. In: SemTab@ISWC, pp. 54–67 (2021)
Bhagavatula, C.S., Noraset, T., Downey, D.: TabEL: entity linking in web tables. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 425–441. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25007-6_25
Bogatu, A., Fernandes, A.A.A., Paton, N.W., Konstantinou, N.: Dataset discovery in data lakes. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 709–720 (2020)
Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.: Colnet: embedding the semantics of web tables for column type prediction. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, 27 January–1 February 2019, pp. 29–36. AAAI Press (2019). https://doi.org/10.1609/aaai.v33i01.330129
Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.: Learning semantic annotations for tabular data. In: Kraus, S. (ed.) Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019, pp. 2088–2094. ijcai.org (2019). https://doi.org/10.24963/ijcai.2019/289
Chen, Y., et al.: An empirical study on multiple information sources for zero-shot fine-grained entity typing. In: Moens, M., Huang, X., Specia, L., Yih, S.W. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event/Punta Cana, Dominican Republic, 7–11 November, 2021, pp. 2668–2678. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.210
Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough tables: carefully evaluating entity linking for tabular data. In: Pan, J.Z., et al. (eds.) ISWC 2020. LNCS, vol. 12507, pp. 328–343. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62466-8_21
Dash, S., Bagchi, S., Mihindukulasooriya, N., Gliozzo, A.: Permutation invariant strategy using transformer encoders for table understanding. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 788–800. Association for Computational Linguistics, Seattle (2022). https://doi.org/10.18653/v1/2022.findings-naacl.59. https://aclanthology.org/2022.findings-naacl.59
Deng, X., Sun, H., Lees, A., Wu, Y., Yu, C.: TURL: table understanding through representation learning. Proc. VLDB Endow. 14(3), 307–319 (2020). https://doi.org/10.5555/3430915.3442430. http://www.vldb.org/pvldb/vol14/p307-deng.pdf
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, vol. 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Teh, Y.W., Titterington, D.M. (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. JMLR Proceedings, vol. 9, pp. 249–256. JMLR.org (2010). http://proceedings.mlr.press/v9/glorot10a.html
Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)
Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.: TaPas: weakly supervised table parsing via pre-training. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4320–4333. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main.398. https://aclanthology.org/2020.acl-main.398
Hu, K., et al.: Viznet: towards a large-scale visualization learning and benchmarking repository. In: Proceedings of the 2019 Conference on Human Factors in Computing Systems (CHI). ACM (2019)
Hulsebos, M., et al.: Sherlock: a deep learning approach to semantic data type detection. In: Teredesai, A., Kumar, V., Li, Y., Rosales, R., Terzi, E., Karypis, G. (eds.) Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, 4–8 August 2019, pp. 1500–1508. ACM (2019). https://doi.org/10.1145/3292500.3330993
Huynh, V.P., et al.: Dagobah: table and graph contexts for efficient semantic annotation of tabular data. In: SemTab@ISWC, pp. 19–31 (2021)
Iida, H., Thai, D., Manjunatha, V., Iyyer, M.: TABBIE: pretrained representations of tabular data. In: Toutanova, K., et al (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, 6–11 June 2021, pp. 3446–3456. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.naacl-main.270
Jiao, X., et al.: Tinybert: distilling BERT for natural language understanding. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, Findings of ACL, 16–20 November 2020, vol. EMNLP 2020, pp. 4163–4174. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.372
Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: SemTab 2019: resources to benchmark tabular data to knowledge graph matching systems. In: Harth, A., et al. (eds.) ESWC 2020. LNCS, vol. 12123, pp. 514–530. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49461-2_30
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. CoRR abs/1702.08734 (2017). http://arxiv.org/abs/1702.08734
McCray, A.T.: An upper-level ontology for the biomedical domain. Comput. Funct. Genomics 4, 80–84 (2003)
Morris, C., Ritzert, M., Fey, M., Hamilton, W.L., Lenssen, J.E., Rattan, G., Grohe, M.: Weisfeiler and leman go neural: Higher-order graph neural networks. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, USA, 27 January–1 February 2019, pp. 4602–4609. AAAI Press (2019). https://doi.org/10.1609/aaai.v33i01.33014602
Mulwad, V., Finin, T., Syed, Z., Joshi, A.: Using linked data to interpret tables. In: Hartig, O., Harth, A., Sequeda, J.F. (eds.) Proceedings of the First International Workshop on Consuming Linked Data, Shanghai, China, 8 November 2010, CEUR Workshop Proceedings, vol. 665. CEUR-WS.org (2010). http://ceur-ws.org/Vol-665/MulwadEtAl_COLD2010.pdf
Nguyen, P., Yamada, I., Kertkeidkachorn, N., Ichise, R., Takeda, H.: Semtab 2021: Tabular data annotation with mtab tool. In: Jiménez-Ruiz, E., et al. (eds.) Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, 27 October 2021, CEUR Workshop Proceedings, vol. 3103, pp. 92–101. CEUR-WS.org (2021). http://ceur-ws.org/Vol-3103/paper8.pdf
Obeidat, R., Fern, X., Shahbazi, H., Tadepalli, P.: Description-based zero-shot fine-grained entity typing. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 807–814 (2019)
Ren, Y., Lin, J., Zhou, J.: Neural zero-shot fine-grained entity typing. In: Seghrouchni, A.E.F., Sukthankar, G., Liu, T., van Steen, M. (eds.) Companion of The 2020 Web Conference 2020, Taipei, Taiwan, 20–24 April 2020, pp. 846–847. ACM/IW3C2 (2020). https://doi.org/10.1145/3366424.3382725
Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to dbpedia. In: Akerkar, R., Dikaiakos, M.D., Achilleos, A., Omitola, T. (eds.) Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, WIMS 2015, Larnaca, Cyprus, 13–15 July 2015, pp. 10:1–10:6. ACM (2015)
Suhara, Y., et al.: Annotating columns with pre-trained language models. arXiv preprint arXiv:2104.01785 (2021)
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018, Conference Track Proceedings. OpenReview.net (2018). https://openreview.net/forum?id=rJXMpikCZ
Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, Ç., Tan, W.: Sato: contextual semantic type detection in tables. Proc. VLDB Endow. 13(11), 1835–1848 (2020). http://www.vldb.org/pvldb/vol13/p1835-zhang.pdf
Zhang, S., Balog, K.: Web table extraction, retrieval, and augmentation: a survey. ACM Trans. Intell. Syst. Technol. 11(2), 13:1–13:35 (2020). https://doi.org/10.1145/3372117
Zhang, T., Xia, C., Lu, C.T., Philip, S.Y.: Mzet: memory augmented zero-shot fine-grained named entity typing. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 77–87 (2020)
A Appendix
A.1 Model Predictions
The tables below show example predictions from our proposed model, which is built on a pretrained TinyBERT encoder and trained using Wikidata labels. For the top two tables, the model predicts types from the DBpedia target ontology. For the bottom two tables, it predicts types from the UMLS Semantic Network (UMLS SN).
In the block titled Top model prediction, the first row shows predictions made using type labels only, the second row shows predictions made using type labels and their associated glossaries, and the final row shows predictions made using our proposed encoding strategy. Note that the BioDivTab benchmark does not contain table metadata.
![figure a](http://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-031-47240-4_27/MediaObjects/552750_1_En_27_Figa_HTML.png)
![figure b](http://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-031-47240-4_27/MediaObjects/552750_1_En_27_Figb_HTML.png)
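To make the difference between the "type labels only" and "type labels and glossaries" settings concrete, the following is a minimal sketch, not the model described in this paper. It assumes a toy bag-of-words vector in place of the TinyBERT encoder, and the function names (`serialize_type`, `embed`, `cosine`, `rank_types`), the two target types, their glossary strings, and the column values are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def serialize_type(label, glossary=None):
    """Build the text that represents one target type (hypothetical format)."""
    return label if glossary is None else f"{label} . {glossary}"


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a Transformer encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_types(column_text, type_texts):
    """Score every candidate type against the column and sort best-first."""
    column_vec = embed(column_text)
    scored = [(name, cosine(column_vec, embed(text)))
              for name, text in type_texts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical column: a header followed by a few cell values.
column = "stream name Nile Amazon Danube"

# Hypothetical target types with illustrative glossary strings.
glossaries = {
    "River": "a large natural stream of water flowing to the sea",
    "City": "a large human settlement",
}

labels_only = {name: serialize_type(name) for name in glossaries}
with_glossary = {name: serialize_type(name, text)
                 for name, text in glossaries.items()}

labels_only_ranking = rank_types(column, labels_only)
glossary_ranking = rank_types(column, with_glossary)
print(labels_only_ranking)
print(glossary_ranking)
```

In this toy setup the labels alone share no tokens with the column, so both candidate types score zero, while the glossary text gives "River" a non-zero score through the shared word "stream". A learned encoder would capture far more than exact token overlap. The sketch only illustrates why adding descriptive text about unseen types gives a matcher more to work with.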
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dash, S., Bagchi, S., Mihindukulasooriya, N., Gliozzo, A. (2023). Linking Tabular Columns to Unseen Ontologies. In: Payne, T.R., et al. The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol 14265. Springer, Cham. https://doi.org/10.1007/978-3-031-47240-4_27
DOI: https://doi.org/10.1007/978-3-031-47240-4_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47239-8
Online ISBN: 978-3-031-47240-4
eBook Packages: Computer Science, Computer Science (R0)