Abstract
Table retrieval is the task of answering a search query with a ranked list of tables that are considered relevant to that query. Computing table similarity is a critical part of this process. Current Transformer-based language models have been successfully used to obtain word embedding representations of tables in order to calculate their semantic similarity. Unfortunately, obtaining word embedding representations of large tables with thousands or millions of rows can be computationally expensive. This work hypothesizes that much of the content of a table can be deleted (i.e., rows can be dropped) without significantly affecting its word embedding representation, thus maintaining system performance at a much lower computational cost. To test this hypothesis, a study was carried out using two different datasets and three state-of-the-art language models. The results reveal that, in large tables, keeping just 10% of the content produces a word embedding representation that is 90% similar to the original one.
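The row-deletion experiment described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: `embed_text` is a toy hashed bag-of-tokens function standing in for a Transformer language model, a table is embedded as the mean of its row vectors, rows are dropped uniformly at random, and similarity is measured with cosine similarity. The function names and the `DIM` constant are hypothetical.

```python
import math
import random
import zlib

DIM = 64  # size of the toy embedding space (assumption, not from the paper)


def embed_text(text, dim=DIM):
    """Toy stand-in for a language model: hashed bag-of-tokens vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is used instead of hash() so results are stable across runs
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def embed_table(rows, dim=DIM):
    """Embed a table as the mean of its row embeddings (assumed pooling)."""
    total = [0.0] * dim
    for row in rows:
        row_vec = embed_text(" ".join(str(cell) for cell in row), dim)
        total = [t + r for t, r in zip(total, row_vec)]
    return [t / len(rows) for t in total]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_rows(rows, keep_fraction, seed=0):
    """Keep a uniformly random subset of rows (always at least one row)."""
    rng = random.Random(seed)
    keep = max(1, round(len(rows) * keep_fraction))
    return rng.sample(rows, keep)


def similarity_after_deletion(rows, keep_fraction, seed=0):
    """Compare the full-table embedding with the reduced-table embedding."""
    reduced = drop_rows(rows, keep_fraction, seed)
    return cosine_similarity(embed_table(rows), embed_table(reduced))


if __name__ == "__main__":
    # Synthetic table with repetitive content, purely for demonstration.
    table = [[f"city{i % 7}", f"district{i % 13}", i % 5] for i in range(1000)]
    for fraction in (0.5, 0.1, 0.01):
        print(fraction, similarity_after_deletion(table, fraction))
```

In a real setting, `embed_text` would be replaced by a call to an actual language model, and the measured similarity would depend on the model, the pooling strategy, and the data; the toy numbers printed here say nothing about the paper's reported results.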
Notes
Similar results were obtained on the Chicago dataset and are not included here due to space limitations.
Acknowledgements
This research has been partially funded through the project “Desarrollo de un ecosistema de datos abiertos para transformar el sector turístico” (GVA-COVID19/2021/103), financed by the “Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital de la Generalitat Valenciana”.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pilaluisa, J., Tomás, D. (2023). The Impact of Content Deletion on Tabular Data Similarity Using Contextual Word Embeddings. In: García Bringas, P., et al. 17th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2022). SOCO 2022. Lecture Notes in Networks and Systems, vol 531. Springer, Cham. https://doi.org/10.1007/978-3-031-18050-7_24
DOI: https://doi.org/10.1007/978-3-031-18050-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18049-1
Online ISBN: 978-3-031-18050-7
eBook Packages: Intelligent Technologies and Robotics (R0)