The Impact of Content Deletion on Tabular Data Similarity Using Contextual Word Embeddings

Pilaluisa, José; Tomás, David

doi:10.1007/978-3-031-18050-7_24

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 531)

Included in the following conference series:

International Workshop on Soft Computing Models in Industrial and Environmental Applications

Abstract

Table retrieval is the task of answering a search query with a ranked list of tables considered relevant to that query. Computing table similarity is a critical part of this process. Transformer-based language models have been used successfully to obtain word embedding representations of tables, from which their semantic similarity can be calculated. However, obtaining word embedding representations of large tables with thousands or millions of rows can be computationally expensive. This work hypothesizes that much of the content of a table can be deleted (i.e., rows can be dropped) without significantly affecting its word embedding representation, thus maintaining system performance at a much lower computational cost. To test this hypothesis, a study was carried out using two different datasets and three state-of-the-art language models. The results reveal that, in large tables, keeping just 10% of the content produces a word embedding representation that is 90% similar to the original one.
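The row-dropping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the encoder below is a hashed bag-of-tokens stand-in chosen so the example runs without downloading a model, the mean pooling over rows is an assumed design choice, and the function names (`embed_text`, `embed_table`, `drop_rows`, `make_toy_table`) and the synthetic table are hypothetical. In practice a Transformer sentence encoder would take the place of `embed_text`.

```python
import hashlib
import math
import random


def embed_text(text, dim=64):
    """Stand-in encoder: hashed bag-of-tokens vector (NOT a Transformer model)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # md5 keeps bucket assignment stable across Python processes.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def embed_table(rows, dim=64):
    """Represent a table as the mean of its row embeddings (assumed pooling)."""
    total = [0.0] * dim
    for row in rows:
        row_vec = embed_text(" ".join(str(cell) for cell in row), dim)
        total = [t + r for t, r in zip(total, row_vec)]
    return [t / len(rows) for t in total]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_rows(rows, keep_fraction, seed=0):
    """Keep a uniform random sample of rows; at least one row is retained."""
    keep = max(1, round(len(rows) * keep_fraction))
    return random.Random(seed).sample(rows, keep)


def make_toy_table(n_rows=1000, seed=42):
    """Build a synthetic table (hypothetical data, for illustration only)."""
    rng = random.Random(seed)
    cities = ["chicago", "quito", "alicante", "bilbao", "seville"]
    kinds = ["park", "library", "school", "clinic"]
    return [
        (rng.choice(cities), rng.choice(kinds), rng.randint(0, 99))
        for _ in range(n_rows)
    ]


table = make_toy_table()
full = embed_table(table)
for fraction in (0.5, 0.1, 0.01):
    reduced = embed_table(drop_rows(table, fraction))
    print(f"keep {fraction:.0%}: cosine similarity = {cosine(full, reduced):.3f}")
```

The similarity values printed by this toy setup say nothing about the figures reported in the abstract; they only show the shape of the experiment: embed the full table, embed a reduced copy, and compare the two vectors.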

Notes

1. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
2. https://data.cityofchicago.org/
3. Similar results were obtained on the Chicago dataset and are not included here due to space limitations.

Acknowledgements

This research has been partially funded by project “Desarrollo de un ecosistema de datos abiertos para transformar el sector turístico” (GVA-COVID19/2021/103) funded by “Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital de la Generalitat Valenciana”.

Author information

Authors and Affiliations

Faculty of Engineering, Physical Sciences and Mathematics, Central University of Ecuador, Avenida Universitaria, Quito, 170129, Ecuador
José Pilaluisa
Department of Software and Computing Systems, University of Alicante, Carretera San Vicente del Raspeig s/n, 03690, San Vicente del Raspeig, Spain
David Tomás

Corresponding author

Correspondence to David Tomás.

Editor information

Editors and Affiliations

Faculty of Engineering, University of Deusto, Bilbao, Spain
Pablo García Bringas
University of León, León, Spain
Hilde Pérez García
Mechanical Engineering Department, University of La Rioja, Logroño, La Rioja, Spain
Francisco Javier Martinez-de-Pison
Inteligencia Artificial, University of Oviedo, A Coruña, La Coruña, Spain
José Ramón Villar Flecha
Data Science and Big Data Lab, Pablo de Olavide University, Sevilla, Spain
Alicia Troncoso Lora
University of Oviedo, Oviedo, Spain
Enrique A. de la Cal
Department of Civil Engineering, University of Burgos, Burgos, Spain
Álvaro Herrero
School of Engineering, Pablo de Olavide University, Seville, Spain
Francisco Martínez Álvarez
DIGIP, University of Bergamo, Dalmine, Bergamo, Italy
Giuseppe Psaila
Department of Industrial Engineering, University of A Coruña, Ferrol, Spain
Héctor Quintián
Department of Computing Science, University of Salamanca, Salamanca, Spain
Emilio S. Corchado Rodriguez

About this paper

Cite this paper

Pilaluisa, J., Tomás, D. (2023). The Impact of Content Deletion on Tabular Data Similarity Using Contextual Word Embeddings. In: García Bringas, P., et al. 17th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2022). SOCO 2022. Lecture Notes in Networks and Systems, vol 531. Springer, Cham. https://doi.org/10.1007/978-3-031-18050-7_24

DOI: https://doi.org/10.1007/978-3-031-18050-7_24
Published: 12 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18049-1
Online ISBN: 978-3-031-18050-7
eBook Packages: Intelligent Technologies and Robotics (R0)
