Abstract
Taxonomies group synonymous terms together into concepts, which are arranged hierarchically through “broader than” semantic relations. However, creating and maintaining taxonomies is labour-intensive, especially when they reach a scale of hundreds of thousands or millions of terms. Here, we present an automated solution that supports taxonomy editors in identifying synonymous terms in scientific literature by leveraging semantic search techniques. Our method first encodes all taxonomy terms or phrases using a pre-trained BERT-based model. Subsequently, we employ FAISS vector search to efficiently discover synonyms for each term. We evaluate the method by comparing the terms it considers synonymous to a manually curated taxonomy that consists of more than 770,000 terms. By integrating state-of-the-art NLP and search methodologies, our approach offers a practical and efficient solution that achieves up to 0.79 precision and 0.25 recall for synonym discovery. This automation scales to large taxonomies and can be used at runtime in large taxonomy-driven document retrieval systems.
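The pipeline described in the abstract (embed every term, retrieve nearest neighbours, score against a curated taxonomy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the hand-made 3-dimensional vectors stand in for BERT-based embeddings, the brute-force cosine search stands in for a FAISS index, and the function names, the similarity threshold, the value of k, and the pairwise definition of precision and recall are hypothetical rather than taken from the paper.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def nearest_synonyms(embeddings, k, threshold):
    """Brute-force nearest-neighbour search (a stand-in for a vector index).

    Returns, for every term, up to k other terms whose cosine similarity
    is at least `threshold`, ordered from most to least similar.
    """
    unit = {term: normalize(vec) for term, vec in embeddings.items()}
    result = {}
    for query, q_vec in unit.items():
        scored = []
        for term, t_vec in unit.items():
            if term == query:
                continue
            similarity = sum(a * b for a, b in zip(q_vec, t_vec))
            if similarity >= threshold:
                scored.append((similarity, term))
        scored.sort(reverse=True)
        result[query] = [term for _, term in scored[:k]]
    return result

def pairwise_precision_recall(predicted, gold_concepts):
    """Compare predicted synonym pairs with pairs implied by curated concepts."""
    predicted_pairs = {
        frozenset((query, term))
        for query, terms in predicted.items()
        for term in terms
    }
    gold_pairs = {
        frozenset((a, b))
        for concept in gold_concepts
        for a in concept
        for b in concept
        if a != b
    }
    true_positives = len(predicted_pairs & gold_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

# Toy embeddings (hypothetical values, not produced by any real model).
toy_embeddings = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarction": [0.85, 0.15, 0.05],
    "cardiac arrest": [0.6, 0.6, 0.1],
    "aspirin": [0.0, 0.1, 0.95],
    "acetylsalicylic acid": [0.05, 0.9, 0.3],
}

# Toy curated taxonomy: each set is one concept containing synonymous terms.
toy_gold = [
    {"heart attack", "myocardial infarction"},
    {"aspirin", "acetylsalicylic acid"},
    {"cardiac arrest"},
]

neighbours = nearest_synonyms(toy_embeddings, k=2, threshold=0.95)
precision, recall = pairwise_precision_recall(neighbours, toy_gold)
print(neighbours["heart attack"], precision, recall)
```

With the strict threshold in this toy setup, only the "heart attack" / "myocardial infarction" pair is retrieved, so every prediction is correct but one of the two curated synonym pairs is missed. Lowering the threshold admits pairs involving "cardiac arrest" and reduces precision. This precision-for-recall trade-off is a property of the toy data and should not be read as an explanation of the figures reported in the abstract.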
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Moradi Fard, M., Thorne, C., Sorolla Bayod, P., Akhondi, S., Vlietstra, W. (2024). Automated Synonym Discovery for Taxonomy Maintenance Using Semantic Search Techniques. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14763. Springer, Cham. https://doi.org/10.1007/978-3-031-70242-6_33
DOI: https://doi.org/10.1007/978-3-031-70242-6_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70241-9
Online ISBN: 978-3-031-70242-6
eBook Packages: Computer Science, Computer Science (R0)