Automated Synonym Discovery for Taxonomy Maintenance Using Semantic Search Techniques

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2024)

Abstract

Taxonomies group synonymous terms into concepts, which are arranged hierarchically through “broader than” semantic relations. However, creating and maintaining taxonomies is labour-intensive, especially once they reach hundreds of thousands or millions of terms. Here, we present an automated method that supports taxonomy editors in identifying synonymous terms in scientific literature by leveraging semantic search techniques. Our method first encodes all taxonomy terms or phrases using a pre-trained BERT-based model. We then use FAISS vector search to efficiently discover synonyms for each term. We evaluate the method by comparing the terms it considers synonymous against a manually curated taxonomy of more than 770,000 terms. By integrating state-of-the-art NLP and search methodologies, our approach offers a practical and efficient solution that achieves up to 0.79 precision and 0.25 recall for synonym discovery. The approach scales to large taxonomies and can be used at runtime in large taxonomy-driven document retrieval systems.
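The pipeline described in the abstract (encode every term, search for nearest neighbours, score against a curated taxonomy) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy character-trigram encoder stands in for the pre-trained BERT-based model, an exhaustive cosine search stands in for FAISS, and the function names, the similarity threshold, the neighbour count `k` and the example terms are all hypothetical.

```python
import math
from collections import Counter


def embed(term, n=3):
    """Toy stand-in for a BERT-based encoder: a bag of character n-grams."""
    padded = f"  {term.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def discover_synonyms(terms, threshold=0.5, k=2):
    """Exhaustive nearest-neighbour search (stand-in for a FAISS index).

    For each term, keep at most k other terms whose similarity to it
    reaches the (assumed) threshold.
    """
    vectors = {t: embed(t) for t in terms}
    result = {}
    for t in terms:
        scored = sorted(
            ((cosine(vectors[t], vectors[o]), o) for o in terms if o != t),
            reverse=True,
        )
        result[t] = [o for score, o in scored[:k] if score >= threshold]
    return result


def precision_recall(predicted, gold_pairs):
    """Score predicted synonym pairs against pairs from a curated taxonomy."""
    pred = {frozenset((t, s)) for t, syns in predicted.items() for s in syns}
    gold = {frozenset(p) for p in gold_pairs}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical miniature taxonomy; the gold pairs play the curated reference.
terms = ["heart attack", "heart attacks", "myocardial infarction", "aspirin"]
gold = [
    ("heart attack", "heart attacks"),
    ("heart attack", "myocardial infarction"),
]

predicted = discover_synonyms(terms)
precision, recall = precision_recall(predicted, gold)
print(predicted)
print(precision, recall)
```

On this toy data the trigram encoder pairs "heart attack" with "heart attacks" but misses "myocardial infarction", which shares no surface form with it. Pairs of that kind are where an encoder that captures meaning rather than spelling would be expected to matter.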

Notes

  1. https://www.w3.org/2004/02/skos/

Author information

Corresponding author

Correspondence to Paula Sorolla Bayod.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Moradi Fard, M., Thorne, C., Sorolla Bayod, P., Akhondi, S., Vlietstra, W. (2024). Automated Synonym Discovery for Taxonomy Maintenance Using Semantic Search Techniques. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14763. Springer, Cham. https://doi.org/10.1007/978-3-031-70242-6_33

  • DOI: https://doi.org/10.1007/978-3-031-70242-6_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70241-9

  • Online ISBN: 978-3-031-70242-6

  • eBook Packages: Computer Science (R0)
