Performance Improvement of Semantic Search Using Sentence Embeddings by Dimensionality Reduction

  • Conference paper
Advanced Information Networking and Applications (AINA 2024)

Abstract

Semantic search retrieves sentences whose meaning is highly similar to that of a query, so a user can find the desired sentences even without knowing the appropriate keywords for a lexical search. It also handles synonyms and spelling variations appropriately. We previously reported a semantic search method for Japanese sentences based on sentence embeddings that correctly processes queries in which sentences are combined with the logical operators AND, OR, and NOT. Reducing the dimensionality of sentence embeddings is expected to make semantic search more robust to noise in the embeddings, which should improve search accuracy, and to speed up the search computation. In this study, we experimentally verified the improvement in semantic search performance obtained by reducing the dimensionality of sentence embeddings generated by Japanese SimCSE. We also evaluated the runtimes for generating sentence embeddings and for reducing their dimensionality with principal component analysis (PCA).
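
The idea of reducing embedding dimensionality with PCA before a similarity search can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are synthetic random vectors standing in for Japanese SimCSE outputs, the sizes (200 sentences, 768 dimensions reduced to 64) are illustrative assumptions, and the helper names `fit_pca`, `project`, and `cosine_top_k` are hypothetical. Cosine similarity is assumed as the ranking score.

```python
import numpy as np


def fit_pca(X, k):
    """Fit PCA on the rows of X; return the mean and the top-k principal axes."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def project(X, mean, components):
    """Map vectors into the k-dimensional PCA space."""
    return (X - mean) @ components.T


def cosine_top_k(query_vec, doc_vecs, top_k=3):
    """Return indices and scores of the top_k documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Synthetic stand-in for sentence embeddings (assumption, not real model output):
# a low-rank "meaning" signal plus small isotropic noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 16)) @ rng.normal(size=(16, 768))
docs = signal + 0.1 * rng.normal(size=(200, 768))
query = docs[7] + 0.1 * rng.normal(size=768)  # noisy copy of sentence 7

# Fit PCA on the document embeddings, then project documents and query alike.
mean, comps = fit_pca(docs, k=64)
docs_small = project(docs, mean, comps)
query_small = project(query[None, :], mean, comps)[0]

idx_full, _ = cosine_top_k(query, docs)
idx_small, _ = cosine_top_k(query_small, docs_small)
print(docs_small.shape, idx_full[0], idx_small[0])
```

In this toy setting the reduced vectors are 12 times smaller, so each similarity computation touches proportionally fewer values. Whether accuracy improves on real embeddings depends on the data and the chosen number of dimensions, which is the question the paper evaluates experimentally.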

Author information

Corresponding author

Correspondence to Minoru Uehara.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tsumuraya, K., Uehara, M., Adachi, Y. (2024). Performance Improvement of Semantic Search Using Sentence Embeddings by Dimensionality Reduction. In: Barolli, L. (ed.) Advanced Information Networking and Applications. AINA 2024. Lecture Notes on Data Engineering and Communications Technologies, vol 201. Springer, Cham. https://doi.org/10.1007/978-3-031-57870-0_11
