Abstract
Semantic search retrieves sentences whose meaning is highly similar to that of a query, so a user can find the desired sentences even when they cannot think of appropriate keywords for a lexical search. It also handles synonyms and spelling variations appropriately. We previously reported a semantic search method for Japanese sentences based on sentence embeddings that correctly processes queries in which sentences are combined with the logical operators AND, OR, and NOT. Reducing the dimensionality of sentence embeddings is expected to make semantic search more robust to noise in the embeddings, which should improve search accuracy and speed up the search computation. In this study, we experimentally verified that reducing the dimensionality of sentence embeddings generated by Japanese SimCSE improves semantic search performance. We also evaluated the runtimes for generating sentence embeddings and for reducing their dimensionality with PCA.
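The pipeline outlined in the abstract (embed sentences, reduce dimensionality with PCA, rank by similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes random vectors as stand-ins for Japanese SimCSE embeddings, a 768-dimensional input (typical of BERT-base models), a target of 64 dimensions, cosine similarity as the ranking score, and a plain NumPy SVD in place of whatever PCA library the study used. The function names fit_pca, project, and cosine_search are hypothetical.

```python
import numpy as np


def fit_pca(embeddings, n_components):
    """Fit PCA via SVD; return the mean vector and the top principal axes."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are orthonormal principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(embeddings, mean, components):
    """Map embeddings into the reduced space spanned by the components."""
    return (embeddings - mean) @ components.T


def cosine_search(query_vec, corpus_vecs, top_k=3):
    """Return indices and scores of the top_k corpus rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Assumption: random data standing in for real sentence embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(200, 768))
# A query that is a slightly perturbed copy of corpus sentence 42.
query = corpus[42] + 0.1 * rng.normal(size=768)

# Fit PCA on the corpus, then apply the same projection to the query.
mean, components = fit_pca(corpus, n_components=64)
corpus_low = project(corpus, mean, components)
query_low = project(query[None, :], mean, components)[0]

top_idx, top_scores = cosine_search(query_low, corpus_low)
print(top_idx, top_scores)
```

The query must be projected with the mean and components fitted on the corpus, otherwise the two sets of vectors live in different spaces. The similarity computation then runs over 64 values per sentence instead of 768, which is where the expected speed-up comes from.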
References
Lewis, D.D., Jones, K.S.: Natural language processing for information retrieval. Commun. ACM 39(1), 92–101 (1996)
Namazu Project. Namazu: a Full-Text Search Engine. http://www.namazu.org/index.html.en. Accessed 21 June 2023
Groonga Project. Groonga. https://groonga.org/. Accessed 21 June 2023
Hugging Face. Using Sentence Transformers for semantic search. https://huggingface.co/spaces/sentence-transformers/embeddings-semantic-search. Accessed 21 June 2023
Elastic. Accelerate time to insight with Elasticsearch and AI. https://www.elastic.co/. Accessed 21 June 2023
Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. arXiv:2104.08821v4 (2022)
Tsukagoshi, H., Sasano, R., Takeda, K.: Japanese SimCSE technical report. arXiv:2310.19349v1 (2023)
Tsukagoshi, H.: Japanese Simple-SimCSE. https://github.com/hppRC/simple-simcse-ja. Accessed 15 Dec 2023
Tsumuraya, K., Uehara, M., Adachi, Y.: Semantic search of Japanese sentences using distributed representations. In: CANDAR2023 GCA (2023)
Tohoku NLP Group. BERT base Japanese (unidic-lite with whole word masking, CC-100 and jawiki-20230102). https://huggingface.co/cl-tohoku/bert-base-japanese-v3. Accessed 15 Dec 2023
Yoshikoshi, T., Kawahara, D., Kurohashi, S.: Multilingualization of a natural language inference dataset using machine translation. SIG Technical reports, vol. 2020-NL-244, no. 6 (2020). (in Japanese)
Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2015)
Stanford NLP Group. The Stanford Natural Language Inference (SNLI) Corpus. https://nlp.stanford.edu/projects/snli/. Accessed 1 June 2022
Jolliffe, I.T.: Principal Component Analysis. Springer Series in Statistics. Springer, New York (2002). https://doi.org/10.1007/b98835, ISBN 978-0-387-95442-4
Raschka, S., Patterson, J., Nolet, C.: Machine learning in Python: main developments and technology trends in data science, machine learning, and artificial intelligence. arXiv:2002.04803 (2020)
cuML - GPU Machine Learning Algorithms. https://github.com/rapidsai/cuml. Accessed 31 May 2023
Google Colaboratory. https://colab.research.google.com/. Accessed 31 May 2023
NII, NTCIR (NII Testbeds and Community for Information access Research) project. Accessed 5 Sept 2023
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tsumuraya, K., Uehara, M., Adachi, Y. (2024). Performance Improvement of Semantic Search Using Sentence Embeddings by Dimensionality Reduction. In: Barolli, L. (eds) Advanced Information Networking and Applications. AINA 2024. Lecture Notes on Data Engineering and Communications Technologies, vol 201. Springer, Cham. https://doi.org/10.1007/978-3-031-57870-0_11
DOI: https://doi.org/10.1007/978-3-031-57870-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57869-4
Online ISBN: 978-3-031-57870-0
eBook Packages: Intelligent Technologies and Robotics (R0)