Abstract
This paper presents a comparative analysis of embedding-search methods for insurance documents. The evaluation covers several SentenceTransformers models integrated within LangChain, and additionally assesses text-embedding-ada-002, Vicuna-13B, and a fine-tuned variant of Vicuna-13B within the same pipeline. To broaden the evaluation, we also investigate a custom HuggingFace pipeline that compares embeddings at the token level. Our findings show that text-embedding-ada-002 delivers the most favorable results overall, while among open-source alternatives the SentenceTransformers model all-MiniLM-L12-v2 outperforms the others. To our knowledge, there is currently no published research addressing embedding-based retrieval on German insurance documents, which underscores the relevance of this study in this niche domain.
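All of the compared approaches ultimately rank documents by the similarity of their embeddings to a query embedding. The sketch below illustrates that shared retrieval step with cosine similarity; the toy three-dimensional vectors are placeholders standing in for the output of an embedding model such as all-MiniLM-L12-v2 or text-embedding-ada-002, and the function names are illustrative, not taken from the paper's code.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec: np.ndarray, doc_vecs: list) -> list:
    """Return document indices sorted by descending similarity to the query."""
    scores = [cosine_sim(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy embeddings; a real pipeline would obtain these from an embedding model.
query = np.array([1.0, 0.2, 0.0])
docs = [
    np.array([0.9, 0.1, 0.1]),   # semantically close to the query
    np.array([0.0, 1.0, 0.0]),   # unrelated
    np.array([-1.0, 0.0, 0.2]),  # opposing direction
]
ranking = rank_documents(query, docs)  # most similar document first
```

In a full pipeline, the same ranking logic applies regardless of which model produced the vectors; only the embedding function changes between the evaluated approaches.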
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Aperdannier, R., Koeppel, M., Unger, T., Schacht, S., Barkur, S.K. (2024). Systematic Evaluation of Different Approaches on Embedding Search. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 920. Springer, Cham. https://doi.org/10.1007/978-3-031-53963-3_36
Print ISBN: 978-3-031-53962-6
Online ISBN: 978-3-031-53963-3
eBook Packages: Intelligent Technologies and Robotics