
Systematic Evaluation of Different Approaches on Embedding Search

  • Conference paper
Advances in Information and Communication (FICC 2024)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 920)


Abstract

This paper presents a comparative analysis of embedding-search methods on insurance documents. The evaluation focuses on several SentenceTransformers models integrated within LangChain. We further assess the performance of text-embedding-ada-002, Vicuna-13B, and a fine-tuned variant of Vicuna-13B within the same pipeline. To broaden the evaluation, we also investigate a custom HuggingFace pipeline that compares the generated embeddings at the token level. Our findings show that the text-embedding-ada-002 model yields the most favorable results, while among open-source alternatives the SentenceTransformers model all-MiniLM-L12-v2 outperforms the other models. To our knowledge, there is currently no published research addressing embedding-based retrieval on German insurance documents, underscoring the unique relevance of this study in this niche domain.
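The retrieval step shared by the compared approaches can be sketched as: embed the query and the document chunks, then rank chunks by cosine similarity. Below is a minimal illustrative sketch using NumPy with toy vectors; in the paper's pipelines the vectors would instead come from an embedding model such as all-MiniLM-L12-v2 or text-embedding-ada-002, and the function names here are our own, not from the paper.

```python
import numpy as np

def cosine_sim(query_vec, chunk_vecs):
    # Cosine similarity between one query vector and each row of a matrix.
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return m @ q

def search(query_vec, chunk_vecs, top_k=3):
    # Rank document chunks by similarity to the query, highest first.
    scores = cosine_sim(query_vec, chunk_vecs)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy 4-dimensional embeddings standing in for model output.
chunks = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])
idx, scores = search(query, chunks, top_k=2)
```

The same ranking logic applies regardless of which model produces the vectors, which is what makes the paper's model-for-model comparison within a fixed pipeline possible.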



Author information

Corresponding author

Correspondence to Sigurd Schacht.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Aperdannier, R., Koeppel, M., Unger, T., Schacht, S., Barkur, S.K. (2024). Systematic Evaluation of Different Approaches on Embedding Search. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 920. Springer, Cham. https://doi.org/10.1007/978-3-031-53963-3_36

