Vector Database Management Techniques and Systems

Authors:

Guoliang Li

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

Pages 597 - 604

https://doi.org/10.1145/3626246.3654691

Published: 09 June 2024

Abstract

Feature vectors are now mission-critical for many applications, including retrieval-based large language models (LLMs). Traditional database management systems are not equipped to deal with the unique characteristics of feature vectors, such as the vague notion of semantic similarity, large size of vectors, expensive similarity comparisons, lack of indexable structure, and difficulty of answering "hybrid" queries that combine structured attributes with feature vectors. A number of vector database management systems (VDBMSs) have been developed to address these challenges, combining novel techniques for query processing, storage and indexing, and query optimization and execution, and culminating in a spectrum of performance and accuracy characteristics and capabilities. In this tutorial, we review the existing vector database management techniques and systems. For query processing, we review similarity score design and selection, vector query types, and vector query interfaces. For storage and indexing, we review various indexes and discuss compression as well as disk-resident indexes. For query optimization and execution, we review hybrid query processing, hardware acceleration, and distributed search. We then review existing systems, search engines and libraries, and benchmarks. Finally, we present research challenges and open problems.
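
To make the notion of a "hybrid" query concrete, the following is a minimal sketch, not taken from the tutorial, of a brute-force top-k search that first filters records by a structured attribute and then ranks the survivors by cosine similarity. The function names (cosine_similarity, hybrid_top_k), the record layout, and the choice of cosine similarity with pre-filtering are all assumptions made for illustration; real VDBMSs typically rely on approximate indexes rather than a linear scan.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical record layout: (structured attributes, feature vector).
Record = Tuple[Dict[str, object], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_top_k(
    records: List[Record],
    query: Sequence[float],
    k: int,
    predicate: Callable[[Dict[str, object]], bool],
) -> List[Tuple[float, Dict[str, object]]]:
    """Pre-filter on attributes, then rank the remaining vectors by similarity."""
    scored = [
        (cosine_similarity(vector, query), attrs)
        for attrs, vector in records
        if predicate(attrs)  # structured part of the hybrid query
    ]
    # Highest similarity first; sort on the score only, never on the dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    catalog: List[Record] = [
        ({"id": 1, "category": "shoes"}, [1.0, 0.0]),
        ({"id": 2, "category": "shoes"}, [0.6, 0.8]),
        ({"id": 3, "category": "hats"}, [0.0, 1.0]),
    ]
    # Nearest "shoes" item to the query vector.
    print(hybrid_top_k(catalog, [0.0, 1.0], 1, lambda a: a["category"] == "shoes"))
```

The linear scan costs one similarity computation per record that passes the filter, which is exactly the expense that the indexing, compression, and hybrid query processing techniques surveyed in the tutorial aim to reduce.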

Index Terms

Vector Database Management Techniques and Systems
1. Information systems
  1. Data management systems

Published In

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

June 2024

694 pages

ISBN: 9798400704222

DOI: 10.1145/3626246

General Chairs: Pablo Barcelo (Universidad Catolica, Chile), Nayat Sanchez-Pi (INRIA Chile)
Program Chairs: Alexandra Meliou (University of Massachusetts Amherst, USA), S. Sudarshan (Indian Institute of Technology Bombay)

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9 - 15, 2024

Santiago AA, Chile

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 7
Total Downloads: 2,052
Downloads (Last 12 months): 2,052
Downloads (Last 6 weeks): 348

Reflects downloads up to 01 Mar 2025

Cited By

Choi H, Jeong J (2025). Domain-Specific Manufacturing Analytics Framework: An Integrated Architecture with Retrieval-Augmented Generation and Ollama-Based Models for Manufacturing Execution Systems Environments. Processes 13:3 (670). Online publication date: 27-Feb-2025.
https://doi.org/10.3390/pr13030670
Gou Y, Gao J, Xu Y, Long C (2025). SymphonyQG: Towards Symphonious Integration of Quantization and Graph for Approximate Nearest Neighbor Search. Proceedings of the ACM on Management of Data 3:1 (1-26). Online publication date: 11-Feb-2025.
https://dl.acm.org/doi/10.1145/3709730
Zhu Z, Fan Z, Zeng Y, Shi Y, Xu Y, Zhou M, Dong J (2024). FedSQ: A Secure System for Federated Vector Similarity Queries. Proceedings of the VLDB Endowment 17:12 (4441-4444). Online publication date: 8-Nov-2024.
https://doi.org/10.14778/3685800.3685895
Chen C, Jin C, Zhang Y, Podolsky S, Wu C, Wang S, Hanson E, Sun Z, Walzer R, Wang J (2024). SingleStore-V: An Integrated Vector Database System in SingleStore. Proceedings of the VLDB Endowment 17:12 (3772-3785). Online publication date: 8-Nov-2024.
https://dl.acm.org/doi/10.14778/3685800.3685805
Hu Y, Chen Z, Lin Y, Wang J, Liu Y, Lin W, Zhang L, Guo M (2024). MEMOS: Multimodal Educational Mentor and Optimisation System Based on Multi-Agent. 2024 Artificial Intelligence x Humanities, Education, and Art (AIxHEART) (58-63). Online publication date: 30-Sep-2024.
https://doi.org/10.1109/AIxHeart62327.2024.00018
Tripathi R, Singh P, Singh S (2024). Revisiting Cuckoo Hashing: re-addressing the challenges of Cuckoo Hashing. International Journal of Information Technology 17:1 (495-512). Online publication date: 20-Nov-2024.
https://doi.org/10.1007/s41870-024-02274-2
Akik E, Vještica M, Dimitrieski V, Kordić S, Ristić S (2024). Towards a Model-Driven Approach to Enable Uniform Access to Vector Databases. New Trends in Database and Information Systems (225-237). Online publication date: 14-Nov-2024.
https://doi.org/10.1007/978-3-031-70421-5_19
