Learning Balanced Tree Indexes for Large-Scale Vector Retrieval

ABSTRACT
Vector retrieval focuses on finding the k-nearest neighbors among a set of data points, and is widely used in diverse areas such as information retrieval and recommender systems. The current state-of-the-art methods, represented by HNSW, usually generate indexes with a large memory footprint, restricting the scale of data they can handle unless they resort to a hybrid index backed by external storage. Learned space-partitioning indexes, which occupy only a small amount of memory, have made great progress in recent years. However, these methods rely on a large amount of labeled data for supervised learning, so model complexity strongly affects their generalization.
To this end, we propose a lightweight learnable hierarchical space-partitioning index based on a balanced K-ary tree, called BAlanced Tree Learner (BATL), where each bucket of data points is represented by a path from the root to the corresponding leaf. Instead of mapping each query directly into a bucket, BATL classifies it into a sequence of branches (i.e., a path), which drastically reduces the number of classes and potentially improves generalization. BATL updates the classifier and the balanced tree in an alternating way. When updating the classifier, we innovatively leverage the sequence-to-sequence learning paradigm to learn to route each query to the ground-truth leaf of the balanced tree. Retrieval is then reduced to a sequence (i.e., path) generation task, which can be simply achieved by beam search over the encoder-decoder. When updating the balanced tree, we apply the classifier to navigate each data point through the tree nodes layer by layer under balance constraints. We finally evaluate BATL on several large-scale vector datasets, where the experimental results show the superiority of the proposed method over the state-of-the-art baselines in the tradeoff among latency, accuracy, and memory cost.
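To make the retrieval procedure concrete, the following is a minimal sketch of beam-search routing over a balanced K-ary tree. It assumes a scoring function `score_branch` that, given a query and a partial path, scores each candidate child branch; in BATL this score would come from the seq2seq decoder, but here it is a hypothetical stub for illustration only.

```python
import heapq

def beam_search_route(query, score_branch, depth, k, beam_width):
    """Route a query toward leaf buckets of a balanced k-ary tree of the
    given depth via beam search. score_branch(query, path, branch) returns
    a log-probability-like score for extending the partial path with that
    child branch. Returns the top beam_width (score, path) pairs, best first."""
    beams = [(0.0, ())]  # (cumulative score, partial path from the root)
    for _ in range(depth):
        candidates = []
        for score, path in beams:
            for branch in range(k):
                candidates.append(
                    (score + score_branch(query, path, branch),
                     path + (branch,)))
        # Keep only the best beam_width partial paths at this layer.
        beams = heapq.nlargest(beam_width, candidates)
    return beams

# Toy stand-in for the learned decoder: prefer the branch equal to
# (query + current depth) mod k, so routing is deterministic here.
def toy_scorer(query, path, branch):
    return 0.0 if branch == (query + len(path)) % 2 else -1.0

best = beam_search_route(1, toy_scorer, depth=3, k=2, beam_width=2)
```

Because each beam step only expands `beam_width * k` candidates, the cost of routing grows with the tree depth rather than with the total number of leaves, which is the point of classifying queries into paths instead of buckets.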