Abstract
The rapid growth of AI/ML workloads has outpaced the capabilities of CPU-centric architectures to deliver the required data throughput and compute efficiency. This paper introduces a GPU-centric architecture leveraging GPUDirect Storage (GDS) to transfer data directly from SSDs to GPU memory, bypassing CPU bottlenecks and enabling high-throughput data paths. We propose Embedding from Storage Pipelined Network (ESPN) and its extension, ESPN-LIVE, which employ optimizations like data prefetching and on-demand embedding generation to align storage latency with GPU throughput. Experiments show ESPN reduces query latency by up to \(3.9\times\), cuts memory usage by up to \(16\times\), and improves throughput by up to 68%. ESPN-LIVE eliminates the need to store multi-vector embeddings by dynamically computing document representations, reducing storage costs by up to \(16\times\), and making it particularly effective for single-query systems. These results highlight the potential of SSD-GPU integration for scalable, high-performance AI/ML workloads in information retrieval and LLM applications.
1 Introduction
Executing machine learning (ML) models for predictions and inferences has become a critical workload across diverse domains, such as information retrieval, natural language processing (NLP), image recognition, and recommendation systems. These ML services are typically hosted on high-performance servers equipped with multiple CPU cores and specialized accelerators, such as GPUs and TPUs, which are essential for efficiently handling inference tasks [1, 2].
Large Language Models (LLM), which are central to most AI systems today [3,4,5,6,7], often have large memory footprints, as shown in Table 1. Given the limited memory capacity of GPUs and the high memory demands of modern information processing systems, it is essential to optimize memory hierarchy utilization to address these demands while maintaining high-throughput performance. For computations performed on the GPU, datasets stored in secondary storage, such as SSDs, must be transferred to the GPU for processing. Typically, this data transfer is initiated and managed by the CPU, with the dataset first moved from secondary storage to main memory and then from main memory to the GPU. This approach is inefficient and fails to fully leverage the capabilities of the memory hierarchy in heterogeneous systems.
In information retrieval systems, this issue becomes even more pronounced with large embedding tables required for retrieval and re-ranking [8, 9]. Document embeddings, which need to be fetched from storage devices for similarity computation with query vectors, vary based on the query. Consequently, embeddings must be repeatedly retrieved for every new query. If the main memory is involved in every retrieval, valuable CPU cycles are consumed, and latency is introduced during the transfer.
One potential solution is to make the GPU the central command unit for compute-intensive workloads. As shown in Fig. 1, heterogeneous systems featuring both CPUs and GPUs exhibit a distinct memory hierarchy: GPU SRAM is the fastest memory, followed by GPU HBM, CPU DRAM (main memory), and secondary storage such as NVMe SSDs. Each level down the hierarchy offers slower speeds but reduced costs. For instance, SSDs are significantly cheaper and can be configured under different RAID setups to maximize system bandwidth.
GPUDirect technology enables GPUs to transfer data directly from SSDs to GPU HBM with minimal CPU involvement [10]. This direct data path reduces latency by eliminating the intermediate step of copying data from SSDs to main memory [11]. Furthermore, it increases throughput when utilizing multiple SSDs and frees the CPU for other tasks.
In this paper, we propose a novel memory organization and management strategy for heterogeneous systems with compute-intensive workloads, termed the GPU-centric architecture. Our contributions include:
1. Proposing an efficient system design for heterogeneous architectures with a GPU-centric memory organization.
2. Highlighting the impact of direct SSD access on modern workloads by presenting a case study of an information retrieval system based on ESPN [12], demonstrating memory efficiency and reduced latency.
2 Background
Over the past decade, the computational capacity of GPUs for machine learning, measured in FLOPs, has increased nearly 30-fold due to advancements in parallel processing architectures and high-bandwidth memory systems [13]. These developments have shifted compute-intensive tasks from CPUs to GPUs [14, 15]. However, in many modern machine learning (ML) workflows, data are still loaded from storage to the CPU before being transferred to the GPU for computation, creating inefficiencies in the data transfer pipeline [14, 16]. This workflow creates significant bottlenecks, primarily due to the reliance on PCIe links for both storage-to-CPU and CPU-to-GPU data transfers. Moreover, the peak sequential bandwidth of modern SSDs, roughly 7 GB/s, is an order of magnitude lower than the maximum bandwidth supported by PCIe, further compounding the problem.
Even with PCIe Gen 5, which offers a theoretical peak bandwidth of 32 GB/s, the transfer rate remains two orders of magnitude slower than the GPU’s computational throughput. Table 2 highlights the memory capacity and bandwidth of various popular GPUs. The memory capacity of modern GPUs is still substantially lower than the demands of many applications, and the memory bandwidth significantly exceeds the limitations of the connecting bus.
This mismatch often leads to suboptimal GPU utilization, a problem that can be exacerbated in multi-GPU systems. Additionally, the CPU is heavily burdened during the data transfers, as it must moderate data movement between components, leaving little opportunity for it to perform other meaningful computations. This inefficiency underscores the need for innovative solutions to optimize data flow and resource utilization in GPU-centric workloads.
Although memory mapping (mmap) allows files to be directly integrated into a process’s virtual address space, it comes with inherent inefficiencies. In scenarios where data are accessed randomly—as is typical in ML workloads—the frequent page faults and associated OS-level management can significantly slow down data transfer, thereby exacerbating the bottlenecks already present in PCIe-based communication [18]. To address this workload challenge, GPUDirect was developed, enabling GPUs to directly manage data transfers from storage devices to their high-bandwidth memory (HBM) with minimal CPU involvement [19]. This approach significantly reduces CPU overhead, allowing GPUs to handle data movement more efficiently. As illustrated in Fig. 2, GPUDirect demonstrates higher efficiency than SSD-to-CPU transfers for smaller I/O sizes, while achieving comparable performance for larger I/O sizes. However, for large I/O sizes, the initial setup and management overhead in GPUDirect can become more pronounced, potentially leading to lower throughput compared to CPU-only transfers [20]. Nevertheless, this overhead is relatively insignificant when transferring large data files, resulting in efficient data movement. By leveraging GPUDirect, systems can fully utilize the high read/write bandwidths of SSDs, freeing the CPU to focus on other tasks.
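As a concrete illustration of this direct data path, the sketch below reads one document's embedding block from an SSD straight into GPU memory. It uses the RAPIDS KvikIO Python wrapper around NVIDIA's cuFile API purely for brevity; the ESPN implementation described later uses the C++ GDS interface, and the file layout and embedding dimensions here are illustrative assumptions.

```python
import cupy as cp
import kvikio

EMBED_DIM, TOKENS_PER_DOC = 128, 180           # illustrative layout assumptions
DOC_BYTES = EMBED_DIM * TOKENS_PER_DOC * 2     # fp16 token embeddings per document

def gds_read_doc(path: str, doc_id: int) -> cp.ndarray:
    """Read one document's embedding block from SSD directly into GPU HBM."""
    buf = cp.empty(DOC_BYTES, dtype=cp.uint8)  # destination buffer lives in GPU memory
    with kvikio.CuFile(path, "r") as f:
        # With GDS enabled, the DMA goes SSD -> GPU without a bounce buffer in host DRAM.
        f.pread(buf, DOC_BYTES, file_offset=doc_id * DOC_BYTES).get()
    return buf.view(cp.float16).reshape(TOKENS_PER_DOC, EMBED_DIM)
```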
Given the relatively low cost of storage devices, multiple SSDs can be multiplexed through a high-speed PCIe connection to maximize its bandwidth utilization, enabling data transfer at substantially higher rates. Such a configuration would ensure better utilization of GPU resources while mitigating the bottlenecks inherent in traditional data transfer workflows.
3 Scalable neural information retrieval with SSDs
Neural Information Retrieval models power modern search engines, Retrieval-Augmented Generation systems, and recommendation platforms, delivering accurate and contextually relevant results to complex queries [4, 21, 22]. Traditional Information Retrieval (IR) methods, such as BM25, were designed for keyword matching and optimized for execution on CPUs [23]. Although effective for simple retrieval tasks, these approaches often failed to capture the semantic depth of queries and documents, particularly when dealing with synonyms, contextual nuances, or complex relationships. With the adoption of neural approaches and advanced language models like BERT and ColBERT, modern information retrieval (IR) systems have transitioned to GPU-based architectures to accommodate their complexity and performance requirements [24, 25]. This transition from CPU-based to GPU-based computation has dramatically enhanced retrieval accuracy but has also introduced significant challenges in ensuring memory efficiency and scalability.
3.1 Neural information retrieval systems
Neural IR systems encode text into rich contextual embeddings, capturing deeper semantic nuances and delivering state-of-the-art performance in retrieval tasks [25, 26]. Unlike traditional keyword-based methods, these embeddings necessitate computationally intensive similarity searches, which are efficiently executed on GPUs. The Neural Information Retrieval pipeline employs a multi-stage process optimized for large-scale datasets. First, queries and documents are encoded into dense embeddings using pre-trained or fine-tuned language models. These embeddings are processed by Approximate Nearest Neighbor (ANN) search algorithms, like FAISS or SPANN, which use clustering techniques such as k-means to efficiently identify candidate documents [27, 28]. In the re-ranking stage, computationally intensive similarity functions, like the MaxSim operator, refine the relevance of these candidates [25]. This hierarchical approach balances speed and accuracy, delivering highly relevant results while scaling effectively to large datasets.
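The following sketch illustrates the candidate-generation stage with a Faiss IVF index; the vectors, dimensions, nlist, and nprobe values are placeholders rather than the configuration evaluated in Sect. 4, and MaxSim re-ranking is sketched separately in Sect. 3.2.

```python
import numpy as np
import faiss

d, nlist = 128, 4096                              # illustrative dimensions
quantizer = faiss.IndexFlatIP(d)                  # coarse quantizer over k-means centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

doc_vecs = np.random.rand(100_000, d).astype("float32")  # stand-in document encodings
index.train(doc_vecs)                             # learn the cluster centroids
index.add(doc_vecs)

index.nprobe = 64                                 # number of clusters searched per query
query_vec = np.random.rand(1, d).astype("float32")
scores, candidate_ids = index.search(query_vec, 1024)    # candidate set for re-ranking
```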
Neural Information Retrieval (IR) systems have made significant strides with the adoption of multi-vector models, which encode queries and documents at the token level, achieving state-of-the-art retrieval accuracy. However, these advancements come with substantially increased memory requirements, posing challenges to scalability. Table 3 provides a comprehensive overview of various retrieval systems, showcasing their index size, query latency, and retrieval performance across in-domain (MSMARCO v1) and out-of-domain (MSMARCO v2) datasets [34, 35]. In information retrieval, Recall@K measures the proportion of relevant items retrieved within the top K results, indicating how well the system captures relevant information. Mean Reciprocal Rank (MRR) evaluates the average of the reciprocal ranks of the first relevant item across multiple queries, reflecting how quickly the first relevant result appears. While advanced neural multi-vector models achieve superior retrieval scores, their index sizes increase dramatically compared to traditional BM25. This significant growth in memory requirements underscores the scalability challenges inherent to deploying state-of-the-art retrieval systems in large-scale scenarios.
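For reference, both metrics can be computed per query from a ranked result list as in the sketch below (illustrative helpers, not our evaluation code); corpus-level scores average these values over all queries.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top-k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```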
Conventional approaches that store retrieval indices and document embeddings in system memory become cost-prohibitive for large-scale datasets. To overcome this limitation, ESPN was introduced, leveraging SSDs to store embeddings while ensuring low latency and high throughput by overlapping compute with prefetching, effectively addressing the memory bottleneck in large-scale retrieval tasks [12].
3.2 ESPN: embedding from storage pipelined network
The Embedding from Storage Pipelined Network (ESPN) introduces a highly efficient, GPU-centric architecture tailored for large-scale Neural Information Retrieval (IR), optimizing retrieval performance and scalability. ESPN minimizes query latency by directly transferring data from SSDs to GPU memory using GPUDirect Storage (GDS), bypassing traditional CPU-based file I/O bottlenecks. The system incorporates several key ideas, including software prefetching and early re-ranking, to optimize performance, as shown in Fig. 3. ESPN dynamically manages GPU memory through efficient I/O across the memory hierarchy, optimizing GPU resources for caching LLM parameters by partitioning candidate generation indices (ANN index) in CPU memory and offloading large multi-vector embeddings (re-ranking index) to SSDs.
A naive SSD-based retrieval approach introduces significant storage latency into the critical path of query execution, severely impacting query throughput and scalability, particularly for large datasets. To mitigate this, we propose a flexible software prefetcher for hierarchical clustering-based searches that exploits the characteristics of approximate nearest neighbor (ANN) algorithms. By examining an initial subset of clusters, the prefetcher identifies a likely portion of the true nearest neighbors and generates an approximate list of document IDs. This approach leverages the efficiency of inverted file (IVF)-based ANN algorithms, which balance accuracy and speed by controlling the number of clusters (nprobe) to search. Figure 4 highlights how the number of clusters searched influences recall and search latency. Once \(\delta\) clusters are explored, the prefetcher retrieves embeddings for the top K document IDs using GPUDirect Storage (GDS), while the ANN search continues examining additional clusters \(\lambda\) to refine recall.
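The sketch below outlines this overlap. The helpers ann_search_partial, gds_prefetch, and gds_fetch are placeholders standing in for the Faiss search and GDS transfer calls of the actual implementation, and the threading model is simplified.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_with_prefetch(query_vec, nprobe, prefetch_step=0.3, k=1024):
    delta = int(prefetch_step * nprobe)

    # 1) Search the first delta clusters to obtain an approximate candidate list.
    approx_ids = ann_search_partial(query_vec, clusters=delta, k=k)

    with ThreadPoolExecutor(max_workers=1) as pool:
        # 2) Prefetch those embeddings SSD -> GPU via GDS while the search continues.
        prefetch_job = pool.submit(gds_prefetch, approx_ids)

        # 3) Search the remaining clusters to refine recall.
        final_ids = ann_search_partial(query_vec, clusters=nprobe, k=k)
        prefetched = prefetch_job.result()          # dict: doc_id -> embedding

    # 4) Only embeddings the prefetcher missed are fetched in the critical path.
    missed = [doc for doc in final_ids if doc not in prefetched]
    prefetched.update(gds_fetch(missed))
    return final_ids, prefetched
```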
This overlap ensures that embedding retrieval is completed in parallel with ANN search, effectively hiding storage latency and minimizing delays in the critical path. The prefetcher’s effectiveness depends on parameters such as the prefetching budget, which quantifies the time available to overlap retrieval and computation, and can be adjusted to accommodate larger datasets or varying query loads. The PrefetchBudget can be computed using Eq. 1. The PrefetchBudget is controlled by the hyperparameter PrefetchStep, which dictates when prefetching occurs. PrefetchStep is defined as a percentage of the total nprobe selected and is computed using Eq. 2. By dynamically managing the trade-off between latency and recall, the prefetcher supports high-throughput query execution while ensuring embeddings are ready for re-ranking as the ANN search concludes.
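To make these quantities concrete, let \(T_{ANN}(c)\) denote the latency of searching \(c\) clusters. One simplified way to express the relationship described above (our shorthand; the precise forms are given by Eqs. 1 and 2) is \(\delta = \lceil (\text{PrefetchStep}/100) \cdot \text{nprobe} \rceil\) and \(\text{PrefetchBudget} \approx T_{ANN}(\text{nprobe}) - T_{ANN}(\delta)\); that is, the prefetcher fires after \(\delta\) clusters and must complete within the time the ANN search spends on the remaining clusters.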
ESPN further reduces critical path delays through an early re-ranking stage, where the MaxSim operator processes embeddings immediately after retrieval. By utilizing GPU resources efficiently and overlapping computation with ANN search, this step minimizes the re-ranking workload, particularly for batch queries. ESPN combines these techniques to deliver near-memory-level performance for SSD-based retrievals, reduce memory overhead, and effectively scale to large datasets and high query throughput.
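A minimal sketch of the MaxSim operator, batched over candidate documents, is shown below; shapes are illustrative and padding tokens are ignored for brevity. The candidates are then ordered by these scores to produce the final ranking.

```python
import torch

def maxsim_scores(q_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """q_emb: [Lq, d] query token embeddings; doc_embs: [N, Ld, d] embeddings of N candidates.
    Returns one relevance score per candidate document."""
    # Token-level similarity matrices for all candidates at once: [N, Lq, Ld]
    sim = torch.einsum("qd,nld->nql", q_emb, doc_embs)
    # Max over document tokens, then sum over query tokens (late interaction).
    return sim.max(dim=-1).values.sum(dim=-1)

# Example usage with prefetched embeddings already resident in GPU memory:
# order = maxsim_scores(q_emb, candidate_embs).argsort(descending=True)
```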
To fully hide prefetcher latency, it is essential for the prefetching process to complete before the conclusion of the approximate nearest neighbor (ANN) search. For single-query systems, this is typically feasible due to adequate SSD bandwidth and prefetch budgets. However, as the number of simultaneous or batch queries increases, the bandwidth demand rises, and beyond a certain batch size, the prefetching time may exceed the budget. This introduces latency into the critical path. The maximum query batch size that avoids such latency can be determined using Eq. 3.
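As a rough sketch of this bound (our notation; the exact expression is given by Eq. 3): if each query prefetches \(K\) candidate embeddings of average size \(S\) bytes over an aggregate SSD read bandwidth \(BW\), a batch of \(B\) queries stays within the budget only when \(B \cdot K \cdot S / BW \le \text{PrefetchBudget}\), giving \(B_{max} \approx \lfloor \text{PrefetchBudget} \cdot BW / (K \cdot S) \rfloor\).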
3.3 ESPN-Lite: lightweight partial re-ranking for bandwidth-efficient IR
Neural IR systems typically re-rank large candidate sets (e.g., 1000+ documents) to generate a ranked list of results. However, with advancements in single-vector neural retrievers, comparable retrieval performance can be achieved by re-ranking only a subset of top candidates. For instance, re-ranking the top 64\(-\)128 documents using the MaxSim operator and aggregating the results with the remaining candidates maintains 99\(-\)99.7% of the MRR@10 scores for ColBERTv2 and ColBERTer. This approach significantly reduces embedding data transfer (8\(-\)16\(\times\) per query), enabling larger query batch sizes while preserving high retrieval accuracy. Evaluation was performed using first-stage retrievers from Pyserini, as shown in Fig. 5.
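A simplified sketch of this partial re-ranking follows; the merging rule (re-scored head followed by the tail in first-stage order) is an illustrative assumption, and maxsim_fn stands in for the MaxSim scoring described in Sect. 3.2.

```python
def partial_rerank(candidate_ids, maxsim_fn, cutoff=128):
    """Re-rank only the top `cutoff` first-stage candidates; keep the rest as-is."""
    head, tail = list(candidate_ids[:cutoff]), list(candidate_ids[cutoff:])
    head_scores = maxsim_fn(head)                         # exact late-interaction scores
    reranked = [doc for _, doc in sorted(zip(head_scores, head), reverse=True)]
    return reranked + tail                                # tail keeps first-stage order
```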
3.4 ESPN-LIVE: real-time embedding synthesis
A significant challenge in modern GPU-based workloads is the underutilization of GPU resources due to delays caused by I/O and other dependencies. In the neural retrieval pipeline, GPU-based computation accounts for less than 15% of the total retrieval time, with the majority spent on CPU-driven ANN searches and index retrievals. Leveraging the idle GPU resources during these periods offers an opportunity to enhance both memory and storage efficiency. GPUs have grown powerful enough to perform multiple BERT inferences simultaneously through batching. Table 4 illustrates the optimized BERT latency with different batch sizes on various NVIDIA GPUs. This indicates that for small query workloads, document embeddings can be computed on-demand with minimal encoding overhead.
ESPN-Lite introduced a bandwidth-efficient partial re-ranking solution that retrieves only a small subset of top-k document embeddings for re-ranking. Building on this concept, we propose ESPN-LIVE (shown in Fig. 6), which eliminates the need to retrieve precomputed document embeddings from SSDs. Instead, we retrieve pre-tokenized documents and generate embeddings for a small top-k list on-the-fly. In ESPN-LIVE, the prefetching operation is replaced with early document encoding, where embeddings for an initial set of documents are computed and overlapped with the CPU-based ANN search. For any documents missed during this process, embeddings are computed during the critical path. By combining ESPN’s efficient overlapping pipeline with approximate re-ranking and on-demand embedding synthesis, ESPN-LIVE completely removes the need to store multi-vector document embeddings in storage, making it highly effective for small query workloads.
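The sketch below shows on-demand document encoding as a single batched forward pass; bert-base-uncased is a stand-in for the fine-tuned ColBERTer encoder, and in ESPN-LIVE the inputs would be the pre-tokenized documents fetched from storage.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").cuda().eval()

def encode_documents(docs: list[str]) -> torch.Tensor:
    """Compute token-level embeddings for a batch of candidate documents."""
    batch = tokenizer(docs, padding=True, truncation=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        return encoder(**batch).last_hidden_state        # [batch, tokens, hidden]
```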
4 Evaluation
In this section, we evaluate the performance and scalability of ESPN and its extension ESPN-LIVE. Our evaluation focuses on demonstrating the effectiveness of ESPN's key optimizations, including SSD-based embedding retrieval, prefetching strategies, and real-time embedding synthesis, across several metrics. The results are based on experiments conducted using the publicly available ColBERTer model on the MS-MARCO v1 and v2 datasets.
4.1 Experimental setup
The ESPN framework integrates approximately 600 lines of C++ and 1100 lines of Python code, leveraging Nvidia GPUDirect Storage for direct data transfers and Faiss for efficient ANN search. Experiments were conducted on the MS-MARCO v1 and v2 datasets using an Intel Xeon W-2255 CPU, an Nvidia A5000 GPU, and a Samsung PM983 SSD. For MS-MARCO v1, we used Faiss with an IVFFlat index (4.3 GB); for MS-MARCO v2, we used Faiss with an IVFPQ index (m=128, nbits=8), resulting in a 17.5 GB ANN index. In ESPN and ESPN-Lite, these indices were cached in system memory, while the re-ranking embedding tables (16.8 GB for v1 and 255.4 GB for v2) were stored on SSDs. Embedding retrieval and prefetching were configured to achieve 90% prefetcher hit rates, with memory configurations simulated using cgroups [37]. The latency measurements encompassed all retrieval pipeline stages to assess end-to-end query performance. In approximate methods such as ESPN-Lite and ESPN-LIVE, we re-rank only the top 128 documents retrieved from the ANN search, whereas in exact solutions, the top 1024 documents are re-ranked.
4.2 Prefetcher results
Figure 7 illustrates the prefetcher's hit rate across varying prefetch steps, expressed as a percentage of nprobe. For the v1 dataset (nprobe = 1000 and 3000), the hit rate improves significantly, from 74% and 85% at a 5% prefetch step to over 90% at a 30% step. Similarly, for the v2 dataset (nprobe = 160 and 200), the hit rate rises from 68% and 70% at 5% to approximately 89% and 90% at 30%. These results show that we can effectively prefetch the majority of embeddings, reducing the number of embeddings that must be accessed in the critical path.
4.3 End-to-end retrieval results
Tables 5 and 6 show the average end-to-end query latency under different memory configurations for MS-MARCO v1 and v2, respectively, using the publicly available ColBERTer model. This section includes exact retrieval scores with no approximations using different memory configurations (mmap, virtual memory, GDS, ESPN) and approximate retrieval results (ESPN-Lite, ESPN-LIVE). The retrieval scores achieved by the memory-based and storage-based (GDS, ESPN) methods are equivalent and are reported in Table 3. ESPN-Lite and ESPN-LIVE incur a negligible quality degradation of only 0.5% in the MRR@10 scores.
When the embedding tables are small enough to fit in memory, mmap performs efficiently by leveraging system memory caching after the initial access. However, as index sizes exceed available memory, mmap incurs significant software overhead, as demonstrated in Table 5 with a 10-GB memory limit. For MS-MARCO v2, where the index far exceeds physical memory capacity, this overhead becomes substantial, making mmap impractical for large-scale retrieval tasks. ESPN addresses this challenge, achieving 3.1\(-\)3.9\(\times\) faster query latencies than mmap by mitigating these overheads. Using swap space partially alleviates mmap's inefficiencies by prefetching 8 pages per fault, but this approach is limited by the combined size of the memory and swap space.
GDS reduces software overhead but retains SSD latency in the critical path. ESPN achieves near-memory-level query latency without storing BOW (bag of whole-words) embeddings in memory (Table 5) and closely approximates a fully cached solution, with \(\le 1\) ms embedding access latency (Table 6). ESPN offers competitive end-to-end latency (\(1.02\times\)) compared to fully cached solutions while storing only 6\(-\)19% of retrieval indices in memory, reducing memory usage by \(5-16\times\) depending on the ANN index size.
ESPN-Lite improves the scalability of the system by re-ranking only the top-ranked documents (128) at a marginal cost of a 0.5% reduction in the MRR@10 retrieval scores. In contrast, ESPN-LIVE exhibits comparable retrieval performance while obviating the need to store re-ranking embedding tables, thereby alleviating storage constraints. However, this approach is subject to limitations in batch-size scalability: increasing query batches incurs higher encoding costs. Notably, the benefits of ESPN-LIVE are most pronounced in systems with large document collections, which require substantial memory and storage for retrieval indices but operate at low query rates.
4.4 Large-scale query throughput
The performance gap between memory-based and SSD-based solutions largely stems from embedding access latency in the pipeline’s critical path. If SSD access latency approaches memory levels, similar query throughput can be maintained for large batches. To evaluate ESPN’s scalability, we fixed a prefetch budget and modeled embedding retrieval for exact and approximate solutions. Exact solutions retrieve 1024 document embeddings per query, while approximate solutions retrieve only the top 128 embeddings. SSD latency measurements were obtained using Nvidia’s gdsio tool [38].
Figure 8 shows the end-to-end query throughput with larger batch sizes for the exact retrieval solutions. In the end-to-end system, ESPN is competitive with the DRAM-based solution up to a batch size of 12 and improves throughput over GDS-based retrieval by 68%. Figure 9 shows the embedding access latency in the critical path of the query. Naive GDS-based solutions incur 7.7\(\times\) higher embedding access latency compared to DRAM. ESPN minimizes this gap by prefetching the relevant embeddings during the ANN search and retrieving only the small portion of missed documents in the critical path. These results show that ESPN-Lite is competitive with fully cached DRAM-based solutions up to a batch size of 96. ESPN-Lite reduces the bandwidth requirements by 8\(\times\), enabling the prefetcher to stay within its budget even for larger batch sizes.
5 Related work
Several works have explored SSD-based systems to boost memory efficiency in neural information processing systems. For example, DiskANN [39] and SPANN [40] offload parts of the ANN index to storage, yet they focus solely on ANN search rather than the entire neural IR pipeline. In contrast, our work targets end-to-end multi-vector retrieval by offloading a substantial re-ranking index to SSDs while incorporating a flexible prefetcher with early re-ranking to mitigate SSD latencies. Moreover, while many state-of-the-art systems reduce memory demands through token dropping, quantization, and compression [31, 41, 42], they do not directly optimize for multi-vector embedding retrieval. To the best of our knowledge, ESPN and ESPN-LIVE are the first to address this gap. Recent work on multi-vector optimization has explored mapping ColBERT embeddings to a sparse space for efficient retrieval [43] and reducing memory usage with memory mapping while accelerating retrieval via hybrid scoring [44]. Our approach seamlessly integrates with any multi-vector model by modifying the retrieval pipeline without altering the model weights.
6 Limitations and future work
Optimizing retrieval by using larger I/O sizes (e.g., 8KB) and packing more token embeddings per block can enhance throughput. However, PCIe bandwidth remains a bottleneck compared to GPU HBM bandwidth, particularly in GPU-centric architectures using SSDs for large batch sizes. Scaling such systems for higher throughput is constrained by the limited bandwidth and efficiency of current storage interfaces.
Future directions should explore advanced RAID configurations (e.g., RAID 0) to aggregate multiple SSDs and scale bandwidth effectively [45, 46]. Scalability and bandwidth challenges in GPU-centric architectures require dynamic memory management approaches that effectively balance memory across GPU, CPU, and SSDs. CXL interconnects offer a promising avenue for distributed memory solutions, potentially enabling higher bandwidth and lower latency for large-scale retrieval systems [47, 48].
Expanding ESPN’s design principles to LLM inference systems presents an exciting opportunity. By offloading storage-intensive tasks such as key-value caching or embedding retrieval, these systems can handle growing dataset sizes more efficiently while maintaining low latency. Furthermore, overlapping IR latency with text generation in Retrieval-Augmented Generation (RAG) pipelines and dynamically caching frequently accessed embeddings in high-bandwidth memory tiers can enhance system performance.
7 Conclusion
The increasing prevalence of GPU-centric workloads in AI and ML has underscored the critical need to address escalating memory demands while maintaining compute efficiency. This paper demonstrates the significant potential of GPU-centric architectures augmented with SSDs to address the memory and compute challenges posed by modern AI/ML workloads. By leveraging GPUDirect Storage (GDS) and introducing ESPN, the study bridges the gap between storage latency and GPU throughput, effectively reducing memory overhead while maintaining low query latency through advanced ANN-based prefetching. The introduction of ESPN-LIVE further underscores the scalability and efficiency of this approach, particularly in scenarios involving large-scale databases with low query rates. Experimental results validate the viability of these optimizations, showcasing significant gains in scalability, retrieval latency, and system throughput.
References
Gupta U, Wu CJ, Wang X, Naumov M, Reagen B, Brooks D et al (2020) The Architectural Implications of Facebook’s DNN-Based Personalized Recommendation. In: 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA); p. 488–501
Li J, Xu J, Huang S, Chen Y, Li W, Liu J et al Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective. Available from: https://arxiv.org/abs/2410.04466
Wei W, Ren X, Tang J, Wang Q, Su L, Cheng S et al (2024) LLMRec: Large Language Models with Graph Augmentation for Recommendation. In: Proceedings of the 17th ACM International Conference on Web Search and Data Mining. WSDM ’24. New York, NY, USA: Association for Computing Machinery; p. 806-815. Available from: https://doi.org/10.1145/3616855.3635853
Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N et al (2020) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20. Red Hook, NY, USA: Curran Associates Inc.; p. 1–16
Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L et al (2021) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 44(10):7112–7127. https://doi.org/10.1109/tpami.2021.3095381
Chen M, Tworek J, Jun H, Yuan Q, Pinto HPdO, Kaplan J, Edwards H et al Evaluating Large Language Models Trained on Code. Available from: https://arxiv.org/abs/2107.03374
Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A et al The Llama 3 Herd of Models. Available from: https://arxiv.org/abs/2407.21783
Shrestha SL, Li Z, Pitchumani R System and method for embeddings retrieval. Google Patents. US Patent App. 18/226,759
Shrestha SL, Li Z, Pitchumani R System and method for processing embeddings. Google Patents. US Patent App. 18/226,758
Nvidia GPUDirect Storage. NVIDIA Docs. Available from: https://docs.nvidia.com/gpudirect-storage/index.html
Inupakutika D, Davis B, Yang Q, Kim D, Akopian D (2022) Quantifying Performance Gains of GPUDirect Storage. In: 2022 IEEE International Conference on Networking, Architecture and Storage (NAS); p. 1–9
Shrestha S, Reddy N, Li Z (2024) ESPN: Memory-Efficient Multi-vector Information Retrieval. In: Proceedings of the 2024 ACM SIGPLAN International Symposium on Memory Management. New York, NY, USA: Association for Computing Machinery; p. 95–107
Hobbhahn M. Trends in GPU Price-Performance. Available from: https://epoch.ai/blog/trends-in-gpu-price-performance
Carvalho MNL, Simitsis A, Queralt A, Romero O (2024) Workload Placement on Heterogeneous CPU-GPU Systems. Proc VLDB Endow. 17(12):4241–424. https://doi.org/10.14778/3685800.3685845
Girondi M, Scazzariello M, Maguire GQ, Kostić D (2024) Toward GPU-centric Networking on Commodity Hardware. In: Proceedings of the 7th International Workshop on Edge Systems, Analytics and Networking. EdgeSys ’24. New York, NY, USA: Association for Computing Machinery; p. 43-48. Available from: https://doi.org/10.1145/3642968.3654820
Rosenfeld V, Breß S, Markl V (2022) Query Processing on Heterogeneous CPU/GPU Systems. ACM Comput Surv 55(1). https://doi.org/10.1145/3485126
DigitalOcean GPU Memory Bandwidth. Accessed: 2025-01-09. https://www.digitalocean.com/community/tutorials/gpu-memory-bandwidth
Crotty A, Leis V, Pavlo A (2022) Are You Sure You Want to Use MMAP in Your Database Management System? In: CIDR 2022, Conference on Innovative Data Systems Research;
Qureshi Z, Mailthody VS, Gelado I, Min S, Masood A, Park J et al (2023) GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture. In: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. ASPLOS 2023. New York, NY, USA: Association for Computing Machinery; p. 325-339. Available from: https://doi.org/10.1145/3575693.3575748
The Micron 9400 NVMe SSD Performance With NVIDIA Magnum IO GPUDirect Storage Platform. Available from: https://www.micron.com/content/dam/micron/global/public/products/white-paper/micron-9400-nvidia-gds-vs-comp-white-paper.pdf
Li X, Jin J, Zhou Y, Zhang Y, Zhang P, Zhu Y et al From Matching to Generation: A Survey on Generative Information Retrieval. Available from: https://arxiv.org/abs/2404.14851
Salemi A, Zamani H Towards a Search Engine for Machines: Unified Ranking for Multiple Retrieval-Augmented Large Language Models. Available from: https://arxiv.org/abs/2405.00175
Robertson S, Zaragoza H (2009) The Probabilistic Relevance Framework: BM25 and Beyond. Found Trends Inf Retr 3(4):333–389. https://doi.org/10.1561/1500000019
Devlin J, Chang MW, Lee K, Toutanova K BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Available from: https://arxiv.org/abs/1810.04805
Khattab O, Zaharia M ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Available from: https://arxiv.org/abs/2004.12832
Dai Z, Callan J Deeper Text Understanding for IR with Contextual Neural Language Modeling. ACM. Available from: http://dx.doi.org/10.1145/3331184.3331303
Douze M, Guzhva A, Deng C, Johnson J, Szilvasy G, Mazaré PE et al The Faiss library. Available from: https://arxiv.org/abs/2401.08281
Chen Q, Zhao B, Wang H, Li M, Liu C, Li Z et al SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search. Available from: https://arxiv.org/abs/2111.08566
Mackenzie J, Trotman A, Lin J Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation
Gao L, Dai Z, Callan J (2021) COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics; p. 3030–3042. Available from: https://aclanthology.org/2021.naacl-main.241
Hofstätter S, Khattab O, Althammer S, Sertkan M, Hanbury A (2022) Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions Using Enhanced Reduction. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. CIKM ’22. New York, NY, USA: Association for Computing Machinery; p. 737-747. Available from: https://doi.org/10.1145/3511808.3557367
Santhanam K, Khattab O, Saad-Falcon J, Potts C, Zaharia M (2022) ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics; p. 3715–3734. Available from: https://aclanthology.org/2022.naacl-main.272
Santhanam K, Khattab O, Potts C, Zaharia M (2022) PLAID: An Efficient Engine for Late Interaction Retrieval. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. CIKM ’22. New York, NY, USA: Association for Computing Machinery; p. 1747-1756. Available from: https://doi.org/10.1145/3511808.3557325
Bajaj P, Campos D, Craswell N, Deng L, Gao J, Liu X, et al MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
TREC 2023 Deep Learning Track. Available from: https://microsoft.github.io/msmarco/TREC-Deep-Learning.html
NVIDIA Developer. AI Inference. Accessed: 2025-01-09. https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference
Kerrisk MJ cgroups.7 - Linux manual page. https://man7.org/linux/man-pages/man7/cgroups.7.html
NVIDIA GPUDirect Storage Benchmarking and Configuration Guide. NVIDIA Docs. Available from: https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html
Subramanya SJ, Devvrit, Kadekodi R, Krishnaswamy R, Simhadri H (2019) DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. In: NeurIPS 2019;
Chen Q, Zhao B, Wang H, Li M, Liu C, Li Z et al (2021) SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search. In: 35th Conference on Neural Information Processing Systems (NeurIPS 2021);
Jégou H, Douze M, Schmid C (2011) Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1):117–128. https://doi.org/10.1109/TPAMI.2010.57
Nardini FM, Rulli C, Venturini R Efficient Multi-Vector Dense Retrieval Using Bit Vectors. Available from: https://arxiv.org/abs/2404.02805
Formal T, Clinchant S, Déjean H, Lassance C (2024) Splate: Sparse late interaction retrieval. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval; p. 2635–2640
Huang K, Venkatesh T, Dingankar U, Mallia A, Campos D, Jiao J et al (2025) ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring. In: Proceedings of the 47th European Conference on Information Retrieval (ECIR);
Patterson DA, Gibson GA, Katz RH A Case for Redundant Arrays of Inexpensive Disks (RAID); 1987. UCB/CSD-87-391. Available from: http://www2.eecs.berkeley.edu/Pubs/TechRpts/1987/5853.html
Qin M, Reddy ALN, Gratz PV, Pitchumani R, Ki YS (2021) KVRAID: high performance, write efficient, update friendly erasure coding scheme for KV-SSDs. In: Proceedings of the 14th ACM International Conference on Systems and Storage. SYSTOR ’21. New York, NY, USA: Association for Computing Machinery; Available from: https://doi.org/10.1145/3456727.3463781
Sharma DD, Blankenship R, Berger DS An Introduction to the Compute Express Link (CXL) Interconnect
Tirumalasetty C, Annapareddy NR (2024) Contention aware DRAM caching for CXL-enabled pooled memory. In: Proceedings of the International Symposium on Memory Systems. MEMSYS ’24. New York, NY, USA: Association for Computing Machinery; p. 157-171. Available from: https://doi.org/10.1145/3695794.3695808