1 Introduction

Executing machine learning (ML) models for predictions and inferences has become a critical workload across diverse domains, such as information retrieval, natural language processing (NLP), image recognition, and recommendation systems. These ML services are typically hosted on high-performance servers equipped with multiple CPU cores and specialized accelerators, such as GPUs and TPUs, which are essential for efficiently handling inference tasks [1, 2].

Table 1 LLM model and KV Cache Size: 16-bit model weights and 16-bit KV Cache

Large Language Models (LLMs), which are central to most AI systems today [3,4,5,6,7], often have large memory footprints, as shown in Table 1. Given the limited memory capacity of GPUs and the high memory demands of modern information processing systems, it is essential to optimize memory hierarchy utilization while maintaining high-throughput performance. For computations performed on the GPU, datasets stored in secondary storage, such as SSDs, must be transferred to the GPU for processing. Typically, this transfer is initiated and managed by the CPU: the data are first moved from secondary storage to main memory and then from main memory to the GPU. This approach is inefficient and fails to fully exploit the memory hierarchy of heterogeneous systems.

In information retrieval systems, this issue becomes even more pronounced with large embedding tables required for retrieval and re-ranking [8, 9]. Document embeddings, which need to be fetched from storage devices for similarity computation with query vectors, vary based on the query. Consequently, embeddings must be repeatedly retrieved for every new query. If the main memory is involved in every retrieval, valuable CPU cycles are consumed, and latency is introduced during the transfer.

Fig. 1 GPU-Centric Memory Hierarchy

One potential solution is to make the GPU the central command unit for compute-intensive workloads. As shown in Fig. 1, heterogeneous systems featuring both CPUs and GPUs exhibit a distinct memory hierarchy: GPU SRAM is the fastest memory, followed by GPU HBM, CPU DRAM (main memory), and secondary storage such as NVMe SSDs. Each level down the hierarchy offers slower speeds but reduced costs. For instance, SSDs are significantly cheaper and can be configured under different RAID setups to maximize system bandwidth.

GPUDirect technology enables GPUs to transfer data directly from SSDs to GPU HBM with minimal CPU involvement [10]. This direct data path reduces latency by eliminating the intermediate step of copying data from SSDs to main memory [11]. Furthermore, it increases throughput when utilizing multiple SSDs and frees the CPU for other tasks.

In this paper, we propose a novel memory organization and management strategy for heterogeneous systems with compute-intensive workloads, termed the GPU-centric architecture. Our contributions include:

  1. Proposing an efficient system design for heterogeneous architectures with a GPU-centric memory organization.

  2. Highlighting the impact of direct SSD access on modern workloads by presenting a case study of an information retrieval system based on ESPN [12], demonstrating memory efficiency and reduced latency.

2 Background

Over the past decade, the computational capacity of GPUs for machine learning, measured in FLOPs, has increased nearly 30-fold due to advancements in parallel processing architectures and high-bandwidth memory systems [13]. These developments have shifted compute-intensive tasks from CPUs to GPUs [14, 15]. However, in many modern ML workflows, data are still loaded from storage to the CPU before being transferred to the GPU for computation, creating inefficiencies in the data transfer pipeline [14, 16]. This workflow creates significant bottlenecks, primarily due to the reliance on PCIe links for both storage-to-CPU and CPU-to-GPU transfers. Moreover, the peak sequential bandwidth of a modern SSD, about 7 GB/s, is roughly an order of magnitude lower than the maximum bandwidth supported by PCIe, further compounding the problem.

Even with PCIe Gen 5, which offers a theoretical peak bandwidth of 32 GB/s, the transfer rate remains two orders of magnitude slower than the GPU’s computational throughput. Table 2 highlights the memory capacity and bandwidth of several popular GPUs. The memory capacity of modern GPUs is still substantially lower than the demands of many applications, while GPU memory bandwidth far exceeds that of the connecting bus.
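
As a rough illustration of this gap (using the 32 GB/s figure above and assuming roughly 2 TB/s of HBM bandwidth, on the order of an A100-class GPU from Table 2), staging a 64 GB embedding table onto the GPU takes seconds over PCIe, while the GPU can re-scan the same data from HBM in tens of milliseconds:

$$\begin{aligned} t_{\text{PCIe}} \approx \frac{64\ \text{GB}}{32\ \text{GB/s}} = 2\ \text{s}, \qquad t_{\text{HBM}} \approx \frac{64\ \text{GB}}{2000\ \text{GB/s}} = 32\ \text{ms} \end{aligned}$$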

This mismatch often leads to suboptimal GPU utilization, a problem that can be exacerbated in multi-GPU systems. Additionally, the CPU is heavily burdened during the data transfers, as it must mediate data movement between components, leaving little opportunity for it to perform other meaningful computations. This inefficiency underscores the need for innovative solutions to optimize data flow and resource utilization in GPU-centric workloads.

Table 2 GPU Specifications [17]

Although memory mapping (mmap) allows files to be integrated directly into a process’s virtual address space, it comes with inherent inefficiencies. In scenarios where data are accessed randomly, as is typical in ML workloads, the frequent page faults and associated OS-level management can significantly slow down data transfer, exacerbating the bottlenecks already present in PCIe-based communication [18].

To address this challenge, GPUDirect was developed, enabling GPUs to directly manage data transfers from storage devices to their high-bandwidth memory (HBM) with minimal CPU involvement [19]. This approach significantly reduces CPU overhead, allowing GPUs to handle data movement more efficiently. As illustrated in Fig. 2, GPUDirect achieves higher throughput than SSD-to-CPU transfers for smaller I/O sizes and comparable performance for larger I/O sizes. For large I/O sizes, GPUDirect’s setup and management overhead can become more pronounced, occasionally yielding lower throughput than CPU-only transfers [20]; in practice, however, this overhead is amortized when transferring large data files, so data movement remains efficient. By leveraging GPUDirect, systems can fully utilize the high read/write bandwidths of SSDs while freeing the CPU to focus on other tasks.
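
To make the data path concrete, the following is a minimal sketch of a direct SSD-to-GPU read through GPUDirect Storage, here via the kvikio Python bindings, alongside the conventional CPU bounce-buffer path. The file layout, embedding dimensions, and fp16 format are illustrative assumptions, not the configuration used later in this paper.

```python
# Minimal sketch: read a slice of an on-disk embedding table straight into GPU
# memory with GPUDirect Storage (kvikio bindings), versus the usual CPU
# "bounce buffer" path. Paths and shapes are illustrative assumptions.
import cupy as cp
import numpy as np
import kvikio

EMB_DIM = 128
BYTES_PER_ROW = EMB_DIM * 2  # fp16

def read_embeddings_gds(path: str, row: int, num_rows: int) -> cp.ndarray:
    """SSD -> GPU HBM directly; the CPU only issues the request."""
    out = cp.empty((num_rows, EMB_DIM), dtype=cp.float16)
    with kvikio.CuFile(path, "r") as f:
        # pread is asynchronous; .get() waits for completion.
        f.pread(out, out.nbytes, file_offset=row * BYTES_PER_ROW).get()
    return out

def read_embeddings_bounce(path: str, row: int, num_rows: int) -> cp.ndarray:
    """Conventional path: SSD -> CPU DRAM -> GPU HBM (two PCIe hops)."""
    host = np.fromfile(path, dtype=np.float16,
                       count=num_rows * EMB_DIM,
                       offset=row * BYTES_PER_ROW).reshape(num_rows, EMB_DIM)
    return cp.asarray(host)  # extra copy over PCIe
```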

Given the relatively low cost of storage devices, multiple SSDs can be multiplexed through a high-speed PCIe connection to maximize its bandwidth utilization, enabling data transfer at substantially higher rates. Such a configuration would ensure better utilization of GPU resources while mitigating the bottlenecks inherent in traditional data transfer workflows.

Fig. 2 Comparison of throughput between SSD-to-CPU (CPUONLY) and SSD-to-GPU (GPUD) transfers using GPUDirect across varying I/O sizes, with 16 worker threads used in both transfer scenarios

3 Scalable neural information retrieval with SSDs

Neural Information Retrieval models power modern search engines, Retrieval-Augmented Generation systems, and recommendation platforms, delivering accurate and contextually relevant results to complex queries [4, 21, 22]. Traditional Information Retrieval (IR) methods, such as BM25, were designed for keyword matching and optimized for execution on CPUs [23]. Although effective for simple retrieval tasks, these approaches often fail to capture the semantic depth of queries and documents, particularly when dealing with synonyms, contextual nuances, or complex relationships. With the adoption of neural approaches and advanced language models like BERT and ColBERT, modern IR systems have transitioned to GPU-based architectures to accommodate their complexity and performance requirements [24, 25]. This transition from CPU-based to GPU-based computation has dramatically enhanced retrieval accuracy but has also introduced significant challenges in ensuring memory efficiency and scalability.

3.1 Neural information retrieval systems

Neural IR systems encode text into rich contextual embeddings, capturing deeper semantic nuances and delivering state-of-the-art performance in retrieval tasks [25, 26]. Unlike traditional keyword-based methods, these embeddings necessitate computationally intensive similarity searches, which are efficiently executed on GPUs. The Neural Information Retrieval pipeline employs a multi-stage process optimized for large-scale datasets. First, queries and documents are encoded into dense embeddings using pre-trained or fine-tuned language models. These embeddings are processed by Approximate Nearest Neighbor (ANN) search algorithms, like FAISS or SPANN, which use clustering techniques such as k-means to efficiently identify candidate documents [27, 28]. In the re-ranking stage, computationally intensive similarity functions, like the MaxSim operator, refine the relevance of these candidates [25]. This hierarchical approach balances speed and accuracy, delivering highly relevant results while scaling effectively to large datasets.
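
As a concrete, deliberately simplified sketch of this two-stage structure (assuming 128-dimensional float32 vectors and an already trained and populated FAISS index; this is illustrative rather than ESPN's implementation):

```python
# Two-stage retrieval sketch: IVF-based candidate generation with FAISS,
# followed by ColBERT-style MaxSim re-ranking over token-level embeddings.
import faiss
import numpy as np

d, nlist = 128, 4096
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
# index.train(doc_vectors); index.add(doc_vectors)  # omitted for brevity

def candidate_generation(query_vec, k=1024, nprobe=200):
    index.nprobe = nprobe                    # clusters probed: speed/recall knob
    _, ids = index.search(query_vec.reshape(1, -1).astype(np.float32), k)
    return ids[0]

def maxsim(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best-matching document
    token and sum these maxima into one relevance score."""
    return (query_tokens @ doc_tokens.T).max(axis=1).sum()

def rerank(query_tokens, doc_token_embeddings, candidate_ids):
    scored = [(i, maxsim(query_tokens, doc_token_embeddings[i]))
              for i in candidate_ids]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```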

Table 3 Neural IR performance, index size, latency, and scaling for MSMARCO v1/v2

Neural Information Retrieval (IR) systems have made significant strides with the adoption of multi-vector models, which encode queries and documents at the token level, achieving state-of-the-art retrieval accuracy. However, these advancements come with substantially increased memory requirements, posing challenges to scalability. Table 3 provides a comprehensive overview of various retrieval systems, showcasing their index size, query latency, and retrieval performance across in-domain (MSMARCO v1) and out-of-domain (MSMARCO v2) datasets [34, 35]. In information retrieval, Recall@K measures the proportion of relevant items retrieved within the top K results, indicating how well the system captures relevant information. Mean Reciprocal Rank (MRR) evaluates the average of the reciprocal ranks of the first relevant item across multiple queries, reflecting how quickly the first relevant result appears. While advanced neural multi-vector models achieve superior retrieval scores, their index sizes increase dramatically compared to traditional BM25. This significant growth in memory requirements underscores the scalability challenges inherent to deploying state-of-the-art retrieval systems in large-scale scenarios.
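
Both metrics are simple to compute; a minimal reference implementation, assuming each query comes with a set of relevant document IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr_at_k(all_ranked, all_relevant, k=10):
    """Mean reciprocal rank of the first relevant hit, averaged over queries."""
    rr = []
    for ranked, relevant in zip(all_ranked, all_relevant):
        rank = next((i + 1 for i, doc in enumerate(ranked[:k]) if doc in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)
```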

Conventional approaches that store retrieval indices and document embeddings in system memory become cost-prohibitive for large-scale datasets. To overcome this limitation, ESPN was introduced, leveraging SSDs to store embeddings while ensuring low latency and high throughput by overlapping compute with prefetching, effectively addressing the memory bottleneck in large-scale retrieval tasks [12].

3.2 ESPN: embedding from storage pipelined network

Fig. 3 ESPN Retrieval Architecture

The Embedding from Storage Pipelined Network (ESPN) introduces a highly efficient, GPU-centric architecture tailored for large-scale neural IR, optimizing retrieval performance and scalability. ESPN minimizes query latency by transferring data directly from SSDs to GPU memory using GPUDirect Storage (GDS), bypassing traditional CPU-based file I/O bottlenecks. The system incorporates several key ideas, including software prefetching and early re-ranking, to optimize performance, as shown in Fig. 3. ESPN manages GPU memory dynamically through efficient I/O across the memory hierarchy: the candidate generation (ANN) indices are partitioned into CPU memory and the large multi-vector re-ranking embeddings are offloaded to SSDs, leaving GPU resources free for caching LLM parameters.

A naive SSD-based retrieval approach introduces significant storage latency into the critical path of query execution, severely impacting query throughput and scalability, particularly for large datasets. To mitigate this, we propose a flexible software prefetcher for hierarchical clustering-based searches that exploits the characteristics of approximate nearest neighbor (ANN) algorithms. By examining an initial subset of clusters, the prefetcher identifies a likely portion of the true nearest neighbors and generates an approximate list of document IDs. This approach leverages the efficiency of inverted file (IVF)-based ANN algorithms, which balance accuracy and speed by controlling the number of clusters (nprobe) to search. Figure 4 highlights how the number of clusters searched influences recall and search latency. Once \(\delta\) clusters are explored, the prefetcher retrieves embeddings for the top K document IDs using GPUDirect Storage (GDS), while the ANN search continues examining an additional \(\lambda\) clusters to refine recall.
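
The overlap can be pictured with the following simplified sketch, in which a background thread stands in for the asynchronous GDS prefetch and fetch_embeddings is a placeholder for the SSD-to-GPU read (ESPN's actual prefetcher is implemented in C++ on top of GPUDirect Storage):

```python
# Simplified picture of ESPN's prefetch overlap. A partial ANN search over
# delta clusters yields an approximate top-K; their embeddings are fetched
# asynchronously (SSD -> GPU) while the search examines the remaining clusters.
# Only embeddings missed by the prefetch are fetched in the critical path.
# `fetch_embeddings` is a placeholder returning {doc_id: embedding}.
from concurrent.futures import ThreadPoolExecutor

def search_with_prefetch(index, query, k, eta, delta, fetch_embeddings):
    with ThreadPoolExecutor(max_workers=1) as pool:
        index.nprobe = delta
        _, approx_ids = index.search(query, k)                # partial search
        prefetch = pool.submit(fetch_embeddings, list(approx_ids[0]))

        index.nprobe = eta                                    # in ESPN the search
        _, final_ids = index.search(query, k)                 # continues; here it
                                                              # simply restarts
        embeddings = prefetch.result()                        # usually done already
        missed = [i for i in final_ids[0] if i not in embeddings]
        embeddings.update(fetch_embeddings(missed))           # small critical-path fetch
        return final_ids[0], embeddings
```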

Fig. 4 Recall@1K vs. nprobe with ColBERTer on the MS-MARCO v1 dataset

$$\begin{aligned} \text{PrefetchBudget} \cong \text{ANNSearchTime}(nprobe = \eta ) - \text{ANNSearchTime}(nprobe = \delta ) \end{aligned}$$
(1)
$$\begin{aligned} \text{PrefetchStep} = \frac{\delta }{\eta } \times 100\% \end{aligned}$$
(2)

This overlap ensures that embedding retrieval completes in parallel with the ANN search, effectively hiding storage latency and minimizing delays in the critical path. The prefetcher’s effectiveness depends on the prefetching budget, which quantifies the time available to overlap retrieval with computation and can be adjusted to accommodate larger datasets or varying query loads. The PrefetchBudget is computed using Eq. 1 and is controlled by the hyperparameter PrefetchStep, which dictates when prefetching begins; PrefetchStep is defined as a percentage of the total nprobe and is computed using Eq. 2. By dynamically managing the trade-off between latency and recall, the prefetcher supports high-throughput query execution while ensuring embeddings are ready for re-ranking as the ANN search concludes.

ESPN further reduces critical path delays through an early re-ranking stage, where the MaxSim operator processes embeddings immediately after retrieval. By utilizing GPU resources efficiently and overlapping computation with ANN search, this step minimizes the re-ranking workload, particularly for batch queries. ESPN combines these techniques to deliver near-memory-level performance for SSD-based retrievals, reduce memory overhead, and effectively scale to large datasets and high query throughput.

To fully hide prefetcher latency, it is essential for the prefetching process to complete before the conclusion of the approximate nearest neighbor (ANN) search. For single-query systems, this is typically feasible due to adequate SSD bandwidth and prefetch budgets. However, as the number of simultaneous or batch queries increases, the bandwidth demand rises, and beyond a certain batch size, the prefetching time may exceed the budget. This introduces latency into the critical path. The maximum query batch size that avoids such latency can be determined using Eq. 3.

$$\begin{aligned} \text{Query batch threshold} = \frac{BW_{SSD} \cdot \text{PrefetchBudget}}{\text{Data size per query}} \end{aligned}$$
(3)
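
Plugging illustrative numbers into Eqs. 1–3 makes the trade-off concrete; the timings, bandwidth, and embedding sizes below are assumptions for the sake of the example, not measured values:

```python
# Worked example of Eqs. 1-3 with assumed numbers.
eta, delta = 1000, 100                      # full vs. early nprobe
ann_time = {eta: 0.050, delta: 0.012}       # ANNSearchTime in seconds (assumed)

prefetch_budget = ann_time[eta] - ann_time[delta]        # Eq. 1 -> 38 ms
prefetch_step = delta / eta * 100                        # Eq. 2 -> 10%

bw_ssd = 6.0e9                                           # ~6 GB/s usable SSD bandwidth
data_per_query = 1024 * 128 * 128 * 2                    # 1024 docs x 128 tokens
                                                         # x 128 dims x fp16 bytes
batch_threshold = bw_ssd * prefetch_budget / data_per_query   # Eq. 3 -> ~7 queries

print(f"budget = {prefetch_budget * 1e3:.0f} ms, step = {prefetch_step:.0f}%, "
      f"max batch = {batch_threshold:.0f}")
```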

3.3 ESPN-Lite: lightweight partial re-ranking for bandwidth-efficient IR

Fig. 5 MRR@10 vs. re-rank count with ColBERTv2 and ColBERTer on the MS-MARCO dataset

Neural IR systems typically re-rank large candidate sets (e.g., 1000+ documents) to generate a ranked list of results. However, with advancements in single-vector neural retrievers, comparable retrieval performance can be achieved by re-ranking only a subset of the top candidates. For instance, re-ranking the top 64-128 documents with the MaxSim operator and aggregating the results with the remaining candidates maintains 99\(-\)99.7% of the MRR@10 scores for ColBERTv2 and ColBERTer. This significantly reduces embedding data transfer, by 8\(-\)16\(\times\) per query, enabling larger query batch sizes while preserving high retrieval accuracy. Evaluation was performed using first-stage retrievers from Pyserini, as shown in Fig. 5.
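
A sketch of this partial re-ranking step (placeholder function names; maxsim and fetch_embeddings are as in the earlier sketches):

```python
# ESPN-Lite-style partial re-ranking: re-score only the top `r` first-stage
# candidates with MaxSim, then append the untouched tail in first-stage order,
# so only `r` embeddings cross the SSD/PCIe link per query.
def partial_rerank(first_stage_ids, query_tokens, fetch_embeddings, maxsim, r=128):
    head, tail = list(first_stage_ids[:r]), list(first_stage_ids[r:])
    embeddings = fetch_embeddings(head)
    head.sort(key=lambda i: maxsim(query_tokens, embeddings[i]), reverse=True)
    return head + tail                        # final ranked list
```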

3.4 ESPN-LIVE: real-time embedding synthesis

Fig. 6 ESPN-LIVE

A significant challenge in modern GPU-based workloads is the underutilization of GPU resources due to delays caused by I/O and other dependencies. In the neural retrieval pipeline, GPU-based computation accounts for less than 15% of the total retrieval time, with the majority spent on CPU-driven ANN searches and index retrievals. Leveraging idle GPU resources during these periods offers an opportunity to improve both memory and storage efficiency. With their growing computational power, GPUs can now perform many BERT inferences simultaneously through batching. Table 4 illustrates optimized BERT latency at different batch sizes on various NVIDIA GPUs. This indicates that, for small query workloads, document embeddings can be computed on demand with minimal encoding overhead.

Table 4 Comparison of BERT latency across GPUs and batch sizes using INT8 precision
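
A rough way to reproduce this kind of measurement, using FP16 Hugging Face transformers as a stand-in for the optimized INT8 engines behind Table 4 (model name, sequence length, and batch sizes are illustrative; absolute numbers will differ):

```python
# Rough batched-encoding latency probe (FP16 transformers as a stand-in for the
# optimized INT8 engines used in Table 4; the first batch includes warm-up cost).
import time
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").half().cuda().eval()

docs = ["example document text"] * 128
for batch in (8, 32, 128):
    inputs = tok(docs[:batch], padding=True, truncation=True,
                 max_length=256, return_tensors="pt").to("cuda")
    torch.cuda.synchronize(); t0 = time.perf_counter()
    with torch.no_grad():
        _ = model(**inputs).last_hidden_state      # token-level embeddings
    torch.cuda.synchronize()
    print(f"batch={batch}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
```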

ESPN-Lite introduced a bandwidth-efficient partial re-ranking solution that retrieves only a small subset of top-k document embeddings for re-ranking. Building on this concept, we propose ESPN-LIVE (shown in Fig. 6), which eliminates the need to retrieve precomputed document embeddings from SSDs. Instead, we retrieve pre-tokenized documents and generate embeddings for a small top-k list on-the-fly. In ESPN-LIVE, the prefetching operation is replaced with early document encoding, where embeddings for an initial set of documents are computed and overlapped with the CPU-based ANN search. For any documents missed during this process, embeddings are computed during the critical path. By combining ESPN’s efficient overlapping pipeline with approximate re-ranking and on-demand embedding synthesis, ESPN-LIVE completely removes the need to store multi-vector document embeddings in storage, making it highly effective for small query workloads.
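
Structurally, the only change relative to the prefetch sketch above is that the background task encodes documents rather than fetching precomputed embeddings; fetch_tokens and encode_docs are placeholders:

```python
# ESPN-LIVE sketch: overlap on-the-fly encoding of the early candidate list with
# the remaining ANN search; any documents missed by the early pass are encoded
# in the critical path. No precomputed embeddings are read from storage.
from concurrent.futures import ThreadPoolExecutor

def retrieve_live(index, query, k, eta, delta, fetch_tokens, encode_docs):
    with ThreadPoolExecutor(max_workers=1) as pool:
        index.nprobe = delta
        _, early_ids = index.search(query, k)
        early = pool.submit(lambda ids: encode_docs(fetch_tokens(ids)),
                            list(early_ids[0]))

        index.nprobe = eta
        _, final_ids = index.search(query, k)

        embeddings = early.result()
        missed = [i for i in final_ids[0] if i not in embeddings]
        embeddings.update(encode_docs(fetch_tokens(missed)))  # critical-path encoding
        return final_ids[0], embeddings
```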

4 Evaluation

In this section, we evaluate the performance and scalability of ESPN and its extension, ESPN-LIVE. Our evaluation focuses on demonstrating the effectiveness of ESPN’s key techniques, including SSD-based embedding retrieval, prefetching strategies, and real-time embedding synthesis, across several metrics. The results are based on experiments conducted with the publicly available ColBERTer model on the MS-MARCO v1 and v2 datasets.

4.1 Experimental setup

The ESPN framework comprises approximately 600 lines of C++ and 1100 lines of Python code, leveraging Nvidia GPUDirect Storage for direct data transfers and FAISS for efficient ANN search. Experiments were conducted on the MS-MARCO v1 and v2 datasets using an Intel Xeon W-2255 CPU, an Nvidia A5000 GPU, and a Samsung PM983 SSD. For MS-MARCO v1, we used Faiss with an IVFFlat index (4.3 GB); for MS-MARCO v2, we used Faiss with IVFPQ (m=128, nbits=8), resulting in a 17.5 GB ANN index. In ESPN and ESPN-Lite, these indices were cached in system memory, while the re-ranking embedding tables (16.8 GB for v1 and 255.4 GB for v2) were stored on SSDs. Embedding retrieval and prefetching were configured to achieve 90% prefetcher hit rates, with memory configurations simulated using cgroups [37]. Latency measurements encompassed all retrieval pipeline stages to assess end-to-end query performance. In the approximate methods, ESPN-Lite and ESPN-LIVE, we re-rank only the top 128 documents returned by the ANN search, whereas the exact solutions re-rank the top 1024 documents.
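
For reference, the two ANN indices described above correspond to FAISS constructions of roughly the following form (the vector dimensionality and nlist values are illustrative assumptions; training, adding vectors, and the metric choice are omitted):

```python
# FAISS constructions matching the ANN indices described above. The vector
# dimensionality and nlist values are illustrative; index.train()/index.add()
# are omitted for brevity.
import faiss

d = 128  # dimensionality of the candidate-generation vectors (assumed)

# MS-MARCO v1: IVF with exact (flat) vectors stored in each inverted list.
ivf_flat = faiss.index_factory(d, "IVF65536,Flat")

# MS-MARCO v2: IVF with product quantization (m=128 subquantizers, nbits=8).
ivf_pq = faiss.index_factory(d, "IVF262144,PQ128x8")
```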

4.2 Prefetcher results

Fig. 7 Prefetcher hit rate vs. prefetch step. v1 = MS-MARCO v1, v2 = MS-MARCO v2 dataset

Figure 7 illustrates the prefetcher’s hit rate across varying prefetch steps, expressed as a percentage of nprobe. For the v1 dataset (nprobe = 1000 and 3000), the hit rate improves significantly from 74% and 85% at 5% prefetch step to over 90% at 30%. Similarly, for the v2 dataset (nprobe = 160 and 200), the hit rate rises from 68% and 70% at 5% to approximately 89% and 90% at 30%. The results show that we can effectively prefetch the majority of embeddings, reducing the need to access many embeddings in the critical path.

4.3 End-to-end retrieval results

Tables 5 and 6 show the average end-to-end query latency under different memory configurations for MS-MARCO v1 and v2, respectively, using the publicly available ColBERTer model. This section includes exact retrieval results with no approximations under different memory configurations (mmap, virtual memory, GDS, ESPN) and approximate retrieval results (ESPN-Lite, ESPN-LIVE). The retrieval scores achieved by the memory-based and storage-based (GDS, ESPN) methods are equivalent and are reported in Table 3. ESPN-Lite and ESPN-LIVE incur a negligible quality degradation of only 0.5% in MRR@10.

Table 5 End-to-end Query Latency (ms) with Different Memory Configurations in MS-MARCO v1. * indicates that the embedding table was cached in memory. \(^\dagger\)Approximate solutions
Table 6 End-to-end Query Latency (ms) with Different Memory Configurations in MS-MARCO v2. \(^\dagger\)Approximate solutions

When the embedding tables are small enough to fit in memory, mmap performs efficiently by leveraging system memory caching after the initial access. However, as index sizes exceed available memory, mmap incurs significant software overhead, as demonstrated in Table 5 with a 10-GB memory limit. For MS-MARCO v2, where the index far exceeds physical memory capacity, this overhead becomes substantial, making mmap impractical for large-scale retrieval tasks. ESPN addresses this challenge, achieving 3.1\(-\)3.9\(\times\) faster query latencies than mmap by mitigating these overheads. Using swap space partially alleviates mmap’s inefficiencies by prefetching 8 pages per fault, but this approach is limited by the combined size of memory and swap space.

GDS reduces software overhead but retains SSD latency in the critical path. ESPN achieves near-memory-level query latency without storing BOW embeddings in memory (Table 5) and offers a practical approximation of a fully cached solution, with \(\le 1ms\) embedding access latency (Table 6). ESPN delivers competitive end-to-end latency (\(1.02\times\)) compared to fully cached solutions while storing only 6-19% of the retrieval indices in memory, reducing memory usage by \(5-16\times\) depending on ANN index size.

ESPN-Lite improves the system’s scalability by re-ranking only the top-ranked documents (128) at a marginal cost of a 0.5% reduction in MRR@10. In contrast, ESPN-LIVE delivers comparable retrieval performance while obviating the need to store re-ranking embedding tables, thereby alleviating storage constraints. However, this approach is limited in batch-size scalability: larger query batches incur higher encoding costs. The benefits of ESPN-LIVE are therefore most pronounced in systems with large document collections, which drive up the memory and storage demands of the retrieval indices, but that operate at low query rates.

4.4 Large-scale query throughput

The performance gap between memory-based and SSD-based solutions largely stems from embedding access latency in the pipeline’s critical path. If SSD access latency approaches memory levels, similar query throughput can be maintained for large batches. To evaluate ESPN’s scalability, we fixed a prefetch budget and modeled embedding retrieval for exact and approximate solutions. Exact solutions retrieve 1024 document embeddings per query, while approximate solutions retrieve only the top 128 embeddings. SSD latency measurements were obtained using Nvidia’s gdsio tool [38].

Fig. 8 Batch query latency and throughput on the MS-MARCO v1 dataset with prefetch step = 10%

Fig. 9 Bandwidth-efficient solution: query batch size vs. critical-path embedding access latency on the MS-MARCO v1 dataset with prefetch step = 10%

Figure 8 shows the end-to-end query throughput at larger batch sizes for the exact retrieval solutions. In the end-to-end system, ESPN is competitive with the DRAM-based solution up to a batch size of 12 and improves throughput over GDS-based retrieval by 68%. Figure 9 shows the embedding access latency in the critical path of the query. Naive GDS-based solutions incur 7.7\(\times\) higher embedding access latency than DRAM. ESPN narrows this gap by prefetching the relevant embeddings during the ANN search and retrieving only the small portion of missed documents in the critical path. These results also show that ESPN-Lite is competitive with fully cached DRAM-based solutions up to a batch size of 96. ESPN-Lite reduces the bandwidth requirements by 8\(\times\), enabling the prefetcher to stay within its budget even for larger batch sizes.

5 Related work

Several works have explored SSD-based systems to boost memory efficiency in neural information processing systems. For example, DiskANN [39] and SPANN [40] offload parts of the ANN index to storage, yet they focus solely on ANN search rather than the entire neural IR pipeline. In contrast, our work targets end-to-end multi-vector retrieval by offloading a substantial re-ranking index to SSDs while incorporating a flexible prefetcher with early re-ranking to mitigate SSD latencies. Moreover, while many state-of-the-art systems reduce memory demands through token dropping, quantization, and compression [31, 41, 42], they do not directly optimize for multi-vector embedding retrieval. To the best of our knowledge, ESPN and ESPN-LIVE are the first to address this gap. Recent work on multi-vector optimization has explored mapping ColBERT embeddings to a sparse space for efficient retrieval [43] and reducing memory usage with memory mapping while accelerating retrieval via hybrid scoring [44]. Our approach seamlessly integrates with any multi-vector model by modifying the retrieval pipeline without altering the model weights.

6 Limitations and future work

Optimizing retrieval by using larger I/O sizes (e.g., 8KB) and packing more token embeddings per block can enhance throughput. However, PCIe bandwidth remains a bottleneck compared to GPU HBM bandwidth, particularly in GPU-centric architectures using SSDs for large batch sizes. Scaling such systems for higher throughput is constrained by the limited bandwidth and efficiency of current storage interfaces.

Future directions should explore advanced RAID configurations (e.g., RAID 0) to aggregate multiple SSDs and scale bandwidth effectively [45, 46]. Scalability and bandwidth challenges in GPU-centric architectures require dynamic memory management approaches that effectively balance memory across GPU, CPU, and SSDs. CXL interconnects offer a promising avenue for distributed memory solutions, potentially enabling higher bandwidth and lower latency for large-scale retrieval systems [47, 48].

Expanding ESPN’s design principles to LLM inference systems presents an exciting opportunity. By offloading storage-intensive tasks such as key-value caching or embedding retrieval, these systems can handle growing dataset sizes more efficiently while maintaining low latency. Furthermore, overlapping IR latency with text generation in Retrieval-Augmented Generation (RAG) pipelines and dynamically caching frequently accessed embeddings in high-bandwidth memory tiers can enhance system performance.

7 Conclusion

The increasing prevalence of GPU-centric workloads in AI and ML has underscored the critical need to address escalating memory demands while maintaining compute efficiency. This paper demonstrates the significant potential of GPU-centric architectures augmented with SSDs to address the memory and compute challenges posed by modern AI/ML workloads. By leveraging GPUDirect Storage (GDS) and introducing ESPN, the study bridges the gap between storage latency and GPU throughput, effectively reducing memory overhead while maintaining low query latency through advanced ANN-based prefetching. The introduction of ESPN-LIVE further underscores the scalability and efficiency of this approach, particularly in scenarios involving large-scale databases with low query rates. Experimental results validate the viability of these optimizations, showcasing significant gains in scalability, retrieval latency, and system throughput.