1 Introduction

The continuous growth of data gathering and processing, which is fueled by cheap sensors (e.g., in smart phones and wearables), cheap storage, and efficient machine learning algorithms, enables many useful applications and powerful online services. However, this data processing is also a huge risk for the individual’s privacy, as users of these services become more and more transparent and reveal possibly sensitive data to an untrusted service provider.

Yet, since the 1980s [32, 33] it has been known (theoretically) that any computation over sensitive data from multiple parties can be performed securely, such that the participating parties do not learn more about the inputs of the other parties from the computation than they can already derive from the output. Consequently, this form of generic secure multi-party computation (MPC) is a powerful privacy-enhancing technology that provides a solution for the aforementioned privacy problems by enabling computation over sensitive data in untrusted environments. MPC has developed rapidly in recent years, with many new protocols using various cryptographic primitives, e.g., [1, 9, 13, 32]. Moreover, many theoretical and practical optimizations, e.g., [3, 19, 36], have made these protocols ready for practice.

Almost all MPC protocols have in common that they compute functionalities in the circuit computation model. Thus, to compute a function f, the function has to be represented as a Boolean or arithmetic circuit \(C_f\). Unfortunately, every random memory access in this model requires a scan of the complete memory, which renders MPC protocols impractical for any data-intensive application. To overcome this performance barrier, Gordon et al. [15] proposed the idea of RAM-SC, later refined by Liu et al. [22], which combines MPC with Oblivious RAM (ORAM) [14]. RAM-SC performs the same MPC computations in part, yet every RAM access is evaluated (more efficiently) using an ORAM protocol. ORAMs obfuscate each RAM access by producing a sequence of physical accesses that is indistinguishable from a random access pattern.

Many ORAMs using a wide range of constructions have been proposed, e.g., [20, 24, 29], and recently also new or adapted ORAMs optimized for RAM-SC have been presented, e.g., [12], SCORAM [31], Circuit ORAM (C-ORAM) [30], optimized Square-Root ORAM (SQ-ORAM) [37], and FLORAM [11]. Even though RAM-SC is asymptotically more efficient than MPC, it is almost impossible to identify by hand a suitable ORAM that achieves optimal runtime: owing to the complex cost models of the ORAMs, runtimes can differ by multiple orders of magnitude.

For instance, the array size influences the ORAM choice. Namely, all ORAMs have different ranges of use, e.g., SQ-ORAM is very effective for smaller RAMs, whereas C-ORAM is asymptotically the fastest ORAM. Yet, not only the number of accesses is relevant, but also the access type: read or write access with private index or public index (i.e., the stored data is still encrypted, yet the accessed position is known). For example, array accesses with publicly known index can be performed at little cost in FLORAM, but have to be performed with costs similar to an access with private index in C-ORAM. Additionally, the RAM initialization pattern within the program itself influences the RAM-SC runtime. Also, the environment in which the protocol is executed has to be taken into account, as all ORAMs have different communication and computation patterns. This includes the properties of the network connection (bandwidth and latency), but also the computational power of the executing hardware. In conclusion, an optimal ORAM choice requires considering all of the aforementioned parameters. Up to now, creating an efficient RAM-SC program has been a tedious task for a developer, as it requires an array usage statistic of the input program and in-depth knowledge about ORAMs and their deployment costs.

Contribution. To make RAM-SC accessible for non-domain experts, we present an automatized framework that analyzes which ORAMs (if at all) should be used to achieve optimal runtime for a RAM-SC program with a given number of array accesses and a deployment scenario. Moreover, by implementing the framework in the CBMC-GC compiler by Holzer et al. [17], we illustrate a compile-chain from generic ANSI-C code into a RAM-SC program.

In contrast to previous work, such as SCVM [22] and ObliVM [23], which statically decide for or against a single ORAM, our approach is aware of all aforementioned cost dimensions of RAM-SC. Namely, we automatically identify all array accesses (individually for each array in the input code), determine an optimal ORAM choice depending on the access pattern, which includes an optimal selection of ORAM parameters, and automatically partition the code into circuit based computations and ORAM accesses.

For this purpose, we revisit C-ORAM, which is the most efficient tree-based ORAM optimized for MPC, as well as SQ-ORAM and FLORAM, which have both been developed to outperform C-ORAM for mid-sized arrays, to develop a library with gate-precise cost models. This library allows runtime estimations for arbitrary access patterns, ORAM sizes, and deployment scenarios within seconds, which is multiple orders of magnitude faster than benchmarking all ORAMs in an actual deployment scenario. Moreover, the library can compute multiple different cost metrics, e.g., to determine which ORAM has minimal communication complexity in a given scenario. As a side-product of our studies, we present practical optimizations for all ORAMs that reduce the runtime of each access by up to a factor of two. Furthermore, we present the first extensive study on RAM-SC runtimes for different real-world deployment scenarios and show that the use of ORAMs over purely circuit-based computations is often only useful for arrays larger than one might assume.

Outline. Preliminaries and related work are described in the next section. We study the different ORAMs and propose optimizations in Sect. 3, before describing the compiler in Sect. 4. Finally, an evaluation of our approach is given in Sect. 5.

2 Preliminaries and Related Work

2.1 Secure Multi-party Computation (MPC)

MPC protocols are cryptographic protocols performed between two or more parties that allow a joint computation of a functionality \(f(x_1,x_2,\dots )\) over the private inputs \(x_1,x_2,\dots \) of the participating parties \(P_1,P_2,\dots \) with guaranteed correctness and privacy, i.e., the parties do not learn more about the other parties’ inputs than they could already derive from the observed output. In this work we present our ideas for one of the most researched two-party protocols, namely Yao’s garbled circuits protocol [32, 33]. Moreover, we focus on the semi-honest (passive) adversarial model, yet remark that many ideas presented in this work can be generalized and transferred to other protocols.

Functionalities in Yao’s protocol are expressed as combinatorial Boolean circuits, which consist of a number n of Boolean gates, two sets of input wires, and two sets of output wires, one for each party. Boolean circuits for MPC are constructed similarly to circuits in digital hardware design, yet with the major difference that linear gates (e.g., XOR) are favored over non-linear gates (e.g., AND). This is because non-linear gates require noticeably more computation and communication to be evaluated in an MPC protocol [19]. Thus, the major goal in circuit design for Yao’s protocol is to minimize the number of non-linear gates.
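To make this design goal concrete, the following minimal Python sketch counts non-linear gates for a multiplexer over b-bit values. It is an illustration only: the names GateCounter and mux are assumptions introduced here, and the gates are evaluated on plain bits rather than inside a garbled circuit. The multiplexer is built as \(y \oplus (s \wedge (x \oplus y))\), which needs one AND gate per bit, while the XOR gates are not counted.

```python
class GateCounter:
    """Evaluates gates on plain bits and counts only non-linear (AND) gates."""

    def __init__(self):
        self.non_linear = 0

    def xor(self, a, b):
        # Linear gate: not counted.
        return a ^ b

    def and_(self, a, b):
        # Non-linear gate: counted.
        self.non_linear += 1
        return a & b


def mux(gc, s, x_bits, y_bits):
    """Return x_bits if s == 1, else y_bits, using one AND gate per bit."""
    return [gc.xor(y, gc.and_(s, gc.xor(x, y))) for x, y in zip(x_bits, y_bits)]


gc = GateCounter()
selected = mux(gc, 1, [1, 0, 1, 1], [0, 0, 1, 0])
```

Under this construction, selecting between two 4-bit values costs four non-linear gates, independent of the number of XOR gates involved.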

2.2 Oblivious RAM

Oblivious RAM (ORAM), first introduced by Goldreich and Ostrovsky [14], is a cryptographic primitive that obfuscates the access pattern to an outsourced storage to achieve memory trace obliviousness. To this end, each logical access on some virtual address space is translated into a sequence of physical accesses on the memory, which appears to be random to observers, resulting in the security guarantee that two sequences of virtual accesses of the same length produce indistinguishable physical access patterns.

ORAMs are commonly modeled as a protocol between an ORAM client, who is the data owner, and an untrusted ORAM server, who provides the physical storage. Typically, an ORAM construction is comprised of two distinct algorithms, the initialization and the access algorithm. An ORAM has a capacity m, which describes the number of data elements it can store. Moreover, most ORAMs require to store metadata for each data element, which in combination with the element itself is referred to as block.

The design goals of standalone ORAM constructions are manifold, e.g., minimizing client side storage, communication or computation costs. Therefore, many optimized ORAMs have been proposed, e.g., [14, 20, 24, 29]. For their combination with MPC (described in Sect. 2.3), a different cost model applies, because the ORAM client has to be evaluated as a circuit. In this work, we study the most efficient known ORAMs for MPC, namely Circuit-ORAM (C-ORAM) [30], optimized Square-Root ORAM (SQ-ORAM) [37], FLORAM [11], and the FLORAM variant CPRG (FCPRG) [11]. A description of all ORAMs is given in Sect. A.1.

2.3 RAM Based MPC (RAM-SC)

MPC protocols evaluate functionalities represented as circuits. Circuits allow to express arbitrary computations, yet random memory accesses have to be expressed as a chain of multiplexers of the complete memory, referred to as linear scan (LS). This limits MPC for applications that rely on dynamic memory accesses. Therefore, Gordon et al. [15] proposed to combine MPC protocols with ORAM to enable dynamic memory accesses with sublinear overhead. The authors describe a RAM machine, where the circuit computes an oblivious machine that evaluates instructions and memory accesses. A complete RAM machine is often not necessary, and thus the so-called RAM-SC model was later refined by Liu et al. [22] for practical efficiency. Its major concepts are described in the following paragraphs.

First, the parties performing the MPC protocol also act as a distributed ORAM server, and the ORAM client is implemented as a circuit evaluated by the MPC protocol itself. Thus, both roles are shared between the computing parties. Second, a program is evaluated by interweaving the MPC protocol with oblivious ORAM accesses. Consequently, a RAM-SC program consists of many small protocols that either perform a computation or an ORAM access. This behavior is illustrated by way of example in Fig. 1.

Fig. 1. Exemplary and simplified illustration of RAM-SC. A program flow is illustrated that is computed within an MPC protocol, run between two parties \(P_1\) and \(P_2\). At some point, a value is read from an array with virtual index 5. Therefore, a circuit representing the ORAM client functionality is executed that translates the virtual index into multiple physical addresses. These addresses are revealed to both parties, who enter the blocks as input to the MPC protocol.

The construction of RAM-SC as described is very generic because it allows combining different MPC protocols and ORAMs. We observe that in one RAM-SC program multiple ORAMs of possibly different types can be used, e.g., one ORAM for each array in the input program. Moreover, as in standalone ORAMs, the blocks stored on the ORAM server have to be encrypted. This can be realized in three ways: by performing an encryption and decryption within a circuit, which requires (even when highly optimized) a substantial number of gates, e.g., 5000 non-linear gates to encrypt a single 128-bit block with AES [4]; by using a secret sharing scheme, e.g., XOR sharing [11]; or by (re-)soldering the existing garbled labels based on the publicly revealed index [37]. In the XOR sharing approach, a physical block is read by entering the shares as input to the MPC protocol, which are then recombined within the protocol. Similarly, to write to one or multiple blocks, the MPC protocol outputs one share for each block to every party. When using the soldering approach, the circuit garbler re-uses the existing wire labels but remaps them according to the accessed indices revealed to both parties, similar to a multiplexer (array) access with public index. We also remark that in RAM-SC, the ORAM access type, i.e., read or write, can be revealed to both parties, as the algorithm description is seen as public knowledge. This access type is also referred to as semi-private access [11].
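The XOR sharing approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names xor_share and recombine are hypothetical, the shares are generated locally rather than as output of an MPC protocol, and the block width of 128 bits is chosen only as an example.

```python
import secrets


def xor_share(block: int, bits: int = 128):
    """Split a block into two XOR shares; each share alone is uniformly random."""
    share_1 = secrets.randbits(bits)
    share_2 = block ^ share_1
    return share_1, share_2


def recombine(share_1: int, share_2: int) -> int:
    """Recombine the shares, as would happen inside the MPC protocol."""
    return share_1 ^ share_2


block = 0x00112233445566778899AABBCCDDEEFF
s1, s2 = xor_share(block)
restored = recombine(s1, s2)
```

Since recombination consists of XOR operations only, it adds no non-linear gates to the circuit, in contrast to the in-circuit AES approach.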

Security. RAM-SC provides the same privacy and correctness properties as traditional MPC protocols [15]. As in [15, 22] we focus on the semi-honest setting in this work.

Complexity. The computation and communication complexity of a RAM-SC protocol depends on the circuit complexity of the computation, the circuit complexity of the ORAM client, the number of protocol rounds, as well as additional ORAM protocol costs that are performed outside of the MPC protocol. For ORAMs with less than O(m) computations or less than O(m) bandwidth RAM-SC is (asymptotically) more efficient than any circuit based MPC protocol.

Oblivious Data Structures. Related to the work on RAM-SC is the work on structured memory accesses in MPC. For example, Zahur and Evans [34] as well as Keller and Scholl [18] have studied dedicated data structures, such as oblivious stacks or queues, that can outperform the generic ORAM solution for applications with the corresponding access pattern.

2.4 Compilation for MPC and RAM-SC

Jointly with the first practical MPC implementation, Malkhi et al. [25] realized the need for tool support and presented the first compiler for MPC. Subsequently, many compilers for Boolean (and Arithmetic) circuit based MPC have been proposed, e.g., TASTY [16], CBMC-GC [17], or Frigate [26].

The first compiler that combines ORAMs and MPC, named SCVM, was proposed by Liu et al. [22]. In a follow-up work, Liu et al. presented the ObliVM [23] compiler, and also adapted their work to the needs of ORAM-supported hardware synthesis [21]. All these compilers translate a domain-specific or annotated language and compile specially marked arrays into RAM-SC programs using a single ORAM type. Although this simplifies the development effort for RAM-SC, the developer is still required to have expert knowledge in ORAMs. The Obliv-C compiler by Zahur and Evans [35] is a recent compiler that allows jointly compiling public and private computations, and has therefore been used to implement ORAM protocols. However, it does not primarily target RAM-SC and therefore does not provide any form of automatization for RAM-SC.

3 Analysis and Optimization of ORAMs for Secure Computation

In order to precisely determine the best-suited ORAM for a RAM-SC application, in this section we revisit the most efficient ORAMs for RAM-SC to establish gate-precise cost models. These models allow the approximation of runtime costs in any RAM-SC deployment, which forms the basis for the optimizing compiler in Sect. 4. Since RAM accesses are basic primitives for any algorithm, they should be optimized to the full extent. Therefore, we also propose gate-level optimizations for all ORAMs. We begin with a description of implementation pitfalls observed in previous implementations, which can lead to inefficient RAM-SC.

3.1 Pitfalls of ORAM Implementations for MPC

ORAMs are complex cryptographic primitives, and thus substantial engineering effort is necessary to translate them into efficient circuit representations as required for RAM-SC. Consequently, the majority of ORAM implementations in MPC are written in high-level languages for MPC and translated using compilers for MPC. Unfortunately, due to the immaturity of tools, compilers, and programming paradigms, a straightforward high-level implementation does not automatically translate into an efficient circuit description. Thus, while revising the ORAMs and their implementations, we identified the following inefficiencies and provide hints for future implementations:

Overallocation of Internal Variables. Some MPC compilers use fixed bitwidths for all program variables. For example, leaf identifiers for any tree-based ORAM scheme can be represented as bit strings of \(\log {(m)}\) bits. Consequently, for small to medium numbers of elements m, e.g., \(m<2^{32}\), a fixed integer bitwidth of 32 bit introduces a noticeable overhead in the number of used gates, which also propagates to subsequent (possibly recursive) computations. Therefore, it is preferable either to use optimizing compilers, such as CBMC-GC [17], or to adjust the bitwidth accordingly.
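As a rough illustration of this overhead, the following Python sketch compares the number of bits needed for a leaf identifier with a fixed 32 bit integer. The helper names leaf_id_bits and wasted_bits are assumptions made for this example; the actual gate overhead depends on which operations are applied to the overallocated variable.

```python
def leaf_id_bits(m: int) -> int:
    """Bits needed to address m leaves, i.e., ceil(log2(m)), but at least 1."""
    return max(1, (m - 1).bit_length())


def wasted_bits(m: int, fixed_width: int = 32) -> int:
    """Bits per identifier that a fixed-width integer carries unnecessarily."""
    return fixed_width - leaf_id_bits(m)


bits_1024 = leaf_id_bits(1024)
waste_1024 = wasted_bits(1024)
```

For m = 1024, ten bits suffice, so every gate that operates bit-wise on the identifier (e.g., a multiplexer with one non-linear gate per bit) would process 22 superfluous bits with a 32 bit type.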

Insufficient Constant Propagation. Constants are not always properly identified and propagated by some compilers, especially between multiple functions, which can result in cascading effects of significant circuit size. This especially concerns temporary variables in conditional blocks, which could be expressed by wires without any gate costs, but are often multiplexed with all other variables in the conditional.

Duplicated Multiplexer Blocks. Conditional blocks are represented by multiplexers on the circuit level. When using if/else statements that write the same variable (with different values), some compilers introduce duplicated multiplexer blocks, one for each write. However, both can be merged into a single conditional write, which results in a smaller circuit.

Bound Checking. The most recent MPC ORAM implementations [11, 37] perform an inefficient out-of-bounds check for each array access. To prevent misbehavior, the index is masked using a modulo computation, which additionally increases the number of gates. While there is no perfect solution to this problem, as there is no unified error handling approach in MPC, several other and more efficient approaches exist. For example, an MPC compiler that is able to identify out-of-bounds accesses can be used (if possible), a faster masking scheme can be used, or for some schemes the ORAM’s size can be increased to the next power of two without a noticeable loss in runtime.

3.2 Circuit Models and Optimized ORAM Construction for MPC

To determine an optimal ORAM choice for RAM-SC, we develop parametrized cost models for all schemes, which are composed of hand-crafted cost models for all circuit building blocks, e.g., conditional swap, adder, or shuffle. A modular construction of all ORAM schemes makes it possible to adapt to future improved building blocks, to recombine different ORAM schemes (e.g., for the recursive position map), and to evaluate different implementation options.

The developed models are based on the papers and their implementations [11, 30, 37] and precisely consider the number of non-linear gates, the communication complexity (rounds and bandwidth), and auxiliary computation costs, i.e., computations performed outside of secure computation. We use optimal bitwidths for variables and avoid the earlier described pitfalls. Due to the lack of space, we do not elaborate on the created models, but focus on their optimization. We begin with a study of the trivial circuit solution.

Trivial Circuit Solution. Traditionally, MPC compilers translate a dynamic array access into a linear scan (LS) of the complete memory to hide which position was actually accessed. The most efficient MPC circuit construction for an LS read is based on a multiplexer tree that bit-wise encodes the accessed index over the stages of the tree. For write accesses, a decoder of \(m-1\) non-linear gates is used to convert the index to a so-called one-hot code, where each bit of the decoder’s output is connected to a multiplexer, which selects either the element to write or the previous data [7]. In contrast to ORAM schemes, the elements are not shared between the parties but reside inside the garbled circuit. Hence, while LS has a significant circuit size for a growing number of elements, it is very efficient in the case of networks with high latencies and smaller array sizes, as accesses can be performed in zero rounds and without any initialization.
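The structure of both LS accesses can be sketched functionally in Python. The sketch below operates on plain values and is not the construction of [7] itself; the function names are hypothetical. The cost formula in ls_cost is an assumption: it charges b non-linear gates per multiplexer over b-bit elements and \(m-1\) non-linear gates for the decoder as stated above, and it requires m to be a power of two.

```python
def ls_read(memory, index_bits):
    """Multiplexer-tree read. index_bits is little-endian and
    len(memory) must equal 2**len(index_bits).
    Returns (value, number of element-wide multiplexers used)."""
    level, muxes = list(memory), 0
    for bit in index_bits:  # one tree stage per index bit
        nxt = []
        for i in range(0, len(level), 2):
            nxt.append(level[i + 1] if bit else level[i])  # 2:1 multiplexer
            muxes += 1
        level = nxt
    return level[0], muxes


def ls_write(memory, index, value):
    """One-hot write: every cell is multiplexed between new and old content."""
    one_hot = [int(i == index) for i in range(len(memory))]  # decoder output
    return [value if sel else old for sel, old in zip(one_hot, memory)]


def ls_cost(m, b):
    """Assumed non-linear gate counts for m elements of b bits each."""
    return {"read": (m - 1) * b, "write": (m - 1) + m * b}


memory = [10, 11, 12, 13, 14, 15, 16, 17]
value, muxes = ls_read(memory, [1, 0, 1])  # index 5, least significant bit first
updated = ls_write(memory, 2, 99)
```

The read uses \(m-1\) multiplexers in total, which is why the circuit size of LS grows linearly in the number of elements.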

C-ORAM. C-ORAM [30] is known to achieve almost optimal asymptotic costs, and is thus the best-suited ORAM scheme for larger arrays. Unfortunately, C-ORAM suffers from high initialization costs, as each element has to be initially written in an ordinary ORAM access. Furthermore, C-ORAM is a multi-round protocol, where the number of communication rounds is dominated by the recursive structure of the scheme. Nevertheless, accesses to physical blocks can be performed using the soldering approach (cf. Sect. 2.3), which only requires transmitting the computed public indices.

The most recent implementation of C-ORAM [11] that we are aware of was implemented with Obliv-C, which neither optimizes the bitwidth of internal variables nor thoroughly eliminates unnecessary multiplexer blocks. This has a significant impact on the number of gates required for the eviction algorithm, where, for example, variables with bitwidth \(\log {(\log {(m)}+1)}\) are sufficient to represent the tree height. Additionally, an inefficient implementation of LS is used. Furthermore, the ReadAndRemove() operation, used in all tree ORAMs to read a path, can be optimized such that only the necessary payload and isDummy flag are accessed.

SQ-ORAM. Optimized SQ-ORAM [37] has been proposed to outperform C-ORAM for moderate array sizes, albeit being asymptotically less efficient. For small numbers of elements the circuit complexity is (surprisingly) small, as the major costs stem from the scan of the stash, i.e., the temporary cache, whose publicly known size is of at most \(\sqrt{m}\). SQ-ORAM has a substantially more efficient initialization phase in comparison to C-ORAM. Physical blocks are efficiently accessed using the soldering approach. However, similar to C-ORAM, the number of communication rounds depends on the number of recursive position maps, which is in \(\log _c{(m)}\) with c being the packing factor.

The implementation of Square-Root ORAM in Obliv-C was done by the original authors of the paper, is highly optimized, and is, to the best of our knowledge, the most efficient implementation of this scheme. For their construction the same low-level optimizations as described for C-ORAM can be applied, while the LS is already using the most efficient version.

FLORAM. FLORAM is the most recent ORAM scheme for RAM-SC. Based on private information retrieval (PIR) techniques, it requires O(m) server computations per access; however, these are performed outside secure computation and lead to very low communication complexity. For the generation of the function secret sharing (FSS), the FLORAM algorithm requires \(2\cdot \log _2{(m)}\) AES encryptions that have to be computed inside a circuit, each of which consists of \(\approx \)5000 non-linear gates. Being a constant-round protocol, FLORAM has a huge advantage over the other ORAMs in high-latency settings. Furthermore, in contrast to other ORAMs, it is possible to perform semi-private accesses at little cost, as the physical addresses of the elements correspond to the virtual addresses used. The implementation of FLORAM uses inefficient modulo operations to compute the element position inside its 128 bit data blocks, which requires an additional 6000 non-linear gates upon each access. These operations can be omitted when using a packing factor c that is a power of two, which is the case when using standard data types.

FCPRG. The CPRG optimization for FLORAM was proposed to remove the expensive computation of the many AES encryptions within MPC, so that both parties are able to compute the encryptions locally and only input their results into the secure computation for each stage of the FSS tree. Hence, it introduces a trade-off by reducing the computational effort within the MPC protocol, yet turns the constant round protocol into a multi-round protocol with \(O(\log _2{(m)})\) rounds. The implementation of the FCPRG scheme can be optimized in the same manner as the original FLORAM.

Optimal Parameter Selection for Recursive ORAMs. Most ORAM schemes come with a set of parameters that can be selected for every instantiation. For example, while maintaining the same level of security, larger buckets in tree-based ORAMs allow the use of a smaller stash [31], which influences the resulting circuit complexity and thus the RAM-SC runtime. Therefore, for an optimal ORAM instantiation in RAM-SC it is desirable to identify optimal parameters. These parameters, i.e., bucket size, stash size, number of levels in ORAMs with recursive position maps, and the eviction strategy, are (often) constrained by the desired security level as well as the failure probability (overflow of the stash). Fortunately, for most ORAM schemes, safe parameter ranges for different security configurations have been proposed [29, 30, 31]. Within these ranges, we solve the combinatorial optimization problem by exhaustive search over the parameter space, which can be performed in seconds for all schemes.
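A minimal Python sketch of such an exhaustive search is given below. Everything specific in it is a placeholder: the parameter grid, the is_secure predicate, and the cost function are toy stand-ins invented for this example and do not reflect the published safe ranges or the gate-precise models of this work. They only mimic the trade-off that a larger bucket permits a smaller stash.

```python
from itertools import product


def best_parameters(space, is_secure, cost):
    """Exhaustively search the grid; return (cost, params) of the cheapest
    parameter combination that satisfies the security predicate."""
    names = list(space)
    best = None
    for values in product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        if not is_secure(params):
            continue  # outside the safe parameter range
        c = cost(params)
        if best is None or c < best[0]:
            best = (c, params)
    return best


# Toy stand-ins (hypothetical, for illustration only).
space = {"bucket_size": [2, 3, 4, 5], "stash_size": [10, 20, 30]}


def is_secure(p):
    return p["bucket_size"] * p["stash_size"] >= 60


def cost(p):
    return 8 * p["bucket_size"] + p["stash_size"]


result = best_parameters(space, is_secure, cost)
```

Because the grid is small for every scheme, plain enumeration is sufficient and no heuristic search is needed.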

Although we only described optimizations that lead to constant improvements, in Sect. 5.2 we observe gate reductions up to 70.7% for C-ORAM, 17.9% for SQ-ORAM and up to 35.6% for FCPRG.

4 Automatized RAM-SC

In order to facilitate the broad usage of RAM-SC, we present an automatized compilation approach from ANSI-C to RAM-SC that is able to detect dynamic memory accesses in a high-level input language and that places the corresponding arrays into ORAMs without the need of any interaction, e.g., by annotations, from the programmer.

To achieve this goal, we follow a two-step approach. First, an input code analysis and transformation is performed that identifies arrays and enumerates array usage statistics. Second, an optimizer is invoked that identifies a suitable scheme for each array in the input code for a selected runtime environment, using the analysis result of the first step as well as the cost models developed in Sect. 3.

4.1 Input Code Analysis and Transformation

To transform an input source code into a RAM-SC program, a naïve compilation can be performed by iterating over the abstract syntax tree of the input source code and by translating each array and access into an equivalent RAM access. However, this approach leads to very inefficient RAM-SC programs, as not every access requires full memory trace-obliviousness. For example, arrays can also be accessed purely with public indices or with a mix of public and private indices. Moreover, the number of accesses, as well as the initialization of the array, play an important role for the performance of RAM-SC (cf. Sect. 5.1). The order of accesses is also of relevance, e.g., in the case of semi-private accesses, the stash size in FLORAM only depends on the number of writes. Therefore, for an optimized compilation it is important to create precise array usage statistics.

We implemented such a more advanced compilation approach for the CBMC-GC [17] compiler, which, of all currently available compilers for MPC, provides the most powerful symbolic execution (SE), as required for the analysis. For example, CBMC-GC performs a powerful constant propagation, which allows separating private and semi-private array accesses. Internally, CBMC-GC unrolls the input program and translates it into static single assignment form. This form is then used for an SE of the source code. During SE, every expression of the unrolled code is visited and partial evaluation is performed. Therefore, by extending the SE interface for array accesses, it is possible (i) to maintain a list of all allocated arrays, (ii) to track each access, and (iii) to distinguish semi-private and private accesses. This approach allows creating a detailed usage statistic for each array, which consists of the array size m, the element bitwidth b, an enumeration of all (semi-)private reads and writes, and an initialization pattern. Namely, we distinguish the cases that an array is initialized (i) by only one party, (ii) by using only public indices, e.g., by iterating over the array, or (iii) in a random manner purely based on private writes.
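One possible shape of such a usage statistic is sketched below in Python. The class and field names are assumptions made for illustration and do not correspond to data structures in CBMC-GC; in particular, the sketch treats an access whose index is publicly known after constant propagation as semi-private.

```python
from dataclasses import dataclass
from enum import Enum


class InitPattern(Enum):
    SINGLE_PARTY = "initialized by only one party"
    PUBLIC_INDICES = "initialized using only public indices"
    PRIVATE_WRITES = "initialized purely by private writes"


@dataclass
class ArrayUsage:
    name: str
    size: int       # m, number of elements
    bitwidth: int   # b, bits per element
    init: InitPattern
    private_reads: int = 0
    private_writes: int = 0
    semi_private_reads: int = 0
    semi_private_writes: int = 0

    def record(self, is_write: bool, index_is_public: bool) -> None:
        """Count one access, as a tracking hook might do during analysis."""
        prefix = "semi_private_" if index_is_public else "private_"
        field = prefix + ("writes" if is_write else "reads")
        setattr(self, field, getattr(self, field) + 1)


usage = ArrayUsage("table", size=1024, bitwidth=32, init=InitPattern.PUBLIC_INDICES)
usage.record(is_write=False, index_is_public=False)  # read with private index
usage.record(is_write=True, index_is_public=True)    # write with public index
```

A record of this kind per array is all that the subsequent ORAM selection step needs as input, next to the description of the deployment environment.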

To compile a RAM-SC program, the existing LS interface, which is CBMC-GC’s traditional approach to handling array accesses, is overridden such that each array read or write is replaced by input and output wires of the circuit. Using this approach, the compiler does not need to be aware of the concept of RAM-SC, as it is only concerned with the computations performed in the circuit model. Consequently, the remaining code is compiled into a circuit using the existing compilation chain of CBMC-GC. This ensures that the program profits from all implemented gate-level optimizations. To execute a compiled RAM-SC program, the inputs and outputs have to be connected to ORAM client circuits, which are selected in the second compilation step. We remark that the implementation of the ORAM protocols is outside the scope of this work and mostly an engineering task.

4.2 Optimal ORAM Selection

Given a detailed array access description, an ORAM scheme should be selected that achieves minimal costs, e.g., provides optimal runtime. For a given array description, the compiler computes a model of all ORAM schemes with the help of the ORAM library developed in Sect. 3. Furthermore, for a desired security level and each ORAM scheme, the possible parameter space is identified, i.e., the secure parameter configurations, discussed in Sect. 3.2. Finally, this combinatorial optimization problem is solved by enumerating the complete search space, consisting of all ORAMs and their possible configurations, which is manageable in seconds on commodity hardware. The optimal choice then depends on the desired evaluation metric, which currently is either the runtime or the number of transferred bits. Next, we describe how to predict the runtime in RAM-SC and remark that these ideas can also be transferred to other metrics, e.g., cloud computing costs (cf. [28]), with little engineering effort.

Runtime Estimation. Using the library developed in the previous section, the runtime of all RAM accesses within a RAM-SC program can be estimated efficiently for a computing environment specified by the developer. Namely, taking the array usage description and the security parameter \(\kappa \) into account, the library returns a gate count, the number of communication rounds, the number of OTs, and additional local costs, e.g., the FSS evaluation for FLORAM. The environment is described by three parameters, i.e., the computational power (as the non-linear gate throughput, the number of OTs that can be performed per second, and the time to evaluate an FSS scheme), the available bandwidth, and the round trip time.

For runtime approximation we assume a computing time that is linear in the number of non-linear gates and the number of OTs, which is a reasonable assumption as in practice both depend on the throughput of the AES-NI hardware extension. Thus, assuming perfect resource allocation and parallel generation of garbled tables and their transmission (known as streaming), the runtime is estimated as the sum of the time until the last gate has been evaluated by the circuit evaluator (assuming a constant garbling throughput), the time to perform OTs with OT Extension (assuming a constant OT throughput), and the number of communication rounds times the latency. The runtime for the circuit initialization can be estimated in a similar manner.
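The estimate described above can be written down as a short Python sketch. The class and function names are hypothetical, and two modeling choices are assumptions of this sketch rather than statements of the text: streaming is modeled by taking the maximum of garbling time and transmission time of two labels per non-linear gate, and the bandwidth consumed by OTs is ignored.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    gates_per_s: float      # non-linear gate throughput
    ots_per_s: float        # OT Extension throughput
    bandwidth_bit_s: float  # available bandwidth in bit/s
    latency_s: float        # delay charged per communication round


@dataclass
class AccessCost:
    non_linear_gates: int
    ots: int
    rounds: int
    local_s: float = 0.0    # e.g., FSS evaluation outside the MPC protocol


def estimate_runtime(cost: AccessCost, env: Environment, label_bits: int = 80) -> float:
    """Estimated runtime in seconds for the given cost and environment."""
    garble_s = cost.non_linear_gates / env.gates_per_s
    # Two wire labels are transmitted per non-linear gate.
    transmit_s = cost.non_linear_gates * 2 * label_bits / env.bandwidth_bit_s
    ot_s = cost.ots / env.ots_per_s
    # Streaming: garbling and transmission overlap, the slower one dominates.
    return max(garble_s, transmit_s) + ot_s + cost.rounds * env.latency_s + cost.local_s


env = Environment(gates_per_s=10e6, ots_per_s=10e6, bandwidth_bit_s=1e9, latency_s=0.005)
runtime = estimate_runtime(AccessCost(non_linear_gates=1_000_000, ots=1000, rounds=10), env)
```

In this example the link, not the CPU, is the bottleneck: transmitting the garbled tables takes 0.16 s, whereas garbling alone would take 0.1 s.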

Although this model simplifies the RAM-SC computation, we observed only moderate deviations (\(\le \)20%) from experimentally measured runtimes, and these deviations decrease with increasing RAM size. This is acceptable, especially as only the relation between different ORAM schemes is of major relevance.

Optimizing Multidimensional Arrays. Multidimensional arrays can be represented in a single ORAM or in multiple (hierarchically structured) ORAMs, where one ORAM scheme is used per dimension. The latter can be more efficient if one dimension is predominantly accessed using static indices. Therefore, our compiler evaluates both cases separately, i.e., using multiple ORAMs or a single ORAM, to identify the optimal choice.

5 Evaluation

We give a threefold evaluation of our approach for automatized RAM-SC. First, we evaluate the parameter space that influences the choice for a suitable ORAM when implementing a RAM-SC application. Second, we study the circuit optimizations presented in Sect. 3.2. Finally, we illustrate the compilation approach introduced in Sect. 4 for an exemplary use case.

Experimental Setup. Our evaluation is based on the runtime estimation described in the previous section. Assuming a state-of-the-art implementation of Yao’s protocol and a commodity CPU, at least 10 million (M) non-linear gates can be garbled per second per core (fixed-key garbling [3]), where two wire labels per non-linear gate have to be transmitted (cf. half-gates [36]). We use a security level of \(\kappa =80\) bit. Thus, each label has length \(\kappa _{gc}=80\) bit. The computation of XOR gates is assumed to be free (free-XOR [19]). Similarly, we assume an efficient OT Extension implementation with a throughput of 10 million (correlated) OTs per second [2]. Two values of length \(\kappa _{ot}=80\) bit have to be transmitted per OT. We remark that in practice these numbers could be probed in the executing environment for better accuracy, yet we also observe that these (conservative) estimates easily exceed the capacity of a 1 Gbit link. The time to compute base OTs is left out of scope, as these only need to be computed once and have practically negligible costs for any larger RAM-SC application. The computational effort for the local computations in FLORAM is taken from [11], assuming a parallelization onto four cores.
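
The remark that these throughputs exceed a 1 Gbit link follows directly from the figures above. The short Python calculation below reproduces the arithmetic; the constant and function names are chosen for illustration only.

```python
KAPPA_GC_BITS = 80         # wire label length in bit
KAPPA_OT_BITS = 80         # length of each value transmitted per OT in bit
GATES_PER_S = 10_000_000   # assumed garbling throughput per core
OTS_PER_S = 10_000_000     # assumed OT Extension throughput
LINK_BPS = 1_000_000_000   # capacity of a 1 Gbit link in bit/s


def required_bps(items_per_s: int, values_per_item: int, bits_per_value: int) -> int:
    """Bandwidth needed to stream all values at the given throughput."""
    return items_per_s * values_per_item * bits_per_value


# Two wire labels per non-linear gate, two values per OT.
garbling_bps = required_bps(GATES_PER_S, 2, KAPPA_GC_BITS)
ot_bps = required_bps(OTS_PER_S, 2, KAPPA_OT_BITS)

# Gate throughput that a 1 Gbit link can actually carry.
link_limited_gates_per_s = LINK_BPS // (2 * KAPPA_GC_BITS)

print(garbling_bps, ot_bps, link_limited_gates_per_s)
```

Both garbling and OT Extension would each need 1.6 Gbit/s at the assumed throughput, so a 1 Gbit link sustains at most 6.25 M non-linear gates per second under this simple model.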

We investigate three exemplary network settings. First, for comparison with [37], we use a data center (DC) setting, a scenario with 1.03 Gbit connectivity and a low latency of 0.5 ms. Second, we study a local area network (LAN) scenario, typical for the internal network of a larger company, with 1 Gbit bandwidth and 5 ms latency. Finally, we study a wide area network (WAN) setting as found in today's Internet, i.e., servers located on different continents, with 200 Mbit bandwidth and 50 ms latency.

5.1 RAM-SC Parameter Dimensions

We give a quantitative evaluation of the different parameter dimensions of ORAM schemes. The results of this analysis are given in Fig. 2, where the average ORAM access runtime is shown for different network settings, block sizes b, and number of accesses.

Network Settings. In the first row of Fig. 2, the runtime to perform a typical integer access with \(b=32\) bit for different ORAM sizes m is shown in the three different network settings without considering initialization costs. We observe that for latencies above or equal to 5 ms (LAN), LS is superior to all other schemes for ORAM sizes of up to \(m\approx 2^{12}\) elements; afterwards, FLORAM becomes more efficient. The efficiency of LS and FLORAM stems from the fact that they are constant (or zero) round protocols, whereas the other recursive schemes are multi-round protocols. SQ-ORAM outperforms the other schemes for mid-sized RAMs, yet its advantage decreases with increasing latency.

Blocksize. The runtime of a single ORAM access without considering initialization costs for three different block sizes, namely \(b=64,128,1024\,\mathrm{bit}\), in the DC setting is shown in the second row. In general we observe that the range of use of all ORAM schemes shifts towards smaller RAM sizes with only marginal changes in their relation to each other. Moreover, with increasing block sizes LS becomes more inefficient, because all blocks are scanned to the full extent for every access.

Number of Accesses and Initialization Amortization. The ORAM schemes have different initialization costs, which have not been considered in the previous analyses. The last row of Fig. 2 shows the total time to initialize a RAM with m values and to perform n accesses afterwards in the DC setting. We observe that LS and FLORAM have no or negligible initialization costs, whereas SQ-ORAM and C-ORAM require a certain number of accesses to amortize their asymptotic costs. In Fig. 2g and h, the amortization of SQ-ORAM’s initialization costs is shown, which is achieved with \(n \ll m\) accesses. In contrast, C-ORAM requires almost \(n \approx m\) accesses for its amortization, cf. Fig. 2i, resulting in a total amortization time of 2900 days, albeit being around 10 times faster per access than the second best ORAM, i.e., FCPRG.
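
The amortization argument can be made explicit with a simple linear cost model, in which the total time of a scheme is its initialization time plus n times its per-access time. The Python sketch below computes the break-even number of accesses between two schemes under this assumption. The function names are hypothetical, and the values in the usage example are arbitrary placeholders rather than measured ORAM costs.

```python
import math
from typing import Optional


def total_time(init_s: float, access_s: float, n: int) -> float:
    """Total time for initialization followed by n accesses (linear model)."""
    return init_s + n * access_s


def break_even_accesses(init_a: float, access_a: float,
                        init_b: float, access_b: float) -> Optional[int]:
    """Smallest n at which scheme A is no slower than scheme B.

    Returns None if A never catches up with B under the linear model.
    """
    if init_a <= init_b and access_a <= access_b:
        return 0                      # A is never slower
    if access_a >= access_b:
        return None                   # A starts behind and never gains
    if init_a <= init_b:
        return 0                      # A starts ahead and gains further
    # Solve init_a + n * access_a <= init_b + n * access_b for n.
    return math.ceil((init_a - init_b) / (access_b - access_a))


if __name__ == "__main__":
    # Placeholder values: A has a costly initialization but cheap accesses.
    n = break_even_accesses(init_a=100.0, access_a=0.25,
                            init_b=0.0, access_b=0.5)
    print("break-even after", n, "accesses")
```

Under this model, a scheme with high initialization costs only pays off if the program performs at least the break-even number of accesses, which is the situation described above for SQ-ORAM and C-ORAM.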

Summary. For small block sizes and a small number of elements, LS is the scheme of choice in any network setting. SQ-ORAM is effective in fast networks and for larger block sizes, yet has a very short range of use that must be carefully studied before deployment. In all other settings, FLORAM is the most promising ORAM. With its constant rounds and the ability to parallelize the server workload, it is significantly less constrained by the network resources that are often the limiting factor in practice. In fast networks FCPRG slightly outperforms FLORAM, but also has a comparatively high round complexity (logarithmic to base two, and not to the base of the packing factor c, as in SQ-ORAM and C-ORAM). We were unable to identify a scenario where C-ORAM amortizes its high initialization costs within less than one month of total runtime to outperform FLORAM or FCPRG.

Fig. 2. Parameter space of RAM-SC. Illustrated is the runtime of one or multiple RAM-SC accesses in seconds (or in days in Fig. 2i) for ORAMs of different size m in different configurations.

5.2 ORAM Optimizations

We evaluate the ORAM optimizations presented in Sect. 3.2 by comparing the optimized ORAMs with the latest implementation given in [11] in terms of the number of non-linear gates. The resulting circuit sizes are shown for an exemplary single write access for elements of size \(b = 32\) bit and different ORAM sizes m in Fig. 3. We observe that the break-even points between different schemes shift. For example, both FLORAM variants outperform LS for a larger number of elements than previously assumed. The improvements of the individual schemes are discussed in the following paragraph.

We observe a difference of a factor of two in the number of (non-linear) gates between the optimized LS and the LS based on equality comparators, as it has often been used in the past. This has a noticeable impact on the break-even points with the other ORAM schemes, as LS is more efficient than previously assumed. The difference between the two LS implementations becomes smaller with an increasing block size. The circuit size of C-ORAM is reduced by 40%–70%. Yet, we remark that the difference between the two implementations slightly decreases when increasing m, as all overly allocated resources are decreasingly used. The existing SQ-ORAM implementation is already highly optimized and therefore only marginal improvements are observed, i.e., for up to \(m=2^{11}\) elements, on average 12.5% of the non-linear gates are saved. We only observe marginal relative improvements for FLORAM, with savings of up to 20.8% in non-linear gates. This is because the majority of FLORAM’s circuit consists of already highly optimized AES circuits. This is not the case in FCPRG, where only two AES circuits are used per access and therefore an improvement of up to 35.7% in non-linear gates is observed.

Fig. 3. Circuit Optimization. Comparison of the circuit size (in the number of non-linear gates) between the ORAM schemes for RAM-SC in [11], illustrated with dotted lines, and the optimized circuits described in Sect. 3.2, illustrated with solid lines, for one write access of bitwidth \(b = 32\) bit and different array sizes m.

5.3 Use Case – Dijkstra Shortest Path Algorithm

We illustrate our compilation approach for an exemplary use case that has previously been studied in RAM-SC research, namely Dijkstra’s single-source shortest path algorithm [22, 23]. One party inputs a set of weighted edges between the nodes in the graph, representing the distances, as a two-dimensional array (INPUT_A_e), and the other party inputs the source and destination node, represented by the indices of the respective nodes. The algorithm (given in Sect. A.2) consists of multiple arrays that are accessed in a semi-private and private manner.

In the first step of the compilation, constants are propagated, such that unnecessary array accesses are removed. Afterwards, the array usage statistic is generated, which is illustrated for \(m = 8\) nodes in Table 1. The code uses two one-dimensional arrays, namely the (vis) array to store visited nodes and the (dis) array to store the shortest path to the source node, as well as the two-dimensional INPUT_A_e array. Shown is the analysis result when separating the two dimensions. The inner dimension of the array is always accessed using a public index, whereas the outer dimension is accessed with private indices only. Moreover, the arrays vis and dis are first written during the algorithm, whereas the weighted graph is already pre-initialized with values from Party A. In the next compilation step, the statistics are handed to the optimizer, which selects the most suitable scheme for a user-chosen deployment scenario. The runtime estimated by the ORAM library in the DC setting for the two most compute-intensive arrays is illustrated in Fig. 4 for different graph sizes m. We note that the compiler is only able to compute absolute array usage statistics, yet not parametrized formulas. Therefore, the results are based on multiple compiler runs, one for each size m. Shown is the total runtime in seconds to perform all semi-private and private array accesses for the two most efficient ORAM choices for each array. The array dis is best stored as a LS for up to \(m=2^8\) nodes, then SQ-ORAM becomes most efficient. For the INPUT_A_e array, a decomposition in two dimensions l0 and l1 is more efficient than placing it in a single ORAM. For \(m\le 2^9\) a SQ-ORAM representation of INPUT_A_e_l0 is most efficient. Albeit being a small array, the significant block size to store the second layer of the array makes LS inefficient. For \(m>2^9\) nodes, FCPRG becomes most efficient.
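
The selection step performed by the optimizer can be sketched as a minimum search over estimated runtimes. The Python code below is a simplified illustration, not the compiler's implementation: `ArrayUsage`, `select_scheme`, and the two toy estimators are hypothetical, and the toy cost functions are placeholders that do not model any of the ORAM schemes discussed here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ArrayUsage:
    """Absolute usage statistics of one array (illustrative fields)."""
    name: str
    elements: int          # number of elements m
    block_bits: int        # block size b in bit
    private_accesses: int  # accesses with a private index


# An estimator maps usage statistics to an estimated total runtime in seconds.
Estimator = Callable[[ArrayUsage], float]


def select_scheme(usage: ArrayUsage,
                  estimators: Dict[str, Estimator]) -> Tuple[str, float]:
    """Return the scheme with the smallest estimated total runtime."""
    best = min(estimators, key=lambda scheme: estimators[scheme](usage))
    return best, estimators[best](usage)


def toy_linear(usage: ArrayUsage) -> float:
    """Toy model: cost per access grows with the complete array content."""
    return usage.private_accesses * usage.elements * usage.block_bits / 1e7


def toy_fixed(usage: ArrayUsage) -> float:
    """Toy model: every access costs a fixed 10 ms, regardless of size."""
    return usage.private_accesses * 0.01


if __name__ == "__main__":
    estimators = {"toy_linear": toy_linear, "toy_fixed": toy_fixed}
    # One run per array size, since only absolute statistics are available.
    for m in (8, 2 ** 16):
        usage = ArrayUsage("dis", elements=m, block_bits=32, private_accesses=100)
        print(m, select_scheme(usage, estimators))
```

Because the statistics are absolute rather than parametrized, the loop re-evaluates the selection for every array size, which corresponds to the multiple compiler runs mentioned above.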

Table 1. Exemplary array usage statistics for \(m=8\). Statistics gathered by the compiler extension after symbolic execution.

Fig. 4. Total runtime for all accesses to the arrays dis and INPUT_A_e_l0 in Dijkstra’s algorithm.

We observe that even for simple algorithms, an automatized approach is highly beneficial, as many factors need to be considered when manually selecting ORAMs. In total, we observe a runtime of more than an hour for a moderately sized array, e.g., \(2^{10}\).

6 Conclusion and Future Work

We conclude our work with two insights. First, further automatization, i.e., tool support, is a necessity for the efficiency and thus the widespread use of RAM-SC. We presented such a tool that compiles RAM-SC programs from ANSI-C, allowing also non-domain experts to profit from RAM-SC. Our approach is also beneficial when deciding whether RAM-SC is a sufficient solution for an application or whether dedicated protocols are needed. Second, RAM-SC is only on the verge of being practical. Even in fast networks, RAM accesses create noticeable costs. As it is impossible to perform a RAM access faster than the latency, the round complexity becomes a serious bottleneck in any intercontinental deployment scenario. Consequently, for future work the automatized compilation of oblivious algorithms is promising, and parallel RAM-SC [6, 8, 27] also becomes necessary to overcome the performance barrier of multi-round RAM-SC protocols.