Extending the limits for big data RSA cracking: Towards cache-oblivious TU decomposition

https://doi.org/10.1016/j.jpdc.2019.12.016

Highlights

  • We investigate prospects for a cache-oblivious adaptation of the TURBO algorithm for solving linear systems over finite fields, necessary for adversarial attacks on RSA.

  • We compare four different matrix layouts and provide a theoretical analysis of the schemes, addressing the cost of bit operations and of table look-ups.

  • Our findings show that the conversions between the Morton or Morton-hybrid layouts and the Cartesian mapping incur the fewest operations, with the former beating the latter in practical experiments.

Abstract

Nowadays, Big Data security processes require mining large amounts of content that was not traditionally used for security analysis. The RSA algorithm has become the de facto standard for encryption, especially for data sent over the internet, and takes its security from the hardness of the Integer Factorisation Problem. As the size of the modulus of an RSA key grows with the number of bytes to be encrypted, the corresponding linear system to be solved in the adversary's integer factorisation algorithm also grows. In the age of big data, this makes it compelling to redesign linear solvers over finite fields so that they exploit the memory hierarchy. To this end, we examine several matrix layouts based on space-filling curves that allow for a cache-oblivious adaptation of parallel TU decomposition for rectangular matrices over finite fields. The TU algorithm of Dumas and Roche (2002) requires index conversion routines for which the cost to encode and decode the chosen curve is significant. Using a detailed analysis of the number of bit operations required for the encoding and decoding procedures, and factoring in the cost of the lookup tables that represent the recursive decomposition of the Hilbert curve, we show that the Morton-hybrid order incurs the least cost for the index conversion routines required throughout the matrix decomposition, as compared to the Hilbert, Peano, or Morton orders. The motivation is that cache-efficient parallel adaptations whose natural sequential evaluation order exhibits a low cache miss rate achieve faster overall performance on parallel machines with private or shared caches and on GPUs.

Introduction

Nowadays, Big Data security processes require mining large amounts of content that was not traditionally used for security analysis. An intelligence-driven security system requires the ability to analyse vast streams of data from numerous sources to produce actionable information. To that end, research on how to leverage security mechanisms and models to handle large data is required.

The RSA algorithm has become the de facto standard for encryption, especially for data sent over the internet. RSA belongs to the class of algorithms that produce asymmetric keys. Such keys are easy to exchange securely but support only a limited message size; in contrast, symmetric keys support unlimited message sizes but are difficult to exchange securely. The nature of the RSA algorithm is such that it can only encrypt a limited amount of plaintext, because the size of the modulus of an RSA key grows with the number of bytes to be encrypted. RSA takes its security from the hardness of the Integer Factorisation Problem: the RSA cryptosystem is secure only if deducing the RSA key, i.e., recovering the factorisation of the product of the two carefully chosen, sufficiently large prime numbers comprising the key, requires an enormous (super-polynomial) amount of time with respect to the size of the modulus. As a result, much research has been devoted to finding ways to factor such integers quickly, so that a benchmark for the security of RSA can be established.

Leading integer factorisation algorithms based on sieving require the solution of linear systems over the binary field (triangulating exact matrices) that grow with the size of the modulus of an RSA key. Although these systems begin sparse, they become dense very early in the process. For large linear systems over the binary field, there emerges a need for efficient exact direct solvers. A leading cause of inefficient mathematical software in the age of big data is its failure to exploit the memory hierarchy. Big data platforms such as Apache Spark identify cache-efficient algorithms and data structures as one of a handful of techniques required to push their performance closer to the limits of modern hardware [21]. Cache-efficient algorithms have been indispensable for producing highly efficient mathematical kernels for numerical linear algebra, as we see in LAPACK and BLAS. To the best of our knowledge, however, similar advances in the symbolic computation community have been extremely rare. In this work, we address a major design question: how to render a cache-efficient adaptation of the TURBO algorithm of Dumas et al. [9] for exact LU decomposition. This algorithm recurses on rectangular and potentially singular matrices, which makes it possible to take advantage of cache effects. It improves on other, more expensive methods for handling singular matrices, which otherwise have to dynamically adjust the submatrices so that they become invertible. In particular, TURBO significantly reduces the volume of communication on distributed architectures while retaining optimal work and linear span. TURBO can also compute the rank exactly.
Benchmarked against some of the most efficient exact elimination algorithms in the literature, TURBO incurs low synchronisation costs and reduces the communication cost of [14], [15] by a factor of one third when used with only one level of recursion on 4 processors. In TURBO, local TU factorisations are performed until the sub-matrices reach a given threshold, so one can take advantage of cache effects. A cache-friendly adaptation of the serial version of TURBO bears on all possible forms of parallel or distributed deployment of the algorithm. For one, nested parallel algorithms with low depth for which the natural sequential execution has low cache complexity will also attain good cache complexity on parallel machines with private or shared caches [5]. Locality of reference on distributed systems is also advocated by the Databricks group, initiated by the founders of Apache Spark: in their own terms, when profiling Spark user applications on distributed clusters, a large fraction of the CPU time was spent waiting for data to be fetched from main memory. Locality of reference is also of concern on GPUs. Although one does not have full control over locality of reference on such machines, and although GPUs rely on thread-level parallelism to hide the long latencies associated with memory access, the memory hierarchy remains critical for many applications. Finally, cache-aware applications are also deemed more energy efficient, as remarked by the green computing community. This preamble motivates our work on improving the cache performance of the serial version of TURBO.

Our contributions can be summarised as follows:

  • 1.

    We investigate prospects for a cache-oblivious adaptation of the TURBO algorithm by comparing four different matrix layouts: the Hilbert order [8], [16], the Peano order [3], [4], the Morton order [20], [26], and the Morton-hybrid order [2]. Whilst matrices on which we want to perform matrix–matrix multiplication or LU decomposition without pivoting can be serialised in any layout, the recursive TU decomposition considered in this work consistently requires permutation steps that traverse the matrix row-wise or column-wise, thus requiring index conversion from the Cartesian scheme to the recursive scheme and vice versa.

  • 2.

    Our analysis of the four schemes addresses the cost of bit operations and, where applicable, of table look-ups. Our findings show the following:

    • (a)

      The overhead of the Peano layout is prohibitive, as index conversion invokes operations modulo 3, which, unlike operations modulo powers of two, cannot be implemented with bit shifts and masks alone.

    • (b)

      Whilst the Hilbert layout has been promising for improving the memory performance of matrix algorithms in general, and although the operations for encoding and decoding in this layout can be performed using bit shifts and bit masks, we still require m iterations for a 2^m × 2^m matrix for each single invocation of encoding or decoding.

    • (c)

      In contrast, we find that the conversions for the Morton and the Morton-hybrid layouts incur a constant number of operations, assuming the matrix has dimensions at most 2^α × 2^α, where α is the machine word size. For the typical value α = 64, such matrix sizes are sufficiently large for many applications.

    • (d)

      Furthermore, although the Morton order can be encoded and decoded faster than the Morton-hybrid order, the factor of improvement is constant: ten fewer operations. In return, the Morton-hybrid layout allows the recursion to stop when the blocks being divided reach some prescribed size T×T, thus decreasing the recursion overhead. These T×T blocks are stored in row-major order, which benefits from compiler optimisations already designed for this layout. The row-major ordering at the base case also makes accessing the entries within the blocks during the inversion, multiplication, and decomposition steps of the algorithm faster and simpler, because no index conversion is required.

  • 3.

    The present manuscript is an indispensable precursor for our work in [1], where we introduce the concepts of alignment of sub-matrices with respect to the cache lines and their containment within proper blocks under the Morton-hybrid layout, and describe the problems associated with the recursive subdivisions of TURBO under this scheme.
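To make the constant-operation claim for the Morton conversions concrete, the following is a minimal Python sketch of our own (not code from the paper): branch-free Morton encoding and decoding via the standard shift-and-mask bit-dilation trick, plus a simplified Morton-hybrid index that stores T×T base blocks in row-major order. The function names and the block size T = 8 are illustrative assumptions.

```python
def morton_encode(i, j):
    """Interleave the bits of (i, j) into a Z-order (Morton) index.

    Each dilation step is a fixed sequence of shifts and masks, so the
    conversion cost is a constant number of word operations (here the
    64-bit masks handle indices up to 2**32).
    """
    def spread(v):
        # Dilate the bits of v: bit k moves to position 2k.
        v &= 0xFFFFFFFF
        v = (v | (v << 16)) & 0x0000FFFF0000FFFF
        v = (v | (v << 8))  & 0x00FF00FF00FF00FF
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0F
        v = (v | (v << 2))  & 0x3333333333333333
        v = (v | (v << 1))  & 0x5555555555555555
        return v
    return (spread(i) << 1) | spread(j)

def morton_decode(z):
    """Inverse of morton_encode: recover the Cartesian index (i, j)."""
    def compact(v):
        # Undo the dilation: bit 2k moves back to position k.
        v &= 0x5555555555555555
        v = (v | (v >> 1))  & 0x3333333333333333
        v = (v | (v >> 2))  & 0x0F0F0F0F0F0F0F0F
        v = (v | (v >> 4))  & 0x00FF00FF00FF00FF
        v = (v | (v >> 8))  & 0x0000FFFF0000FFFF
        v = (v | (v >> 16)) & 0x00000000FFFFFFFF
        return v
    return compact(z >> 1), compact(z)

def morton_hybrid_encode(i, j, T=8):
    """Morton order over T x T blocks, row-major inside each block
    (illustrative sketch; T is a tunable base-case size)."""
    bi, ri = divmod(i, T)
    bj, rj = divmod(j, T)
    return morton_encode(bi, bj) * T * T + ri * T + rj
```

For example, `morton_encode(2, 3)` interleaves binary 10 and 11 into 1101, i.e. 13, and `morton_decode(13)` recovers `(2, 3)`; within a base block, `morton_hybrid_encode` reduces to plain row-major arithmetic and needs no bit-level conversion at all.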


Space-filling curves

It is well established that traditional row-major or column-major layouts of matrices in compilers lead to extremely poor temporal and spatial locality of matrix algorithms. Instead, several matrix layouts based on space-filling curves have yielded cache-oblivious adaptations of matrix algorithms such as matrix–matrix multiplication [4], [7] and matrix factorisation [3], [11], [27]. A space-filling curve is a linear traversal of a discrete, multi-dimensional space. For example, space-filling
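As a concrete illustration of a space-filling curve as a linear traversal of a two-dimensional index space, the short Python sketch below (an example of ours, not code from the paper) enumerates the cells of a small grid in Morton (Z-order) fashion:

```python
def z_order(n):
    """Yield (row, col) pairs of an n x n grid (n a power of 2) in
    Morton order: de-interleave each linear index z into row and
    column bits, so consecutive z values trace the recursive Z curve."""
    for z in range(n * n):
        row = col = 0
        for b in range(n.bit_length() - 1):
            col |= ((z >> (2 * b)) & 1) << b      # even bits -> column
            row |= ((z >> (2 * b + 1)) & 1) << b  # odd bits  -> row
        yield row, col
```

For n = 2 this visits (0,0), (0,1), (1,0), (1,1); for larger n the traversal keeps each 2×2, 4×4, … sub-block contiguous in memory, which is the locality property the recursive layouts exploit.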

Cross-analysis of index conversion overhead

The encoding and decoding procedures are the processes needed to convert between a space-filling curve index z and the corresponding Cartesian index (x,y). To illustrate, let Θ denote a subscript associated with one of the four layouts named above. Given a Cartesian index (i,j), encoding it in the order Θ corresponds to calculating its index zΘ in the resulting matrix layout under Θ. Given an index zΘ of a matrix entry under order Θ, decoding zΘ corresponds to calculating the Cartesian index (i,j).

Computation overhead for the Hilbert order

Each of encoding and decoding requires m iterations. In each iteration, the encoding operation uses six bit operations and two table look-ups, and the decoding algorithm uses eight bit operations and two table look-ups. The first access to each table incurs a random cache miss; the tables are small enough to fit in internal memory. As row and column permutation swaps in TURBO take place consecutively in one batch, so do the conversion routines, each of which requires access to the look-up
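The m-iteration cost can be seen in the classic iterative Hilbert encoding, sketched below in Python. This is a table-free variant of ours following the standard rotate-and-accumulate scheme; the table-driven version analysed above replaces the quadrant-rotation branch with the two look-ups counted per iteration.

```python
def hilbert_encode(x, y, m):
    """Cartesian (x, y) -> Hilbert index on a 2^m x 2^m grid.

    Unlike the constant-cost Morton conversion, the loop runs m times,
    once per level of the curve's recursive decomposition.
    """
    d = 0
    s = 1 << (m - 1)          # side length of the current quadrant
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)  # which quadrant, in curve order
        if ry == 0:                   # rotate/reflect so the sub-curve
            if rx == 1:               # has canonical orientation
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s >>= 1
    return d
```

On a 4×4 grid (m = 2) the 16 indices produced form a single unbroken path: consecutive Hilbert indices always map to Cartesian neighbours, which is the locality property that makes the layout attractive despite its conversion cost.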

Security analysis and concluding remarks

The RSA algorithm has become the de facto standard for encryption, especially for data sent over the internet. RSA takes its security from the hardness of the Integer Factorisation Problem, or of solving the discrete logarithm for composite moduli. Both of these problems require solving large systems of linear equations over finite fields [23], [25]. As the size of the modulus of an RSA key grows with the number of bytes to be encrypted, the corresponding linear system to be solved in the

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.jpdc.2019.12.016.

Fatima K. Abu Salem received the B.S. degree and M.S. degree in Mathematics from the American University of Beirut, Beirut, Lebanon, and the Ph.D. degree in Computing Science from The University of Oxford, Oxford, England. She is an Associate Professor at American University of Beirut, Beirut, Lebanon. Her research interests include Algorithm Engineering in Computer Algebra, Cache-efficient and Parallel Algorithms, as well as Data Science for the social good.

References (27)

  • Chen, N., et al.

    A new algorithm for encoding and decoding the Hilbert order

    Softw. Pract. Exp.

    (2007)
  • Fisher, A.J.

    A new algorithm for generating Hilbert curves

    Softw. Pract. Exp.

    (1986)
  • Frens, J.D., et al.

    QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism



    Mira Al Arab received her B.S. and M.S. degrees in Computer Science from the American University of Beirut. She currently is associate manager of software development at Vilo.ai, Lebanon.

    Laurence Tianruo Yang received the B.E. degree in Computer Science and Technology from Tsinghua University, Beijing, China, and the Ph.D. degree in computer science from the University of Victoria, Victoria, BC, Canada. He is currently a Professor at St. Francis Xavier University, Antigonish, NS, Canada. His research interests include Parallel and Distributed Computing, Embedded and Ubiquitous/Pervasive Computing, and Big Data.
