Elsevier

Computers & Graphics

Volume 40, May 2014, Pages 1-9
Technical Section
Parallel L-BFGS-B algorithm on GPU

https://doi.org/10.1016/j.cag.2014.01.002

Highlights

  • We approximate the generalized Cauchy point with much less calculation while maintaining a similar rate of convergence.

  • We propose several new GPU-friendly expressions to compute the maximal possible step-length for backtracking and line searching.

  • We demonstrate the speedup of L-BFGS-B enabled by our parallel implementation with extensive testing.

Abstract

Due to the rapid advance of general-purpose graphics processing units (GPUs), improving the performance of non-linear optimization through parallel implementation on the GPU is an active research topic, as attested by the large body of research on parallel implementations of relatively simple optimization methods, such as the conjugate gradient method. In this context we study the L-BFGS-B method, the limited-memory Broyden–Fletcher–Goldfarb–Shanno method with boundaries, which is a sophisticated yet efficient optimization method widely used in computer graphics as well as general scientific computation. By analyzing and resolving the inherent dependencies of some of its search steps, we propose an efficient parallel implementation of L-BFGS-B on the GPU. We justify our design decisions and demonstrate significant speed-up by our parallel implementation in solving the centroidal Voronoi tessellation (CVT) problem as well as some typical computing problems.

Introduction

Nonlinear energy minimization is at the core of many algorithms in graphics, engineering and scientific computing. Owing to their rapid convergence and moderate memory requirements for large-scale problems [1], the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm and its variant, the L-BFGS-B algorithm [2], [3], [4], are efficient alternatives to other frequently used energy minimization algorithms such as the conjugate gradient (CG) [5] and Levenberg–Marquardt (LM) [6] algorithms. Furthermore, L-BFGS-B is favored as the core of many state-of-the-art algorithms in graphics, such as the computation of centroidal Voronoi tessellation (CVT) [7], mean-shift image segmentation [8], medical image registration [9], face tracking for animation [10], and the composition of vector textures [11]. Among these applications, the computation of CVT is the basis of numerous applications in graphics including flow visualization [12], image compression or segmentation [13], [14], [15], surface remeshing [16], [17], [18], object distribution [19], and stylized rendering [20], [21], [22]. Hence, a high-performance L-BFGS-B solver is desired by the graphics community for its wide range of applications.

L-BFGS-B is an iterative algorithm. After being initialized with a starting point and boundary constraints, it iterates through five phases: (1) gradient projection; (2) generalized Cauchy point calculation; (3) subspace minimization; (4) line searching; and (5) limited-memory Hessian approximation. Recently, there has been a trend towards using parallel hardware such as the GPU to accelerate energy minimization algorithms. Successful examples, including GPU-based CG [23], [24] and GPU-based LM [25], have demonstrated the clear advantages of parallelization. However, such parallelization is challenging for L-BFGS-B because some key steps carry strong dependencies, namely (2) generalized Cauchy point calculation, (3) subspace minimization, and (4) line searching. In this paper, we tackle this problem and make the following contributions:

  • We approximate the generalized Cauchy point with much less calculation while maintaining a similar rate of convergence. By doing so, we remove the dependency in the computation to make the algorithm suitable for parallel implementation on the GPU.

  • We propose several new GPU-friendly expressions to compute the maximal possible step-length for backtracking and line searching, so that it can be calculated with a parallel reduction.

  • We demonstrate the speedup of L-BFGS-B enabled by our parallel implementation with extensive testing and present example applications to solve some typical non-linear optimization problems in both graphics and scientific computing.
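To make the second contribution more concrete, the following is a minimal sketch of why a maximal step-length fits the map-then-reduce pattern. It uses the textbook bound for a box-constrained step, not the GPU-friendly expressions proposed in the paper, and it assumes the current point is already feasible. The names `step_bound`, `max_feasible_step` and the `cap` parameter are hypothetical and chosen only for this illustration.

```python
import math
from functools import reduce


def step_bound(x_i, d_i, l_i, u_i):
    """Largest a >= 0 with l_i <= x_i + a * d_i <= u_i, for one component.

    Assumes l_i <= x_i <= u_i (the current point is feasible).
    """
    if d_i > 0.0:
        return (u_i - x_i) / d_i   # moving towards the upper bound
    if d_i < 0.0:
        return (l_i - x_i) / d_i   # moving towards the lower bound
    return math.inf                # this component never hits a bound


def max_feasible_step(x, d, l, u, cap=1.0):
    """Maximal step along d that keeps x + a * d inside the box [l, u].

    The per-component bounds are independent of each other (a "map"),
    and combining them only needs the associative operator min
    (a "reduce"), which is the shape a parallel reduction requires.
    """
    per_component = map(step_bound, x, d, l, u)
    return reduce(min, per_component, cap)


if __name__ == "__main__":
    x = [0.5, 0.5]
    d = [1.0, -2.0]
    l = [0.0, 0.0]
    u = [1.0, 1.0]
    print(max_feasible_step(x, d, l, u))
```

Because `min` is associative and commutative, the reduction order does not affect the result, so on a GPU the same computation could be expressed as an element-wise kernel followed by a tree reduction.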

In the remainder of this paper, we first briefly review the BFGS family and optimization algorithms on the GPU in Section 2. Next, we review the L-BFGS-B algorithm in Section 3, and introduce our adaptation on the GPU in Section 4. Experimental results are given in Section 5, comparing our implementation with the latest L-BFGS-B implementation on the CPU [26] using two examples from different fields: the centroidal Voronoi tessellation (CVT) problem [7], [27] in graphics, and, for generality, the Elastic–Plastic Torsion problem from the classical MINPACK-2 test problem set [28] in scientific computing. Finally, Section 6 discusses the limitations of our GPU implementation and Section 7 concludes the paper with possible future work. Our prototype is open source and can be freely downloaded from Google Code (http://code.google.com/p/lbfgsb-on-gpu/).

Section snippets

Related work

We briefly review previous work on the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and its extensions, as well as previous work on GPU-based nonlinear optimization.

Algorithm

The L-BFGS-B algorithm was introduced by Byrd et al. [3]. We follow the notation of their paper to briefly introduce the algorithm in this section.

The L-BFGS-B algorithm is an iterative algorithm that minimizes an objective function $f(x)$, $x \in \mathbb{R}^n$, subject to some boundary constraints $l \le x \le u$, where $l, x, u \in \mathbb{R}^n$. In the $k$-th iteration, the objective function is approximated by a quadratic model at a point $x_k$:

$$m_k(x) = f(x_k) + g_k^T (x - x_k) + \frac{1}{2} (x - x_k)^T B_k (x - x_k),$$

where $g_k$ is the gradient at point $x_k$ and $B_k$ is the limited
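As a small illustration of the quadratic model and the box constraints introduced above, the sketch below evaluates the model at a trial point and projects a point onto the box. It is only a toy under stated assumptions: the matrix `B_k` is passed as a dense list of rows purely for readability, which is not how a limited-memory method would store it, and the function names `quadratic_model` and `project` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equally sized sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mat_vec(B, v):
    """Dense matrix-vector product; B is a list of rows."""
    return [dot(row, v) for row in B]


def quadratic_model(x, x_k, f_k, g_k, B_k):
    """Evaluate f_k + g_k^T s + 0.5 * s^T B_k s with s = x - x_k.

    f_k is the objective value at x_k and g_k its gradient there.
    B_k is a dense matrix here (an assumption made for illustration).
    """
    s = [xi - xki for xi, xki in zip(x, x_k)]
    return f_k + dot(g_k, s) + 0.5 * dot(s, mat_vec(B_k, s))


def project(x, l, u):
    """Clamp every component of x into [l_i, u_i]."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, l, u)]


if __name__ == "__main__":
    # f(x) = x_1^2 + x_2^2 expanded around x_k = (1, 1):
    # f_k = 2, g_k = (2, 2), exact Hessian = 2 * I.
    x_k = [1.0, 1.0]
    value = quadratic_model([0.0, 0.0], x_k, 2.0, [2.0, 2.0],
                            [[2.0, 0.0], [0.0, 2.0]])
    print(value)
    print(project([-1.0, 0.5, 3.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))
```

For a quadratic objective with the exact Hessian, the model reproduces the function itself, which is why the example evaluates to the true value of the objective at the origin.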

Our modifications

In the following, we explain our modifications for finding the generalized Cauchy point and for subspace minimization, which make the L-BFGS-B algorithm suitable for current GPU architectures.
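For context on where the dependency in the generalized Cauchy point search comes from, the sketch below computes the breakpoints of the projected steepest-descent path, following the standard construction of Byrd et al. [3]. This is the unmodified building block, not the approximation proposed in this paper; the function names are hypothetical and the current point is assumed feasible.

```python
import math


def breakpoint(x_i, g_i, l_i, u_i):
    """Step t at which component i of x - t * g reaches one of its bounds."""
    if g_i < 0.0:
        return (x_i - u_i) / g_i   # component increases towards u_i
    if g_i > 0.0:
        return (x_i - l_i) / g_i   # component decreases towards l_i
    return math.inf                # component does not move


def breakpoints(x, g, l, u):
    """All breakpoints; each one depends only on its own component."""
    return [breakpoint(*c) for c in zip(x, g, l, u)]


def path_point(x, g, l, u, t):
    """Point on the piecewise-linear projected path at step t >= 0.

    Each component follows -g_i until its breakpoint and then stays
    fixed at the bound it has reached.
    """
    ts = breakpoints(x, g, l, u)
    return [xi if gi == 0.0 else xi - min(t, ti) * gi
            for xi, gi, ti in zip(x, g, ts)]


if __name__ == "__main__":
    x = [0.5, 0.5, 0.5]
    g = [1.0, -2.0, 0.0]
    l = [0.0, 0.0, 0.0]
    u = [1.0, 1.0, 1.0]
    print(breakpoints(x, g, l, u))
    print(path_point(x, g, l, u, 0.25))
```

Computing the breakpoints is element-wise and therefore trivially parallel. The sequential part of the standard search is what follows: the breakpoints are ordered and the segments of the path are examined one after another, each depending on the previous one.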

Applications

We compare the efficacy and robustness of our GPU-based L-BFGS-B algorithm and the original CPU-based L-BFGS-B algorithm using two applications described below. All experiments were performed with an Intel Xeon W5590 at 3.33 GHz and an NVIDIA GTX 580 in double precision. The CUBLAS Library and the Thrust Library used are included in CUDA Toolkit version 4.2.

Limitations

Currently, the performance of our method is limited by the memory bandwidth between the global video memory and the on-chip memory (shared memory, registers, etc.). We have also tested our implementation on a Tesla C2050. Although the Tesla C2050 has a much higher peak double-precision performance (515 GFlops) than the GTX 580 (193 GFlops), its performance when running our GPU L-BFGS-B algorithm is lower. More specifically, the ratio of the performance of the two cards is

Conclusion and future work

In this paper, we presented the first parallel implementation of the L-BFGS-B algorithm on the GPU. Our experiments show that our approach makes the L-BFGS-B algorithm GPU-friendly and easily parallelized, so the time spent on solving large-scale optimizations is radically reduced. Future work includes breaking the bottleneck of memory bandwidth and exploring the parallelism of L-BFGS-B on multiple GPUs or even clusters for problems of larger scales.

Acknowledgments

This project was supported by the National Basic Research Program of China (2011CB302400, 2010CB328001), the Research Grant Council of Hong Kong (718209, 718010), the National Science Foundation of China (60933008, 61373071), and the National High-tech R&D Program (2012AA040902). The authors would like to thank Ciyou Zhu, Sylvain Henry, and Brett M. Averick for sharing their code.

References (60)

  • J. Morales. A numerical study of limited memory BFGS methods. Appl Math Lett (2002)
  • ALGLIB Project. Unconstrained optimization: L-BFGS and CG. 2013....
  • D.C. Liu et al. On the limited memory BFGS method for large scale optimization. Math Program (1989)
  • R.H. Byrd et al. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput (1995)
  • C. Zhu et al. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw (1997)
  • Hestenes MR, Stiefel E. Methods of conjugate gradients for solving linear systems;...
  • D.W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. SIAM J Soc Ind Appl Math (1963)
  • Y. Liu et al. On centroidal Voronoi tessellation—energy smoothness and fast computation. ACM Trans Graph (2009)
  • Yang C, Duraiswami R, DeMenthon D, Davis L. Mean-shift analysis using quasi-Newton methods. In: Proceedings of ICIP...
  • Chen YW, Xu R, Tang SY, Morikawa S, Kurumi Y. Non-rigid MR-CT image registration for MR-guided liver cancer surgery....
  • Hyneman W, Itokazu H, Williams L, Zhao X. Human face project. In: ACM SIGGRAPH '05 courses. ACM; 2005, p....
  • L. Wang et al. Vector solid textures. ACM Trans Graph (2010)
  • Du Q, Wang X. Centroidal Voronoi tessellation based algorithms for vector fields visualization and segmentation. In:...
  • Q. Du et al. Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev (1999)
  • Q. Du et al. Centroidal Voronoi tessellation algorithms for image compression, segmentation, and multichannel restoration. J Math Imaging Vis (2006)
  • J. Wang et al. An edge-weighted centroidal Voronoi tessellation model for image segmentation. IEEE Trans Image Process (2009)
  • Alliez P, De Verdière E, Devillers O, Isenburg M. Isotropic surface remeshing. In: Proceedings of SMI '03, 2003. p....
  • Q. Du et al. Anisotropic centroidal Voronoi tessellations and their applications. SIAM J Sci Comput (2005)
  • B. Lévy et al. Lp centroidal Voronoi tessellation and its applications. ACM Trans Graph (2010)
  • S. Hiller et al. Beyond stippling methods for distributing objects on the plane. Comput Graph Forum (2003)
  • Secord A. Weighted Voronoi stippling. In: Proceedings of NPAR '02. ACM; 2002. p....
  • Battiato S, Di Blasi G, Farinella GM, Gallo G. Digital mosaic frameworks – an overview. In: Comput graph forum, vol....
  • Deussen O, Isenberg T. Halftoning and stippling. In: Image and video-based artistic stylisation. Springer; 2013, p....
  • Cevahir A, Nukada A, Matsuoka S. Fast conjugate gradients with multiple GPUs. In: Proceedings of ICCS '09, 2009. p....
  • J. Bolz et al. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans Graph (2003)
  • Li B, Young AA, Cowan BR. GPU accelerated non-rigid registration for the evaluation of cardiac function. In:...
  • J.L. Morales et al. Remark on “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization”. ACM Trans Math Softw (2011)
  • G. Rong et al. GPU-assisted computation of centroidal Voronoi tessellation. IEEE Trans Vis Comput Graph (2011)
  • Averick BM, Carter RG, Moré JJ, Xue GL. The MINPACK-2 test problem collection. Technical Report MCS-P153-0692. Argonne...
  • C. Broyden et al. On the local and superlinear convergence of quasi-Newton methods. IMA J Appl Math (1973)
This article was recommended for publication by Shi-Min Hu.