Technical Section
Parallel L-BFGS-B algorithm on GPU☆
Introduction
Nonlinear energy minimization is at the core of many algorithms in graphics, engineering and scientific computing. Owing to their rapid convergence and moderate memory requirements for large-scale problems [1], the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm and its variant, the L-BFGS-B algorithm [2], [3], [4], are efficient alternatives to other frequently used energy minimization algorithms such as the conjugate gradient (CG) [5] and Levenberg–Marquardt (LM) [6] algorithms. Furthermore, L-BFGS-B is favored as the core of many state-of-the-art algorithms in graphics, such as the computation of centroidal Voronoi tessellation (CVT) [7], mean-shift image segmentation [8], medical image registration [9], face tracking for animation [10], and the composition of vector textures [11]. Among these applications, the computation of CVT is itself the basis of numerous applications in graphics, including flow visualization [12], image compression or segmentation [13], [14], [15], surface remeshing [16], [17], [18], object distribution [19], and stylized rendering [20], [21], [22]. Hence, a high-performance L-BFGS-B solver is desirable for the graphics community because of its wide range of applications.
L-BFGS-B is an iterative algorithm. After being initialized with a starting point and boundary constraints, it iterates through five phases: (1) gradient projection; (2) generalized Cauchy point calculation; (3) subspace minimization; (4) line searching; and (5) limited-memory Hessian approximation. Recently, there has been a trend towards using parallel hardware such as the GPU to accelerate energy minimization algorithms. Successful examples, including the GPU-based CG [23], [24] and GPU-based LM [25], have demonstrated the clear advantages of parallelization. However, parallelizing L-BFGS-B is challenging because some key steps have strong sequential dependencies, namely (2) generalized Cauchy point calculation, (3) subspace minimization, and (4) line searching. In this paper, we tackle this problem and make the following contributions:
- We approximate the generalized Cauchy point with much less calculation while maintaining a similar rate of convergence. By doing so, we remove the dependency in the computation and make the algorithm suitable for parallel implementation on the GPU.
- We propose several new GPU-friendly expressions to compute the maximal possible step-length for backtracking and line searching, so that it can be calculated with parallel reduction.
- We demonstrate the speedup of L-BFGS-B enabled by our parallel implementation with extensive testing, and present example applications that solve some typical nonlinear optimization problems in both graphics and scientific computing.
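To illustrate why a maximal step-length fits a parallel reduction, the following is a minimal sketch in plain Python. It is not the set of expressions proposed in the paper; it uses the textbook bound for a box-constrained step, and the function names `coordinate_step_limit` and `max_feasible_step` are hypothetical. Each coordinate yields its own limit independently (a map), and the overall limit is the minimum of those values (an associative reduction, which is what a GPU reduction would compute).

```python
import math
from functools import reduce


def coordinate_step_limit(x_i, d_i, l_i, u_i):
    """Largest alpha >= 0 with l_i <= x_i + alpha * d_i <= u_i for one coordinate.

    Assumes x_i is already feasible, i.e. l_i <= x_i <= u_i.
    """
    if d_i > 0.0:
        return (u_i - x_i) / d_i  # moving towards the upper bound
    if d_i < 0.0:
        return (l_i - x_i) / d_i  # moving towards the lower bound
    return math.inf  # this coordinate does not move, so it imposes no limit


def max_feasible_step(x, d, lower, upper):
    """Maximal step along d that keeps x + alpha * d inside the box [lower, upper]."""
    # Map phase: every coordinate is handled independently of the others.
    limits = map(coordinate_step_limit, x, d, lower, upper)
    # Reduce phase: min is associative, so the order of combination is free.
    return reduce(min, limits, math.inf)


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0]
    d = [1.0, -2.0, 0.0]
    lower = [-1.0, -1.0, -1.0]
    upper = [4.0, 3.0, 5.0]
    print(max_feasible_step(x, d, lower, upper))
```

Because `min` is associative and commutative, the same result is obtained regardless of how the per-coordinate limits are grouped, which is the property a tree-shaped GPU reduction relies on.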
In the remainder of this paper, we first briefly review the BFGS family and optimization algorithms on the GPU in Section 2. Next, we review the L-BFGS-B algorithm in Section 3, and introduce our adaptation on the GPU in Section 4. Experimental results are given in Section 5, comparing our implementation with the latest L-BFGS-B implementation on the CPU [26] using two examples from different fields: the centroidal Voronoi tessellation (CVT) problem [7], [27] in graphics, and, for generality, the Elastic–Plastic Torsion problem from the classical MINPACK-2 test problem set [28] in scientific computing. Finally, Section 6 discusses the limitations of our GPU implementation and Section 7 concludes the paper with possible future work. Our prototype is open source and can be freely downloaded from Google Code (http://code.google.com/p/lbfgsb-on-gpu/).
Section snippets
Related work
We briefly review previous work on the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and its extensions, as well as previous work on GPU-based nonlinear optimization.
Algorithm
The L-BFGS-B algorithm was introduced by Byrd et al. [3]. We follow the notation of their paper to briefly introduce the algorithm in this section.
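The L-BFGS-B algorithm is an iterative algorithm that minimizes an objective function $f(x)$, $f:\mathbb{R}^n \to \mathbb{R}$, subject to some boundary constraints $l \le x \le u$, where $x, l, u \in \mathbb{R}^n$. In the $k$-th iteration, the objective function is approximated by a quadratic model at a point $x_k$:

$$m_k(x) = f(x_k) + g_k^T (x - x_k) + \frac{1}{2} (x - x_k)^T B_k (x - x_k),$$

where $g_k$ is the gradient at point $x_k$ and $B_k$ is the limited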
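As a small worked illustration of the quadratic model, the sketch below evaluates it in plain Python, assuming the usual form from Byrd et al. [3], m_k(x) = f(x_k) + g_k^T (x - x_k) + 1/2 (x - x_k)^T B_k (x - x_k). The function name `quadratic_model` is hypothetical, and the Hessian approximation is passed as a small dense matrix purely for readability; an actual L-BFGS-B code would not form it explicitly.

```python
def quadratic_model(f_k, g_k, B_k, x_k, x):
    """Evaluate m_k(x) = f_k + g_k^T s + 0.5 * s^T B_k s with s = x - x_k.

    B_k is a dense list of rows here (an assumption made for illustration only).
    """
    s = [xi - xki for xi, xki in zip(x, x_k)]
    linear = sum(gi * si for gi, si in zip(g_k, s))
    # Matrix-vector product B_k s, one dot product per row.
    Bs = [sum(bij * sj for bij, sj in zip(row, s)) for row in B_k]
    quadratic = 0.5 * sum(si * bsi for si, bsi in zip(s, Bs))
    return f_k + linear + quadratic


if __name__ == "__main__":
    # f(x) = x0^2 + 2 * x1^2 expanded at x_k = (1, 1):
    # f = 3, gradient = (2, 4), Hessian = diag(2, 4).
    value = quadratic_model(3.0, [2.0, 4.0], [[2.0, 0.0], [0.0, 4.0]],
                            [1.0, 1.0], [0.0, 0.0])
    print(value)
```

Since the example objective is itself quadratic and the exact Hessian is supplied, the model reproduces the objective exactly; with a limited-memory approximation of the Hessian it would only agree near the expansion point.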
Our modifications
In the following, we explain our modifications for finding the generalized Cauchy point and for subspace minimization, which make the L-BFGS-B algorithm suitable for current GPU architectures.
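For context, the sketch below shows the standard ingredients of the generalized Cauchy point search as defined by Byrd et al. [3], not the approximation introduced in this paper. The projected-gradient path is x(t) = P(x_k - t g_k, l, u), and each coordinate has a breakpoint t_i at which it reaches a bound. The function names `project`, `breakpoints` and `path_point` are hypothetical.

```python
import math


def project(x, lower, upper):
    """Clamp every coordinate of x into the box [lower, upper]."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, lower, upper)]


def breakpoints(x, g, lower, upper):
    """Value of t at which each coordinate of x - t * g reaches one of its bounds."""
    t = []
    for xi, gi, li, ui in zip(x, g, lower, upper):
        if gi < 0.0:
            t.append((xi - ui) / gi)  # coordinate increases towards the upper bound
        elif gi > 0.0:
            t.append((xi - li) / gi)  # coordinate decreases towards the lower bound
        else:
            t.append(math.inf)  # coordinate never moves
    return t


def path_point(x, g, lower, upper, t):
    """Point on the piecewise-linear path x(t) = P(x - t * g, lower, upper)."""
    return project([xi - t * gi for xi, gi in zip(x, g)], lower, upper)


if __name__ == "__main__":
    x = [1.0, 2.0, 0.5]
    g = [2.0, -1.0, 0.0]
    lower = [0.0, 0.0, 0.0]
    upper = [3.0, 3.0, 3.0]
    print(breakpoints(x, g, lower, upper))
    print(path_point(x, g, lower, upper, 0.75))
```

Computing the breakpoints is independent per coordinate, but the standard algorithm then sorts them and examines the path segments one after another, which is the kind of sequential dependency that hinders a direct GPU port.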
Applications
We compare the efficacy and robustness of our GPU-based L-BFGS-B algorithm and the original CPU-based L-BFGS-B algorithm using two applications described below. All experiments were performed with an Intel Xeon W5590 at 3.33 GHz and an NVIDIA GTX 580 in double precision. The CUBLAS Library and the Thrust Library used are included in CUDA Toolkit version 4.2.
Limitations
Currently, the performance of our method is limited by the memory bandwidth between the global video memory and the on-chip memory (shared memory, registers, etc.). We have also tested our implementation on a Tesla C2050. Although the Tesla C2050 has a much higher peak performance in double precision (515 GFlops) than the GTX 580 (193 GFlops), its performance when running our GPU L-BFGS-B algorithm is lower. More specifically, the ratio of the performance of the two cards is
Conclusion and future work
In this paper, we presented the first parallel implementation of the L-BFGS-B algorithm on the GPU. Our experiments show that our approach makes the L-BFGS-B algorithm GPU-friendly and easily parallelized, so the time spent on solving large-scale optimizations is radically reduced. Future work includes breaking the bottleneck of memory bandwidth and exploring the parallelism of L-BFGS-B on multiple GPUs or even clusters for problems of larger scales.
Acknowledgments
This project was supported by the National Basic Research Program of China (2011CB302400, 2010CB328001), the Research Grant Council of Hong Kong (718209, 718010), the National Science Foundation of China (60933008, 61373071), and the National High-tech R&D Program (2012AA040902). The authors would like to thank Ciyou Zhu, Sylvain Henry, and Brett M. Averick for sharing their code.
References (60)
A numerical study of limited memory BFGS methods
Appl Math Lett
(2002)
- ALGLIB Project. Unconstrained optimization: L-BFGS and CG. 2013....
- et al.
On the limited memory BFGS method for large scale optimization
Math Program
(1989)
- et al.
A limited memory algorithm for bound constrained optimization
SIAM J Sci Comput
(1995)
- et al.
Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization
ACM Trans Math Softw
(1997)
- Hestenes MR, Stiefel E. Methods of conjugate gradients for solving linear systems;...
An algorithm for least-squares estimation of nonlinear parameters
SIAM J Soc Ind Appl Math
(1963)
- et al.
On centroidal Voronoi tessellation—energy smoothness and fast computation
ACM Trans Graph
(2009)
- Yang C, Duraiswami R, DeMenthon D, Davis L. Mean-shift analysis using quasi-Newton methods. In: Proceedings of ICIP...
- Chen YW, Xu R, Tang SY, Morikawa S, Kurumi Y. Non-rigid MR-CT image registration for MR-guided liver cancer surgery....
Vector solid textures
ACM Trans Graph
Centroidal Voronoi tessellations: applications and algorithms
SIAM Rev
Centroidal Voronoi tessellation algorithms for image compression, segmentation, and multichannel restoration
J Math Imaging Vis
An edge-weighted centroidal Voronoi tessellation model for image segmentation
IEEE Trans Image Process
Anisotropic centroidal Voronoi tessellations and their applications
SIAM J Sci Comput
centroidal Voronoi tessellation and its applications
ACM Trans Graph
Beyond stippling methods for distributing objects on the plane
Comput Graph Forum
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
ACM Trans Graph
Remark on “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization”
ACM Trans Math Softw
GPU-assisted computation of centroidal Voronoi tessellation
IEEE Trans Vis Comput Graph
On the local and superlinear convergence of quasi-Newton methods
IMA J Appl Math
☆ This article was recommended for publication by Shi-Min Hu.