Abstract
This work aims to improve GPU performance for solving the 0/1 knapsack problem, a well-known combinatorial optimization problem that arises in many practical applications, including cryptography, financial decision-making, electronic design automation, and computing resource management. The knapsack problem is NP-hard, but it can be solved by dynamic programming (DP) algorithms in pseudo-polynomial time. DP knapsack algorithms for GPUs have been presented before. However, because modern GPU architectures provide much higher computing throughput than memory bandwidth, previous work is bounded by data access time on GPU memory: its CGMA (Compute to Global Memory Access) ratio is 1, meaning that every computing operation involves one memory access on average. To address this problem, this paper proposes an approach for the Multi-Class 0/1 Knapsack Problem (MCKP), in which items can be classified into groups with equal values or weights. By reconstructing the DP equations for solving MCKP, we are able to exploit data parallelism and data reuse across threads. This makes it possible to optimize the computation across iterations (i.e., items) and to improve the CGMA ratio 5-fold by keeping reused data in GPU shared memory and registers. We extensively analyze the performance of our approach on two modern GPU models, the NVIDIA Tesla V100 and the RTX 3070. Compared to the runtime of previous work, our approach achieves speedups of up to 8x on the V100 and up to 18x on the RTX 3070, the latter being the GPU with lower memory bandwidth. Comparing the two speedups, we find that our approach uses compute resources more efficiently when memory bandwidth is limited, as on the RTX 3070.
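As a point of reference for the pseudo-polynomial DP mentioned above, the following is a minimal sketch of the classical 0/1 knapsack recurrence on a CPU. It is not the MCKP method or GPU kernel of this paper; the function name `knapsack_01` and the sample data are assumptions chosen for illustration.

```python
def knapsack_01(weights, values, capacity):
    """Return the best total value for a 0/1 knapsack (classical DP sketch).

    Illustrative only: a textbook recurrence, not the paper's MCKP formulation.
    Runs in O(n * capacity) time, i.e. pseudo-polynomial in the capacity.
    """
    # best[c] = best value achievable with capacity c using the items seen so far
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Sweep capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            # One addition and one comparison per cell, against reads of
            # best[c] and best[c - w] from the previous iteration.
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


if __name__ == "__main__":
    # Hypothetical instance: the best choice is the items of weight 3 and 4.
    print(knapsack_01([1, 3, 4, 5], [1, 4, 5, 7], 7))  # prints 9
```

Each cell update does only a small, constant amount of arithmetic on values read from the previous iteration, which is consistent with the low compute-to-memory-access ratio described above for a straightforward GPU mapping of this recurrence.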







Acknowledgements
We thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources. We also thank Prof. Ing-Jer Huang of National Sun Yat-sen University for providing valuable insights and comments on our work.
Cite this article
Huang, EM., Chou, J. Optimization of multi-class 0/1 knapsack problem on GPUs by improving memory access efficiency. J Supercomput 78, 13653–13679 (2022). https://doi.org/10.1007/s11227-022-04425-3
DOI: https://doi.org/10.1007/s11227-022-04425-3