Optimization of multi-class 0/1 knapsack problem on GPUs by improving memory access efficiency

Huang, En-Ming; Chou, Jerry

doi:10.1007/s11227-022-04425-3

Published: 22 March 2022

Volume 78, pages 13653–13679, (2022)

The Journal of Supercomputing


Abstract

This work aims to improve GPU performance for solving the 0/1 knapsack problem, a well-known combinatorial optimization problem that arises in many practical applications, including cryptography, financial decision-making, electronic design automation, and computing resource management. The knapsack problem is NP-hard, but it can be solved by dynamic programming (DP) algorithms in pseudo-polynomial time. DP knapsack algorithms for GPUs have been presented previously. However, because modern GPU architectures provide much higher computing throughput than memory bandwidth, previous work is bounded by data access time on GPU memory: its CGMA (Compute to Global Memory Access) ratio is 1, meaning that every computing operation involves one memory access on average. To address this problem, this paper proposes an innovative approach based on the Multi-Class 0/1 Knapsack Problem (MCKP), in which items can be classified into groups with equal values or weights. Reconstructing the DP equations for MCKP exposes data parallelism and data reuse across threads. This makes it possible to optimize the computation across iterations (i.e., items) and to improve the CGMA ratio 5-fold by keeping reused data in GPU shared memory and registers. We extensively analyze the performance of our approach on two modern GPU models, the NVIDIA Tesla V100 and the RTX 3070. Compared with the runtime of previous work, our approach achieves up to 8x and 18x speedup on the V100 and the RTX 3070, respectively, the latter being the GPU with lower memory bandwidth. Comparing the two speedups, we find that our approach uses computing resources more efficiently when memory bandwidth is limited, as on the RTX 3070.
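For readers unfamiliar with the baseline, the pseudo-polynomial DP mentioned in the abstract can be sketched as follows. This is a minimal sequential illustration of the textbook 0/1 knapsack recurrence, not the authors' GPU kernel or their MCKP reformulation; the function name `knapsack_01` and its parameters are hypothetical and chosen only for this example.

```python
def knapsack_01(weights, values, capacity):
    """Textbook 0/1 knapsack DP; runs in O(n * capacity) time.

    Illustrative sketch only: this is the classic sequential recurrence,
    not the GPU implementation described in the paper.
    """
    # best[c] holds the best total value achievable with capacity c
    # using only the items processed so far.
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Walk capacities downward so that best[c - w] still refers to the
        # state before the current item was considered (each item used once).
        for c in range(capacity, w - 1, -1):
            candidate = best[c - w] + v  # one table read per update
            if candidate > best[c]:
                best[c] = candidate
    return best[capacity]


if __name__ == "__main__":
    # Items (weight, value): (1, 1), (3, 4), (4, 5), (5, 7); capacity 7.
    print(knapsack_01([1, 3, 4, 5], [1, 4, 5, 7], 7))
```

Each update of `best[c]` performs a small amount of arithmetic against a read of `best[c - w]`. That pattern of roughly one table access per operation is the kind of balance the abstract describes with a CGMA ratio of 1, although how the table is laid out in GPU memory is not shown here.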

Acknowledgements

We thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources. We also thank Prof. Ing-Jer Huang of National Sun Yat-sen University for providing valuable insights and comments on our work.

Author information

Authors and Affiliations

Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
En-Ming Huang & Jerry Chou

Corresponding author

Correspondence to Jerry Chou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Huang, EM., Chou, J. Optimization of multi-class 0/1 knapsack problem on GPUs by improving memory access efficiency. J Supercomput 78, 13653–13679 (2022). https://doi.org/10.1007/s11227-022-04425-3

Accepted: 22 February 2022
Published: 22 March 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s11227-022-04425-3
