Abstract
Heterogeneous processors integrate distinct compute resources, such as CPUs and GPUs, on a single chip, allowing applications to exploit the strengths of each compute unit while avoiding its weaknesses. In this work, we evaluate and analyze eight sparse matrix and graph kernels on an AMD CPU–GPU heterogeneous processor using 956 sparse matrices. Our evaluation focuses on five characteristics: load balancing, indirect addressing, memory reallocation, atomic operations, and dynamic characteristics. The experimental results show that, although the CPU and GPU parts access the same DRAM, they exhibit very different performance behaviors. For example, although the GPU part generally outperforms the CPU part, the CPU part still delivers the best performance in some cases. Moreover, the bandwidth utilization of atomic operations on heterogeneous processors can be much higher than on a high-end discrete GPU.
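To make two of these characteristics concrete, the sketch below is a minimal sequential CSR sparse matrix–vector multiplication (SpMV) in C with made-up data. It is an illustration of the access patterns under study, not the benchmark code evaluated in the paper: the gather through col_idx is the indirect-addressing pattern, and the varying number of nonzeros per row is the source of load imbalance when rows are mapped to parallel compute units.

```c
#include <stdio.h>

/* Minimal CSR SpMV sketch (hypothetical data, not the paper's benchmark code).
   The inner loop reads x[col_idx[j]] -- the indirect-addressing pattern the
   paper studies -- and row lengths vary, which causes load imbalance. */
static void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];  /* indirect load through col_idx */
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example matrix:
       [ 1 0 2 ]
       [ 0 3 0 ]
       [ 4 5 6 ]  rows hold 2, 1, 3 nonzeros: an (artificially) unbalanced load */
    int    row_ptr[] = {0, 2, 3, 6};
    int    col_idx[] = {0, 2, 1, 0, 1, 2};
    double val[]     = {1, 2, 3, 4, 5, 6};
    double x[]       = {1, 1, 1}, y[3];

    spmv_csr(3, row_ptr, col_idx, val, x, y);
    for (int i = 0; i < 3; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```

On a GPU, each row (or group of rows) would typically map to a thread or wavefront, so the row-length variance in the example translates directly into idle lanes; this is why load balancing is evaluated as a separate characteristic.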
Notes
Since an official Linux GPU driver for this integrated GPU is not yet available, all benchmarks were run on Microsoft Windows.
Acknowledgements
This work has been partly supported by the National Natural Science Foundation of China (Grant nos. 61732014, 61802412, 61671151), Beijing Natural Science Foundation (no. 4172031), and SenseTime Young Scholars Research Fund.
Cite this article
Zhang, F., Liu, W., Feng, N. et al. Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors. CCF Trans. HPC 1, 131–143 (2019). https://doi.org/10.1007/s42514-019-00008-6