Abstract
In this work, we present a tool for solving large scattering problems with several acoustic source configurations. These problems entail a large matrix multiplication in which the matrices must be generated on demand, so that the problems can be solved on systems with less memory than would be needed to store the full matrices. We have analysed and developed several versions: one based on multiple matrix-vector products, two approaches built on tiled matrix multiplication, and one heterogeneous implementation that uses a GPU and a Xeon Phi simultaneously. We have tested these implementations on different devices: multicore CPUs, a Xeon Phi accelerator, and a Tesla GPU. Compared with our initial work, the peak speedup of the new solutions is \(25\times\) for CPU, \(17\times\) for Phi, \(20\times\) for GPU, and \(20\times\) for the heterogeneous GPU + Phi implementation. Finally, the tool can be adapted to other fields whenever the problem requires a large matrix multiplication whose elements must be generated on demand (e.g. the inverse scattering problem in electromagnetics).
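To illustrate the idea of a tiled matrix multiplication with on-demand generation, the following minimal Python sketch computes a product while holding only one tile of the generated matrix in memory at a time. It is not the authors' implementation. The function names, the tile size, and the placeholder `entry` kernel are hypothetical, and the sketch assumes that only the left matrix is generated on demand while the right matrix stores one column per source configuration.

```python
import cmath


def entry(i, j):
    """Hypothetical generator for element (i, j) of the on-demand matrix."""
    # Placeholder kernel only; NOT the acoustic interaction term of the paper.
    return cmath.exp(1j * 0.1 * (i - j)) / (1.0 + abs(i - j))


def tiled_product(n_rows, n_inner, rhs, tile=4):
    """Compute C = A @ rhs, generating A one tile at a time.

    rhs is a list of n_inner rows with one column per source configuration
    (an assumption of this sketch). At most a tile x tile block of A exists
    in memory at any moment.
    """
    n_cols = len(rhs[0])
    c = [[0j] * n_cols for _ in range(n_rows)]
    for i0 in range(0, n_rows, tile):
        i1 = min(i0 + tile, n_rows)
        for k0 in range(0, n_inner, tile):
            k1 = min(k0 + tile, n_inner)
            # Generate the current tile of A on demand.
            a_tile = [[entry(i, k) for k in range(k0, k1)]
                      for i in range(i0, i1)]
            # Accumulate the tile's contribution for every right-hand side.
            for di, row in enumerate(a_tile):
                out = c[i0 + di]
                for dk, a in enumerate(row):
                    b_row = rhs[k0 + dk]
                    for j in range(n_cols):
                        out[j] += a * b_row[j]
    return c


def dense_product(n_rows, n_inner, rhs):
    """Reference version: build all of A first (needs full storage)."""
    a = [[entry(i, k) for k in range(n_inner)] for i in range(n_rows)]
    return [[sum(a[i][k] * rhs[k][j] for k in range(n_inner))
             for j in range(len(rhs[0]))] for i in range(n_rows)]
```

The reference function `dense_product` is included only to check the tiled result on small inputs. In a performance-oriented version, the innermost loops would typically be replaced by a call to an optimized matrix multiplication routine applied to each tile.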




Notes
8 cores at 2.0 GHz (Hyper-Threading and Turbo Boost disabled).
2,496 CUDA cores at 706 MHz and 5 GB of device memory.
60 cores at 1.053 GHz (4 threads/core) and 8 GB of RAM.
Acknowledgments
This work has been partially supported by the “Ministerio de Economía y Competitividad” of Spain / FEDER under grants TEC2012-38142-C04-04 and TEC2015-67387-C4-3-R; and by the “Gobierno del Principado de Asturias” / FEDER under project FC-15-GRUPIN14-114.
About this article
Cite this article
López-Portugués, M., López-Fernández, J., Ranilla, J. et al. Using heterogeneous computing for scattering prediction in scenarios with several source configurations. J Supercomput 73, 57–74 (2017). https://doi.org/10.1007/s11227-015-1618-2
DOI: https://doi.org/10.1007/s11227-015-1618-2