A method for efficient radio astronomical data gridding on multi-core vector processor

doi:10.1016/j.parco.2022.102972

Parallel Computing

Volume 113, October 2022, 102972

https://doi.org/10.1016/j.parco.2022.102972

Abstract

Gridding is the performance-critical step in the data reduction pipeline for radio astronomy research, allowing astronomers to create correct sky images for further analysis. Like 2D stencil computation, gridding iteratively updates the output cells by convolution, where the value at each output cell is computed as a weighted sum of neighboring point values. Existing state-of-the-art works have improved gridding performance on multi-core CPUs and GPUs in real-world applications, and they show that gridding is a type of scientific computation with high-density computing characteristics. However, low computational performance or high power consumption remains the main limitation when they process large-scale astronomical data. The high-density computing feature of gridding provides opportunities to accelerate it on multi-core vector processors with vector-SIMD architectures. However, the task parallelization and data transfer strategies of existing works (such as those implemented on CPUs or GPUs) are inefficient when gridding is performed directly on the vector processor without a dedicated mapping algorithm.

M-DSP is a multi-core vector processor with vector-SIMD architectures designed for the next-generation exascale supercomputer, delivering high performance with ultra-low power consumption. In this paper, we present, for the first time, a novel method to achieve efficient gridding on the M-DSP. Specifically, we propose a gridding workflow designed for the vector-SIMD architectures and present a vectorized version of the gridding convolution algorithm to fully exploit the computational power of the M-DSP. In addition, centering on the processor architecture, we propose task-based parallelization strategies for block and line computing as well as different data loading strategies to achieve high parallel performance and high data transfer efficiency. Experimental results show that our work on M-DSP exhibits very competitive performance compared to other methods running on CPUs or GPUs. This demonstrates the efficiency of our method and shows that the vector-SIMD architecture is beneficial for scientific computing with “high-density” characteristics, which can exploit its wide vector core and achieve higher performance than its competitors.

Introduction

The challenge in fully exploiting the potential of existing and upcoming scientific instruments, such as large single-dish radio telescopes, is to process the large amount of collected data effectively and efficiently. For instance, the Five-hundred-meter Aperture Spherical Telescope (FAST) [1], [2], [3], [4], the world’s largest single-dish radio telescope, has been in operation since 2020. FAST generates a massive volume of radio astronomical data at a rate of 10–20 PB per year [3]. To obtain correct sky images from such data, gridding, which maps non-uniform data samples onto a uniformly distributed grid for further analysis, is one of the most computationally intensive and time-consuming steps in the data reduction pipeline [5], [6], [7]. Like the 2D stencil computation with the “Moore neighborhood pattern” [8], the gridding kernel iteratively updates the output cells by convolution, where the value at each output cell is computed as a weighted sum of neighboring point values. Such sums are most commonly expressed as gather or scatter operations [9]. In addition, existing work demonstrates that many more neighboring points are used in gridding than in stencil computation [10]. For example, in some gridding applications with high sampling density, the number of neighboring points may reach nearly 90,000 [11]. That is, gridding is a type of scientific computing with high-density computing characteristics.
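The difference between the gather and scatter formulations of the weighted sum can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the Gaussian kernel, the normalization by the sum of weights, and the names `kernel_weight`, `grid_gather`, and `grid_scatter` are assumptions made only for illustration.

```python
import math


def kernel_weight(dist2, sigma):
    """Gaussian weight for a squared distance (kernel choice is an assumption)."""
    return math.exp(-dist2 / (2.0 * sigma * sigma))


def grid_gather(samples, cells, radius, sigma):
    """Gather: loop over output cells, pull in every sample within `radius`."""
    r2 = radius * radius
    image = []
    for cx, cy in cells:
        vsum = wsum = 0.0
        for sx, sy, value in samples:
            d2 = (sx - cx) ** 2 + (sy - cy) ** 2
            if d2 <= r2:
                w = kernel_weight(d2, sigma)
                vsum += w * value
                wsum += w
        # Normalizing by the weight sum is an assumption of this sketch.
        image.append(vsum / wsum if wsum > 0.0 else math.nan)
    return image


def grid_scatter(samples, cells, radius, sigma):
    """Scatter: loop over samples, push each one onto every cell within `radius`."""
    r2 = radius * radius
    vsum = [0.0] * len(cells)
    wsum = [0.0] * len(cells)
    for sx, sy, value in samples:
        for i, (cx, cy) in enumerate(cells):
            d2 = (sx - cx) ** 2 + (sy - cy) ** 2
            if d2 <= r2:
                w = kernel_weight(d2, sigma)
                vsum[i] += w * value
                wsum[i] += w
    return [v / w if w > 0.0 else math.nan for v, w in zip(vsum, wsum)]


# Hypothetical toy data: (x, y, value) samples and (x, y) output cells.
samples = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (5.0, 5.0, 9.0)]
cells = [(0.5, 0.0), (5.0, 5.0), (20.0, 20.0)]
gathered = grid_gather(samples, cells, radius=1.0, sigma=0.5)
scattered = grid_scatter(samples, cells, radius=1.0, sigma=0.5)
```

Both functions produce the same image. In the gather form each output cell is written by exactly one loop iteration, so cells can be updated independently, whereas the scatter form has several samples accumulating into the same cell.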

In recent years, many hardware processors have been utilized to accelerate astronomical data gridding in the high-performance computing (HPC) field. Cygrid [10] is one of the most popular and effective gridding methods. It accelerates gridding using multi-core CPU parallelization and has been applied to the Effelsberg-Bonn HI Survey and the Galactic All-Sky Survey [12]. However, the single-instruction, multiple-data (SIMD) computational characteristics of gridding make it more suitable for acceleration on processors with a large number of parallel processing cores. In [13], an FPGA is used to implement gridding with high energy efficiency. However, the limited hardware resources on the FPGA lead to poor computational performance, and a single acceleration circuit cannot support multiple convolutional computations simultaneously. The method in [6] and our previous work, HCGrid [11], speed up gridding through parallel computation on GPUs and achieve excellent computational performance in real-world applications. For high-density computational tasks, however, the computing power of a single computing unit of a GPU is weaker than that of a single core of the vector processor, so updating one target cell takes longer on GPUs [14]. In addition, the high power consumption of GPUs is also a concern.

As supercomputer accelerators, vector processors with vector-SIMD architectures, such as the Matrix-2000 and Matrix-2000+ processors deployed in the Tianhe-2A and Tianhe pre-exascale supercomputer systems, offer powerful computing performance while maintaining low power consumption [15]. M-DSP is a novel high-performance floating-point multi-core vector processor with the vector-SIMD architecture, developed for high-density computation in the next-generation exascale supercomputer by the National University of Defense Technology [14], [16], [17]. Considering both its computing characteristics and the hardware architecture, the high-density computing nature of gridding provides the potential to accelerate it on the vector processor with vector-SIMD architectures. According to our analysis, we have two observations about the current gridding methods implemented on other architectures (deployed on CPUs or GPUs). First, the task parallelization strategies of current methods cannot fully utilize the computational resources of vector accelerators. For example, the current gridding methods implemented on GPUs group scalar cores into groups (thread warps) with a specific on-chip data layout to execute the parallel task, which is challenging to match to the computational resources of the vector processor. Second, current gridding methods do not consider how data is transferred between computing cores and memory, which incurs substantial I/O overhead and additional computational overhead. Furthermore, this simple data transfer strategy also performs inefficiently on the vector processor.

To this end, we present the first effort in the field of HPC to achieve efficient radio astronomical data gridding on the multi-core vector processor M-DSP. The main contributions are as follows:

(1) We design the gridding workflow on M-DSP and present a vectorized version of the gridding convolution algorithm in gather computational pattern to fully exploit the computational power of the processor.
(2) We propose multiple task-based parallelization strategies, including block computing and line computing, to improve data reuse and parallel execution performance.
(3) We present and explore different data loading strategies, including ring loading strategy and computation-transfer overlapping strategy, to improve data transfer efficiency on the vector processor.
(4) We propose principles as guidelines for designing efficient parallel algorithms on M-DSP to fully utilize its computational power.
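
As a rough illustration of how block-wise task partitioning can improve data reuse, the following Python sketch tiles the output grid into blocks and lets every cell of a block reuse one pre-selected set of nearby samples. It is a sketch under stated assumptions, not the M-DSP implementation: the integer cell coordinates, the Gaussian kernel, the halo selection rule, and the names `make_blocks`, `grid_block`, and `grid_image` are all hypothetical.

```python
import math


def make_blocks(nx, ny, bx, by):
    """Tile an nx-by-ny grid into (x0, x1, y0, y1) blocks with half-open ranges."""
    return [(x0, min(x0 + bx, nx), y0, min(y0 + by, ny))
            for y0 in range(0, ny, by)
            for x0 in range(0, nx, bx)]


def grid_block(samples, block, radius, sigma):
    """Gather-style gridding of one block; one block is one independent task."""
    x0, x1, y0, y1 = block
    r2 = radius * radius
    # Select once the samples that can reach this block (block plus halo);
    # every cell in the block then reuses this local working set.
    local = [(sx, sy, v) for sx, sy, v in samples
             if x0 - radius <= sx <= x1 - 1 + radius
             and y0 - radius <= sy <= y1 - 1 + radius]
    out = {}
    for cy in range(y0, y1):
        for cx in range(x0, x1):
            vsum = wsum = 0.0
            for sx, sy, v in local:
                d2 = (sx - cx) ** 2 + (sy - cy) ** 2
                if d2 <= r2:
                    w = math.exp(-d2 / (2.0 * sigma * sigma))
                    vsum += w * v
                    wsum += w
            # Empty cells are set to 0.0 here purely for simplicity.
            out[(cx, cy)] = vsum / wsum if wsum > 0.0 else 0.0
    return out


def grid_image(samples, nx, ny, bx, by, radius, sigma):
    """Process all blocks in turn; on real hardware each could go to one core."""
    image = {}
    for block in make_blocks(nx, ny, bx, by):
        image.update(grid_block(samples, block, radius, sigma))
    return image


# Hypothetical toy data: (x, y, value) samples on a 5-by-3 output grid.
samples = [(0.2, 0.1, 1.0), (1.7, 1.4, 2.0), (3.1, 2.2, 4.0), (4.0, 0.0, 8.0)]
blocked = grid_image(samples, nx=5, ny=3, bx=2, by=2, radius=1.5, sigma=0.6)
whole = grid_image(samples, nx=5, ny=3, bx=5, by=3, radius=1.5, sigma=0.6)
```

The blocked result matches the single-block result because samples outside a block's halo cannot fall within the kernel radius of any of its cells. The block size trades the size of the local working set against the number of independent tasks.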

The rest of the paper is organized as follows. Section 2 discusses the necessary background, including the need for gridding, the gridding algorithm, and the architectural details of M-DSP. Section 3 explains the contributions of this paper in detail. Section 4 evaluates performance. Section 5 gives an overview of related work. Section 6 concludes.

Section snippets

The need of gridding

A radio telescope consists of an antenna and array receivers that detect radio signals from astronomical sources in the sky. Given the size of a large single-dish radio telescope, operating it in a “drift scan” mode is usually a feasible and near-optimum sky survey strategy. A “drift scan” means moving the telescope’s receivers to a target azimuth and then fixing the telescope. Since the Earth rotates once in 24 h, various celestial objects enter the receiver’s field of view

M-DSP workflow

Fig. 4 shows the schematic diagram of the process for creating the sky image. Before the gridding, the raw data points need to be pre-processed. The pre-processing step sorts the data points and builds a lookup table based on the HEALPix¹ ring [22] to enable efficient access to the

Experiments evaluation

In this section, we evaluate our methods in detail in terms of overall performance, parallelization performance, and the performance of the data loading strategy. In addition, we give some principles for parallel algorithm design on M-DSP based on the experimental analysis.

In our experiments, simulated datasets with different sampling densities are generated based on the relevant observational parameters of FAST in order to perform the most comprehensive analysis possible.

Related work

In recent years, using different processors to accelerate the gridding of radio astronomical data has become a trend in the high-performance computing field. Winkel et al. [10] proposed Cygrid, which is one of the most popular and effective gridding methods. Cygrid accelerates gridding through multi-core CPU parallelization and has been applied to the Effelsberg-Bonn HI Survey and the Galactic All-Sky Survey [12]. However, the single instruction, multiple streams of

Conclusions

Gridding is a performance-critical step in the data reduction pipeline for radio astronomy research, allowing astronomers to create the correct sky images for scientific analysis. This paper presents for the first time a method for efficient radio astronomical data gridding on a novel multi-core vector processor with the vector-SIMD architecture, M-DSP, designed for the next-generation exascale supercomputer. In addition, we explore the task parallelization and data transfer strategies on

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.parco.2022.102972.

Acknowledgments

This work is sponsored by the Joint Research Fund in Astronomy (grant No. U1931130) under a cooperative agreement between the National Natural Science Foundation of China (NSFC) and the Chinese Academy of Sciences, by NSFC grant No. 11903056, and by the National Natural Science Foundation of China (grant No. 61972277).

References (27)

Träff J.L. et al., MPI collective communication through a single set of interfaces: A case for orthogonality, Parallel Comput. (2021)

Liu Z. et al., Optimizing convolutional neural networks on multi-core vector accelerator, Parallel Comput. (2022)

Cárcamo M. et al., Multi-GPU maximum entropy image synthesis for radio astronomy, Astron. Comput. (2018)

Merry B., Faster GPU-based convolutional gridding via thread coarsening, Astron. Comput. (2016)

Zhu Y. et al., Processing data of correlation on GPU

Bigot-Sazy M.-A. et al., HI intensity mapping with FAST (2015)

Dunning A. et al., Design and laboratory testing of the five hundred meter aperture spherical telescope (FAST) 19 beam L-band receiver

Li D. et al., FAST in space: considerations for a multibeam, multipurpose survey using China’s 500-m aperture spherical radio telescope (FAST), IEEE Microw. Mag. (2018)

Yue Y. et al., FAST low frequency pulsar survey

Griffin A. et al., End-to-end modelling of the imaging pipeline in radio astronomy

Veenboer B. et al., Image-domain gridding on graphics processors

Wang R. et al., Processing full-scale square kilometre array data on the summit supercomputer

T. Zhao, P. Basu, S. Williams, M. Hall, H. Johansen, Exploiting reuse and vectorization in blocked stencil computations...
