Indexing data cubes for content-based searches in radio astronomy
Introduction
Modern astronomy is characterized by the accumulation of novel methods for observing astronomical objects. Many of these have shifted toward recording emission in millimeter-scale wavelength ranges, observations that enable the determination of the chemical composition of detected astronomical objects. Interest in detecting substances such as CO in galactic or extragalactic observations has spurred the development of numerous projects, such as the Very Long Baseline Array (VLBA) (Very Long Baseline Array, VLBA, 2016) and the Atacama Large Millimeter/submillimeter Array (ALMA) (Atacama Large mm/submm Array, 2016), which are capable of recording the frequencies and velocities of emission lines from objects at astronomical distances, producing a growing data archive for astronomy research. Furthermore, it is projected that when the Square Kilometre Array (SKA) (Square Kilometre Array, SKA, 2016) begins operation in 2020, more than 60 PB of archived data will be accessible to astronomers (Berriman and Groom, 2011). This will produce a large volume of daily data traffic, creating a need for data processing methods capable of contending with such large data sets. For example, it is estimated that ALMA, running at full capacity, will generate more than 750 GB of data every day (approximately 250 TB a year) (Atacama Large mm/submm Array, 2016). The astronomical community will therefore require high-speed data transmission systems to archive the data of interest and analyze them to extract relevant information. Because of the enormous volume of data that will be generated, it will be impractical to perform analytical procedures on the entire data set. The need for more and better automatic detection, indexing, recording, and cataloging methods is thus a key factor in the continued growth of astronomy in the 21st century.
The ultimate objective of developing automatic detection, indexing, recording, and cataloging systems is to provide integrated systems with virtual search software, also known as virtual observatories (Araya et al., 2015). Virtual observatory development initiatives are coordinated through the International Virtual Observatory Alliance (IVOA) (International Virtual Observatory Alliance, 2016), which catalogs virtual observatories around the world that provide access to and search methods for extensive collections of astronomical objects. In particular, the emergence of the Chilean Virtual Observatory (ChiVO) (Chilean Virtual Observatory, 2015) and its incorporation into the IVOA has spurred the development of a system for detecting and recording regions of interest (ROIs) in radio astronomy data, which allows for content-based searches as part of ChiVO.
In this article, we present the methods and techniques used to develop the data cube indexing system for content-based searches as part of ChiVO. The indexing system was designed under efficiency constraints and therefore incorporates computationally lightweight processes capable of single-pass data processing (thus facilitating its implementation) and of handling large amounts of data, while significantly reducing the cost of content-based searches. Our goal is to build an effective indexing system that absorbs computational costs at the indexing step, reducing the time involved in data recovery. Our experiments show that the system is capable of creating an index at a 33:1 compression ratio. The index allows spatial queries to be processed in fractions of a second, showing that the indexing step acts as a shock absorber for the computational time involved in data processing. The key factor of our system is thus the index creation process, where a number of design decisions are made to address the tradeoff between the quality of the approximation and the computational time involved in data processing.
The article documents, step by step, the methods used for data processing; the strategies used for signal/noise processing; the techniques used for spectrogram processing and for obtaining summary spectrograms that enable the determination of velocity ranges of interest; the methods used for stacking data slices within ranges of interest, over which morphological structuration processes are applied; and the methods that assist in identifying the localization parameters of objects.
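As a rough illustration of how these steps chain together, the following Python sketch runs a toy version of the pipeline on a numpy cube. The function name, the per-channel peak summary, and the fixed thresholds are assumptions made for illustration only, not the system's actual statistics or parameters.

```python
import numpy as np
from scipy import ndimage

def index_cube(cube, spec_threshold=0.5, spatial_threshold=0.5):
    """Toy indexing pipeline over a (channels, y, x) data cube."""
    # 1. Summary spectrum along the velocity axis (here: per-channel peak).
    summary = cube.max(axis=(1, 2))
    # 2. Velocity ranges of interest: runs of channels above a fraction of the peak.
    mask = summary > spec_threshold * summary.max()
    labels, n_ranges = ndimage.label(mask)
    records = []
    for i in range(1, n_ranges + 1):
        channels = np.where(labels == i)[0]
        # 3. Stack the slices in the range into a position-position (PP) projection.
        pp = cube[channels].sum(axis=0)
        # 4. Morphological structuration: threshold then binary opening to clean noise.
        binary = ndimage.binary_opening(pp > spatial_threshold * pp.max())
        # 5. Localize each connected component by its centroid.
        obj_labels, n_objs = ndimage.label(binary)
        centroids = ndimage.center_of_mass(binary, obj_labels,
                                           list(range(1, n_objs + 1)))
        for cy, cx in centroids:
            records.append({"v_range": (channels[0], channels[-1]),
                            "centroid": (cy, cx)})
    return records
```

Running this on a synthetic cube containing a single compact emission region yields one record with the velocity range and spatial centroid of the source.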
The contributions of this article include providing a detailed description of the data indexing system to serve as a potential model for future work in the field of data cube indexing. The article also presents solutions to the problem of large-scale data processing through the use of low-cost computational operations, thus addressing the requirement for online processing under high-demand conditions. This article is specifically directed to the community of astroinformatics software developers, but the concepts presented herein are also generally useful for all types of software development with a need to solve problems of large-scale data processing and indexing, particularly for multiway data.
This article is organized into the following sections. Section 2 presents a review of related work. Section 3 discusses the concepts of data cubes, morphological structuration, and shape detection. Section 4 presents the general architecture of the spectrographic cube indexing system. Sections 5 and 6 discuss the system components related to spectrographic processing and to ROI detection and indexing, respectively. Section 7 presents a comprehensive set of experiments to validate our proposal. Conclusions are presented in Section 8.
Related work
The field of radio astronomy software development has been quite active in the past few decades. Current software has been developed primarily for the manipulation, visualization, and post-processing analysis of data; this is significantly different from our system, which was designed for the online indexing and recording of data cubes. Consequently, the former systems, designed for radio astronomer end users, are restricted not by the computational costs of the methods used but rather by the
Data cubes
Radio observatories provide data in the form of three-dimensional images called data cubes for many types of radio observations, such as single-dish observations, point-and-shoot mappings, and interferometry observations. In hyperspectral data cubes, each element of the cube represents a point in space through two spatial coordinates (for example, galactic latitude and longitude) and a third entry that indicates the spectrographic wavelength or velocity
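This layout can be illustrated with a small sketch. The axis order (spectral channel first, then the two spatial axes) is an assumption for illustration; FITS stores the spatial axes first, so numerical libraries typically read a cube with the axes reversed.

```python
import numpy as np

# A toy data cube: 100 spectral channels over a 64x64 spatial grid.
cube = np.zeros((100, 64, 64))

# The spectrum of a single sky position is a 1-D slice along the spectral axis ...
spectrum = cube[:, 32, 32]     # 100 values, one per channel
# ... while a single channel is a 2-D position-position (PP) image.
channel_map = cube[42]         # 64x64 image
```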
General system architecture
The system we present in this article takes spectrographic data cubes, obtained from radio astronomy observations, as inputs. The cubes are stored in FITS format, as described in Section 3.1. The system outputs indexed records to a database, pairing the coordinates of the detected objects with the processed cubes and thus facilitating coordinate (PP) and velocity field searches. The architectural diagram of the system is shown in Fig. 1.
As shown in Fig. 1, the system consists of
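The indexed output described above can be pictured as a relational table pairing detected objects with their source cubes. The schema, column names, and sample values below are hypothetical, chosen only to show why a content-based search over the index reduces to a cheap range query rather than a scan over the raw cubes.

```python
import sqlite3

# Hypothetical index schema: one row per detected object.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE roi_index (
        cube_file TEXT,   -- FITS cube the object was detected in
        ra        REAL,   -- spatial (PP) coordinates of the centroid
        dec       REAL,
        v_min     REAL,   -- velocity range of interest
        v_max     REAL
    )""")
conn.execute("INSERT INTO roi_index VALUES ('orion.fits', 83.82, -5.39, 2.0, 14.0)")

# A content-based search becomes a range query over the index.
hits = conn.execute(
    "SELECT cube_file FROM roi_index "
    "WHERE ra BETWEEN ? AND ? AND dec BETWEEN ? AND ? "
    "AND v_min <= ? AND v_max >= ?",
    (83.0, 84.0, -6.0, -5.0, 10.0, 10.0)).fetchall()
```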
Spectral processing
Here, we introduce the system element that produces a sketch of the spectrogram along the velocity axis. This sketch is meant to enable the detection of the components of highest energy in the spectrogram, thus allowing for the identification of velocity ranges with relevant information. Using these ranges, spatial projections (position–position, or PP, projections) are obtained from cube slice stacks. Each object detected in the PP projection is characterized in terms of its centroid and
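One way to picture such a sketch of the spectrogram is the following minimal example, which collapses the cube into a one-dimensional summary and extracts the above-threshold channel intervals. The per-channel RMS statistic and the sigma-clipping threshold are assumptions for illustration; the system's actual summary may differ.

```python
import numpy as np

def velocity_ranges(cube, k=3.0):
    """Collapse a (channels, y, x) cube into a 1-D summary and return
    the channel intervals whose energy stands above a noise threshold."""
    summary = np.sqrt((cube ** 2).mean(axis=(1, 2)))   # per-channel RMS energy
    threshold = summary.mean() + k * summary.std()     # simple sigma clipping
    above = summary > threshold
    # Group consecutive above-threshold channels into (start, end) intervals.
    ranges, start = [], None
    for ch, flag in enumerate(above):
        if flag and start is None:
            start = ch
        elif not flag and start is not None:
            ranges.append((start, ch - 1))
            start = None
    if start is not None:
        ranges.append((start, len(above) - 1))
    return ranges
```

On a noise cube with emission injected into channels 20-24, the function recovers that single interval.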
Detection and indexing of objects in ROIs
The detection and indexing of objects in PP slices makes use of binary images during the shape recognition step. We begin our description of this component of the system by reviewing a number of properties of the thresholding/opening operation. This is the first step in the detection and indexing of ROIs using the stacked PP projections obtained from the spectrographic processing phase.
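The effect of the thresholding/opening step can be illustrated with scipy's binary opening on a synthetic binary PP slice; the image, shapes, and sizes here are invented for the example.

```python
import numpy as np
from scipy import ndimage

# A binary PP slice: one extended 5x5 source plus scattered single-pixel noise.
img = np.zeros((20, 20), dtype=bool)
img[8:13, 8:13] = True            # extended object: survives opening
img[2, 2] = img[15, 4] = True     # isolated noise pixels: removed by opening

opened = ndimage.binary_opening(img)   # erosion followed by dilation

# Only the extended object remains; its centroid localizes it for the index.
labels, n = ndimage.label(opened)
centroid = ndimage.center_of_mass(opened)
```

Because opening is an erosion followed by a dilation, structures smaller than the structuring element vanish, while larger ones keep their location and approximate extent.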
Experiments
In this section, we report a comprehensive set of experiments validating our indexing method. We start by exploring the performance of the first component of our system, examining the impact of the sampling rate on spectral processing and velocity field extraction. In a second set of experiments, we explore the performance of the multiscale representation, illustrating its effect on data reduction, a key factor for an effective indexing method.
Here, we present the results of
Conclusions
In this article, we presented a system for data cube indexing. The implementation of this system in ChiVO will allow for content-based searches in cubes provided by the virtual observatory and will minimize the time involved in data recovery from large-scale projects such as ALMA. The methods we presented are simple, effective, and incur low computational costs, an essential requirement for processing the enormous volumes of data produced by observatories. The code used to implement the various
Reproducible research
We release our system for software development and research purposes under the GitHub open code platform. The system code can be cloned using git clone at:
The full system code is openly available. Basic instructions and commands are included, along with scripts that allow our results to be reproduced. The system is licensed under the GNU GENERAL PUBLIC LICENSE (GPL).
Acknowledgments
This research was made possible by CONICYT-Chile funding, specifically through the project FONDEF D11I1060 and the project ICHAA 79130008. Mr. Mendoza was supported by Basal Project FB-0821.
References
- Araya, M., et al., 2015. A brief survey on the virtual observatory. New Astron.
- Araya, M., Solar, M., Mardones, D., Hochfarber, T., 2014. Exorcising the ghost in the machine: synthetic spectral data...
- Atacama Large mm/submm Array, ALMA, 2016....
- ASYDO, Astronomical Synthetic Data Observations, 2015....
- Berkeley Illinois Maryland Association, BIMA, 2004....
- Berriman, G.B., Groom, S.L., 2011. How will astronomy archives survive the data tsunami? ACM Queue.
- Berry, D.S., 2007. CUPID: a clump identification and analysis package. Astronomical Data Analysis Software and Systems....
- Berry, D.S., 2015. FellWalker - a clump identification algorithm. Astron. Comput.
- Chilean Virtual Observatory, ChiVO, 2015....
- Cushing, M.C., et al., 2004. Spextool: a spectral extraction package for SpeX, a 0.8-5.5 μm cross-dispersed spectrograph. Publ. Astron. Soc. Pac.
- Definition of the flexible image transport system (FITS). Astron. Astrophys.
- Haralick, R.M., Shanmugam, K., Dinstein, I., 1973. Textural features for image classification. IEEE Trans. Syst. Man Cybern.
- Least squares deconvolution of the stellar intensity and polarization spectra. Astron. Astrophys.
- Multiscale morphological segmentation of gray-scale images. IEEE Trans. Image Process.