Indexing data cubes for content-based searches in radio astronomy
Introduction
Modern astronomy is characterized by the accumulation of novel methods for observing astronomical objects. Many of these have shifted toward recording emission in millimeter-scale wavelength ranges, observations that enable the determination of the chemical composition of detected astronomical objects. Interest in detecting substances such as CO in galactic or extragalactic observations has spurred the development of numerous projects, such as the Very Long Baseline Array (VLBA) (Very Long Baseline Array, VLBA, 2016) and the Atacama Large Millimeter/submillimeter Array (ALMA) (Atacama Large mm/submm Array, 2016), which are capable of recording the frequencies and velocities of emission lines from objects at astronomical distances, producing a growing data archive for astronomy research. Furthermore, it is projected that when the Square Kilometre Array (SKA) (Square Kilometre Array, SKA, 2016) begins operation in 2020, more than 60 PB of archived data will be accessible to astronomers (Berriman and Groom, 2011). This will produce a large volume of daily data traffic, creating a need for data processing methods capable of contending with such large data sets. For example, it is estimated that ALMA, running at full capacity, will generate more than 750 GB of data every day (approximately 250 TB a year) (Atacama Large mm/submm Array, 2016). The astronomical community will therefore require high-speed data transmission systems to archive the data of interest and analyze them to extract relevant information. Because of the enormous volume of data that will be generated, it will be impractical to perform analytical procedures on the entire data set. The need for more and better automatic detection, indexing, recording, and cataloging methods is thus a key factor in the continued growth of astronomy in the 21st century.
The ultimate objective of developing automatic detection, indexing, recording, and cataloging systems is to provide integrated systems with virtual search software, also known as virtual observatories (Araya et al., 2015). Virtual observatory development initiatives are coordinated through the International Virtual Observatory Alliance (IVOA) (International Virtual Observatory Alliance, 2016), which catalogs virtual observatories around the world that provide access to and search methods for extensive collections of astronomical objects. In particular, the emergence of the Chilean Virtual Observatory (ChiVO) (Chilean Virtual Observatory, 2015) and its incorporation into the IVOA has spurred the development of a system for detecting and recording regions of interest (ROIs) in radio astronomy data, which allows for content-based searches as part of ChiVO.
In this article, we present the methods and techniques used to develop the data cube indexing system for content-based searches as part of ChiVO. The indexing system was designed under efficiency constraints and therefore incorporates computationally lightweight processes capable of single-pass data processing (thus facilitating its implementation) and of handling large amounts of data, while significantly reducing the cost of content-based searches. Our goal is to build an effective indexing system that absorbs computational costs at the indexing step, reducing the time involved in data recovery. Our experiments show that the system is capable of creating an index at a 33:1 compression ratio. The index allows spatial queries to be processed in fractions of a second, showing that the indexing step acts as a shock absorber for the computational time involved in data processing. The key factor of our system is thus the index creation process, where a number of design decisions are made to address the tradeoff between the quality of the approximation and the computational time involved in data processing.
The article documents, step by step, the methods used for data processing; the strategies used for signal/noise processing; the techniques used for spectrogram processing and for obtaining summary spectrograms that enable the determination of velocity ranges of interest; the methods used for stacking data slices within ranges of interest, over which morphological structuration processes are applied; and the methods that assist in identifying the localization parameters of objects.
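As a rough illustration of how these steps chain together, the following Python sketch runs a toy version of the pipeline on a numpy cube. The function name, the per-channel peak summary, and the fixed thresholds are assumptions made for illustration only, not the system's actual statistics or parameters.

```python
import numpy as np
from scipy import ndimage

def index_cube(cube, spec_threshold=0.5, spatial_threshold=0.5):
    """Toy indexing pipeline over a (channels, y, x) data cube."""
    # 1. Summary spectrum along the velocity axis (here: per-channel peak).
    summary = cube.max(axis=(1, 2))
    # 2. Velocity ranges of interest: runs of channels above a fraction of the peak.
    mask = summary > spec_threshold * summary.max()
    labels, n_ranges = ndimage.label(mask)
    records = []
    for i in range(1, n_ranges + 1):
        channels = np.where(labels == i)[0]
        # 3. Stack the slices in the range into a position-position (PP) projection.
        pp = cube[channels].sum(axis=0)
        # 4. Morphological structuration: threshold then binary opening to clean noise.
        binary = ndimage.binary_opening(pp > spatial_threshold * pp.max())
        # 5. Localize each connected component by its centroid.
        obj_labels, n_objs = ndimage.label(binary)
        centroids = ndimage.center_of_mass(binary, obj_labels,
                                           list(range(1, n_objs + 1)))
        for cy, cx in centroids:
            records.append({"v_range": (channels[0], channels[-1]),
                            "centroid": (cy, cx)})
    return records
```

Running this on a synthetic cube containing a single compact emission region yields one record with the velocity range and spatial centroid of the source.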
The contributions of this article include providing a detailed description of the data indexing system to serve as a potential model for future work in the field of data cube indexing. The article also presents solutions to the problem of large-scale data processing through the use of low-cost computational operations, thus addressing the requirement for online processing under high-demand conditions. This article is specifically directed to the community of astroinformatics software developers, but the concepts presented herein are also generally useful for all types of software development with a need to solve problems of large-scale data processing and indexing, particularly for multiway data.
This article is organized into the following sections. Section 2 presents a review of related work. Section 3 discusses the concepts of data cubes, morphological structuration, and shape detection. Section 4 presents the general architecture of the spectrographic cube indexing system. Sections 5 and 6 discuss the system components related to spectrographic processing and to ROI detection and indexing, respectively. Section 7 presents a comprehensive set of experiments to validate our proposal. Conclusions are presented in Section 8.
Related work
The field of radio astronomy software development has been quite active in the past few decades. Current software has been developed primarily for the manipulation, visualization, and post-processing analysis of data; this is significantly different from our system, which was designed for the online indexing and recording of data cubes. Consequently, the former systems, designed for radio astronomer end users, are restricted not by the computational costs of the methods used but rather by the
Data cubes
Radio observatories provide data in the form of three-dimensional images called data cubes for many types of radio observations, such as single-dish observations, point-and-shoot mappings, and interferometry observations. In hyperspectral data cubes, each element of the cube represents a point in space through two spatial coordinates (for example, galactic latitude and longitude) and a third entry that indicates the spectrographic wavelength or velocity
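This layout can be illustrated with a small sketch. The axis order (spectral channel first, then the two spatial axes) is an assumption for illustration; FITS stores the spatial axes first, so numerical libraries typically read a cube with the axes reversed.

```python
import numpy as np

# A toy data cube: 100 spectral channels over a 64x64 spatial grid.
cube = np.zeros((100, 64, 64))

# The spectrum of a single sky position is a 1-D slice along the spectral axis ...
spectrum = cube[:, 32, 32]     # 100 values, one per channel
# ... while a single channel is a 2-D position-position (PP) image.
channel_map = cube[42]         # 64x64 image
```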
General system architecture
The system we present in this article takes spectrographic data cubes, obtained from radio astronomy observations, as inputs. The cubes are stored in FITS format, as described in Section 3.1. The system outputs indexed records to a database, pairing the coordinates of the detected objects with the processed cubes and thus facilitating coordinate (PP) and velocity field searches. The architectural diagram of the system is shown in Fig. 1.
As shown in Fig. 1, the system consists of
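The indexed output described above can be pictured as a relational table pairing detected objects with their source cubes. The schema, column names, and sample values below are hypothetical, chosen only to show why a content-based search over the index reduces to a cheap range query rather than a scan over the raw cubes.

```python
import sqlite3

# Hypothetical index schema: one row per detected object.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE roi_index (
        cube_file TEXT,   -- FITS cube the object was detected in
        ra        REAL,   -- spatial (PP) coordinates of the centroid
        dec       REAL,
        v_min     REAL,   -- velocity range of interest
        v_max     REAL
    )""")
conn.execute("INSERT INTO roi_index VALUES ('orion.fits', 83.82, -5.39, 2.0, 14.0)")

# A content-based search becomes a range query over the index.
hits = conn.execute(
    "SELECT cube_file FROM roi_index "
    "WHERE ra BETWEEN ? AND ? AND dec BETWEEN ? AND ? "
    "AND v_min <= ? AND v_max >= ?",
    (83.0, 84.0, -6.0, -5.0, 10.0, 10.0)).fetchall()
```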
Spectral processing
Here, we introduce the system element that produces a sketch of the spectrogram along the velocity axis. This sketch is meant to enable the detection of the components of highest energy in the spectrogram, thus allowing for the identification of velocity ranges with relevant information. Using these ranges, spatial projections (position–position, or PP, projections) are obtained from cube slice stacks. Each object detected in the PP projection is characterized in terms of its centroid and
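One way to picture such a sketch of the spectrogram is the following minimal example, which collapses the cube into a one-dimensional summary and extracts the above-threshold channel intervals. The per-channel RMS statistic and the sigma-clipping threshold are assumptions for illustration; the system's actual summary may differ.

```python
import numpy as np

def velocity_ranges(cube, k=3.0):
    """Collapse a (channels, y, x) cube into a 1-D summary and return
    the channel intervals whose energy stands above a noise threshold."""
    summary = np.sqrt((cube ** 2).mean(axis=(1, 2)))   # per-channel RMS energy
    threshold = summary.mean() + k * summary.std()     # simple sigma clipping
    above = summary > threshold
    # Group consecutive above-threshold channels into (start, end) intervals.
    ranges, start = [], None
    for ch, flag in enumerate(above):
        if flag and start is None:
            start = ch
        elif not flag and start is not None:
            ranges.append((start, ch - 1))
            start = None
    if start is not None:
        ranges.append((start, len(above) - 1))
    return ranges
```

On a noise cube with emission injected into channels 20-24, the function recovers that single interval.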
Detection and indexing of objects in ROIs
The detection and indexing of objects in PP slices makes use of binary images during the shape recognition step. We begin our description of this component of the system by reviewing a number of properties of the thresholding/opening operation. This is the first step in the detection and indexing of ROIs using the stacked PP projections obtained from the spectrographic processing phase.
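The effect of the thresholding/opening step can be illustrated with scipy's binary opening on a synthetic binary PP slice; the image, shapes, and sizes here are invented for the example.

```python
import numpy as np
from scipy import ndimage

# A binary PP slice: one extended 5x5 source plus scattered single-pixel noise.
img = np.zeros((20, 20), dtype=bool)
img[8:13, 8:13] = True            # extended object: survives opening
img[2, 2] = img[15, 4] = True     # isolated noise pixels: removed by opening

opened = ndimage.binary_opening(img)   # erosion followed by dilation

# Only the extended object remains; its centroid localizes it for the index.
labels, n = ndimage.label(opened)
centroid = ndimage.center_of_mass(opened)
```

Because opening is an erosion followed by a dilation, structures smaller than the structuring element vanish, while larger ones keep their location and approximate extent.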
Experiments
In this section, we report a comprehensive set of experiments validating our indexing method. We start by exploring the performance of the first component of our system, examining the impact of the sampling rate on spectral processing and velocity field extraction. In a second set of experiments, we explore the performance of the multiscale representation, illustrating its effect on data reduction, a key factor for an effective indexing method.
Here, we present the results of
Conclusions
In this article, we presented a system for data cube indexing. The implementation of this system in ChiVO will allow for content-based searches in cubes provided by the virtual observatory and will minimize the time involved in data recovery from large-scale projects such as ALMA. The methods we presented are simple, effective, and incur low computational costs, an essential requirement for processing the enormous volumes of data produced by observatories. The code used to implement the various
Reproducible research
We release our system for software development and research purposes under the GitHub open code platform. The system code can be cloned using git clone at:
The full system code is openly available. Basic instructions and commands are included, along with scripts that allow our results to be reproduced. The system is licensed under the GNU GENERAL PUBLIC LICENSE (GPL).
Acknowledgments
This research was made possible by CONICYT-Chile funding, specifically through the project FONDEF D11I1060 and the project ICHAA 79130008. Mr. Mendoza was supported by Basal Project FB-0821.
References
- Araya, M., et al., 2015. A brief survey on the virtual observatory. New Astron.
- Araya, M., Solar, M., Mardones, D., Hochfarber, T., 2014. Exorcising the ghost in the machine: synthetic spectral data...
- Atacama Large mm/submm Array, ALMA, 2016....
- ASYDO, Astronomical Synthetic Data Observations, 2015....
- Berkeley Illinois Maryland Association, BIMA, 2004....
- Berriman, G.B., Groom, S.L., 2011. How will astronomy archives survive the data tsunami? ACM Queue.
- Berry, D.S., 2007. CUPID: a clump identification and analysis package. Astronomical Data Analysis Software and Systems....
- Berry, D.S., 2015. FellWalker - a clump identification algorithm. Astron. Comput.
- Chilean Virtual Observatory, ChiVO, 2015....
- Cushing, M.C., et al., 2004. Spextool: a spectral extraction package for SpeX, a 0.8-5.5 μm cross-dispersed spectrograph. Publ. Astron. Soc. Pac.
- Definition of the flexible image transport system (FITS). Astron. Astrophys.
- Haralick, R.M., Shanmugam, K., Dinstein, I., 1973. Textural features for image classification. IEEE Trans. Syst. Man Cybern.
- Least squares deconvolution of the stellar intensity and polarization spectra. Astron. Astrophys.
- Multiscale morphological segmentation of gray-scale images. IEEE Trans. Image Process.