Astronomy and Computing

Volume 14, January 2016, Pages 23-34

Full length article
Indexing data cubes for content-based searches in radio astronomy

https://doi.org/10.1016/j.ascom.2016.01.002

Abstract

Methods for observing space have changed profoundly in the past few decades. The methods needed to detect and record astronomical objects have shifted from conventional observations in the optical range to more sophisticated methods that capture not only the shape of an object but also the velocity and frequency of its emissions in the millimeter-scale wavelength range and the chemical substances from which they originate. The consolidation of radio astronomy through a range of global-scale projects such as the Very Long Baseline Array (VLBA) and the Atacama Large Millimeter/submillimeter Array (ALMA) reinforces the need for better data processing methods that can automatically detect regions of interest (ROIs) within data cubes (position–position–velocity), index them, and support subsequent searches through queries on spatial coordinates and/or velocity ranges. In this article, we present the development of an automatic system for indexing ROIs in data cubes that detects and records ROIs while reducing the necessary storage space. The system is able to process data cubes containing megabytes of data in a fraction of a second without human supervision, which allows it to be incorporated into a production line for displaying objects in a virtual observatory. We conducted a comprehensive set of experiments to illustrate how our system works. The resulting index, stored in a spatial database, occupies about 3% of the input size: a compression ratio of 33:1 over an input of 20.875 GB, for an index of approximately 773 MB. Moreover, a single query can be evaluated on our system in a fraction of a second, showing that the indexing step works as a shock absorber for the computational time involved in data cube processing.
The system forms part of the Chilean Virtual Observatory (ChiVO), a member initiative of the International Virtual Observatory Alliance (IVOA) that seeks to provide the astronomical community with content-based searches on data cubes.

Introduction

Modern astronomy is characterized by the accumulation of novel methods for observing astronomical objects. Many of these rely on recording emissions in millimeter-scale wavelength ranges, observations that enable the determination of the chemical compositions of detected astronomical objects. The interest in exploring the existence of substances such as CO in galactic or extra-galactic observations has spurred the development of numerous projects, such as the Very Long Baseline Array (VLBA) (Very Long Baseline Array, VLBA, 2016) and the Atacama Large Millimeter/submillimeter Array (ALMA) (Atacama Large mm/submm Array, 2016), which are capable of recording the frequencies and velocities of emission lines from objects at astronomical distances, producing a growing data archive for astronomy research. Furthermore, it is projected that when the Square Kilometre Array (SKA) (Square Kilometre Array, SKA, 2016) begins operation in 2020, more than 60 PB of archived data will be accessible to astronomers (Berriman and Groom, 2011). This will produce a large volume of daily data traffic, creating a need for data processing methods that can contend with such large data sets. For example, it is estimated that ALMA, when running at full capacity, will generate more than 750 GB of data every day (approximately 250 TB a year) (Atacama Large mm/submm Array, 2016). The astronomical community will therefore require a high-speed data transmission system to archive the data of interest and analyze them to extract the information relevant to their needs. Because of the enormous volume of data that will be generated, it will be impractical to perform analytical procedures on the entire data set. The need for more and better automatic detection, indexing, recording, and cataloging methods is thus a key factor in the continued growth of astronomy in the 21st century.

The ultimate objective of developing automatic detection, indexing, recording, and cataloging systems is to provide integrated systems with virtual search software, also known as virtual observatories (Araya et al., 2015). Virtual observatory development initiatives are coordinated through the International Virtual Observatory Alliance (IVOA) (International Virtual Observatory Alliance, 2016), which catalogs virtual observatories around the world that provide access to and search methods for extensive collections of astronomical objects. In particular, the emergence of the Chilean Virtual Observatory (ChiVO) (Chilean Virtual Observatory, 2015) and its incorporation into the IVOA has spurred the development of a system for detecting and recording regions of interest (ROIs) in radio astronomy data, which allows for content-based searches as part of ChiVO.

In this article, we present the methods and techniques used to develop the data cube indexing system for content-based searches as part of ChiVO. The indexing system was designed under efficiency constraints. It therefore relies on computationally lightweight processes capable of single-pass data processing, which simplifies its implementation, and it handles large amounts of data while significantly reducing the cost of content-based searches. Our goal is to build an effective indexing system that absorbs computational costs at the indexing step and thereby reduces the time involved in data recovery. Our experiments show that the system is capable of creating an index at a 33:1 compression ratio. The index lets us process spatial queries in a fraction of a second, showing that the indexing step works as a shock absorber for the computational time involved in data processing. Thus, the key factor of our system is the creation of the index, where a number of design decisions are made to address the tradeoff between the quality of the approximation and the computational time involved in data processing.

The article documents, step by step, the methods used for data processing, the strategies used for signal/noise processing, the techniques used for spectrogram processing and for obtaining summary spectrograms that enable the determination of velocity ranges of interest, the methods used for stacking data slices within ranges of interest over which morphological structuration processes are applied, and the methods that assist in identifying the localization parameters for objects.

The contributions of this article include providing a detailed description of the data indexing system to serve as a potential model for future work in the field of data cube indexing. The article also presents solutions to the problem of large-scale data processing through the use of low-cost computational operations, thus addressing the requirement for online processing under high-demand conditions. This article is specifically directed to the community of astroinformatics software developers, but the concepts presented herein are also generally useful for all types of software development with a need to solve problems of large-scale data processing and indexing, particularly for multiway data.

This article is organized into the following sections. Section 2 presents a review of related works. Section 3 discusses the concepts of data cubes, morphological structuration, and shape detection. Section 4 presents the general architecture of the spectrographic cube indexing system. Sections 5 and 6 discuss the system components related to spectrographic processing and to ROI detection and indexing, respectively. Section 7 presents a comprehensive set of experiments to validate our proposal. Conclusions are presented in Section 8.


Related work

The field of radio astronomy software development has been quite active in the past few decades. Current software has been developed primarily for the manipulation, visualization, and post-processing analysis of data; this is significantly different from our system, which was designed for the online indexing and recording of data cubes. Consequently, the former systems, designed for radio astronomer end users, are restricted not by the computational costs of the methods used but rather by the

Data cubes

Radio observatories provide data in the form of three-dimensional images called data cubes for many types of radio observations, for instance single-dish observations, point-and-shoot mappings, or interferometry observations. In hyperspectral data cubes, each element of the cube represents a point in space by a set of coordinates, which requires two entries to be defined (for example, galactic latitude and longitude), and a third entry that indicates the spectrographic wavelength or velocity
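To make the three-entry layout concrete, the following is a minimal sketch of a position–position–velocity cube held as a NumPy array. The cube is synthetic, and the axis order (velocity, latitude, longitude), the dimensions, and the function name are assumptions made for illustration; they are not taken from the system described in this article.

```python
import numpy as np

# Hypothetical cube dimensions: 16 velocity channels over a 32 x 32 sky grid.
N_CHANNELS, N_LAT, N_LON = 16, 32, 32

def make_synthetic_cube():
    """Build an empty (velocity, latitude, longitude) cube with one bright source."""
    cube = np.zeros((N_CHANNELS, N_LAT, N_LON))
    # Inject emission in channels 5..8 over a small 4 x 4 patch of sky.
    cube[5:9, 10:14, 20:24] = 5.0
    return cube

cube = make_synthetic_cube()

# Fixing the two spatial entries yields a spectrum along the velocity axis.
spectrum = cube[:, 12, 22]

# Fixing the velocity entry yields a position-position (PP) image.
channel_map = cube[6, :, :]
```

Each voxel is addressed by two spatial indices and one spectral index, so both a per-position spectrum and a per-channel image are plain array slices.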

General system architecture

The system we present in this article takes spectrograph data cubes, obtained from radio astronomy observations, as inputs. The cubes are stored in FITS format, as described in Section 3.1. The system outputs indexed records to a database, which pairs the coordinates of the detected objects with the processed cubes, thus facilitating coordinate (PP) and velocity field searches. The architectural diagram of the system is shown in Fig. 1.
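As an illustration of how indexed records can support coordinate and velocity searches, here is a minimal sketch using SQLite from the Python standard library as a stand-in for the spatial database. The table layout, column names, file names, and the `build_index` and `search` helpers are all hypothetical and do not reflect the actual ChiVO schema.

```python
import sqlite3

# Hypothetical schema: one row per detected ROI, pointing back to its cube.
SCHEMA = """
CREATE TABLE roi (
    id        INTEGER PRIMARY KEY,
    cube_file TEXT NOT NULL,
    lon       REAL NOT NULL,
    lat       REAL NOT NULL,
    v_min     REAL NOT NULL,
    v_max     REAL NOT NULL
);
CREATE INDEX roi_position ON roi (lon, lat);
"""

def build_index(records):
    """Create an in-memory index from (cube_file, lon, lat, v_min, v_max) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO roi (cube_file, lon, lat, v_min, v_max) "
        "VALUES (?, ?, ?, ?, ?)",
        records,
    )
    conn.commit()
    return conn

def search(conn, lon_range, lat_range, v_range):
    """Return cubes with an ROI inside the box whose velocity span overlaps v_range."""
    rows = conn.execute(
        "SELECT DISTINCT cube_file FROM roi "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? "
        "AND v_max >= ? AND v_min <= ? ORDER BY cube_file",
        (*lon_range, *lat_range, v_range[0], v_range[1]),
    )
    return [row[0] for row in rows]

# Made-up records for three cubes.
conn = build_index([
    ("cube_a.fits", 120.5, -2.0, -10.0, 15.0),
    ("cube_b.fits", 121.0, -1.5, 40.0, 60.0),
    ("cube_c.fits", 300.0, 10.0, 0.0, 5.0),
])
hits = search(conn, (120.0, 122.0), (-3.0, 0.0), (0.0, 20.0))
```

The query touches only the small index table, never the cubes themselves, which is the sense in which the indexing step absorbs the cost of later searches.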

As shown in Fig. 1, the system consists of

Spectral processing

Here, we introduce the system element that produces a sketch of the spectrogram along the velocity axis. This sketch is meant to enable the detection of the components of highest energy in the spectrogram, thus allowing for the identification of velocity ranges with relevant information. Using these ranges, spatial projections (position–position, or PP, projections) are obtained from cube slice stacks. Each object detected in the PP projection is characterized in terms of its centroid and
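The idea of summarizing the cube along the velocity axis and stacking slices within the detected ranges can be sketched as follows. The detection rule used here (median plus k robust standard deviations of the spatially summed spectrum) is an assumption chosen for illustration, not necessarily the rule used by the system; the function names and the synthetic cube are likewise hypothetical.

```python
import numpy as np

def velocity_ranges(cube, k=3.0):
    """Return half-open (start, stop) channel ranges with outstanding emission.

    Assumption: a channel is of interest when its spatially summed flux
    exceeds median + k * sigma, with sigma estimated from the median
    absolute deviation of the summary spectrum.
    """
    spectrum = cube.sum(axis=(1, 2))              # one value per channel
    median = np.median(spectrum)
    sigma = 1.4826 * np.median(np.abs(spectrum - median))
    mask = spectrum > median + k * sigma
    # Locate rising and falling edges of the boolean mask.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(a), int(b)) for a, b in zip(edges[::2], edges[1::2])]

def stack_slices(cube, start, stop):
    """Sum the channel maps in [start, stop) into one PP projection."""
    return cube[start:stop].sum(axis=0)

# Synthetic cube: a slow baseline drift plus one source in channels 5..8.
cube = np.zeros((16, 32, 32))
cube += 0.001 * np.arange(16)[:, None, None]
cube[5:9, 10:14, 20:24] += 5.0

ranges = velocity_ranges(cube)
projections = [stack_slices(cube, start, stop) for start, stop in ranges]
```

Only the one-dimensional summary spectrum is inspected to choose the ranges, so the full cube is read once and the spatial analysis then runs on a handful of two-dimensional projections.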

Detection and indexing of objects in ROIs

The detection and indexing of objects in PP slices makes use of binary images during the shape recognition step. We begin our description of this component of the system by reviewing a number of properties of the thresholding/opening operation. This is the first step in the detection and indexing of ROIs using the stacked PP projections obtained from the spectrographic processing phase.
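A minimal sketch of the thresholding/opening step, followed by region labeling and centroid extraction, is given below using NumPy only. The 3 x 3 structuring element, the 4-connectivity, the fixed threshold, and all function names are assumptions made for this illustration rather than the parameters of the actual system.

```python
import numpy as np

def _neighbourhoods(mask):
    """Yield the nine shifted copies that make up each pixel's 3x3 window."""
    padded = np.pad(mask, 1, constant_values=False)
    rows, cols = mask.shape
    for dy in range(3):
        for dx in range(3):
            yield padded[dy:dy + rows, dx:dx + cols]

def erode(mask):
    """Keep a pixel only if its whole 3x3 window is set."""
    return np.logical_and.reduce(list(_neighbourhoods(mask)))

def dilate(mask):
    """Set a pixel if any pixel in its 3x3 window is set."""
    return np.logical_or.reduce(list(_neighbourhoods(mask)))

def opening(mask):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(mask))

def label_regions(mask):
    """Return one array of (row, col) pixels per 4-connected region."""
    seen = np.zeros(mask.shape, dtype=bool)
    regions = []
    for seed in zip(*np.nonzero(mask)):
        if seen[seed]:
            continue
        seen[seed] = True
        stack, pixels = [seed], []
        while stack:
            y, x = stack.pop()
            pixels.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    stack.append((ny, nx))
        regions.append(np.array(pixels))
    return regions

def detect_objects(image, threshold):
    """Threshold a PP projection, clean it, and describe each region."""
    objects = []
    for pixels in label_regions(opening(image > threshold)):
        (y0, x0), (y1, x1) = pixels.min(axis=0), pixels.max(axis=0)
        cy, cx = pixels.mean(axis=0)
        objects.append({"centroid": (float(cy), float(cx)),
                        "bbox": (int(y0), int(x0), int(y1), int(x1))})
    return objects

# Synthetic PP projection: one 4x4 source and one isolated bright pixel.
image = np.zeros((32, 32))
image[10:14, 20:24] = 20.0
image[3, 3] = 20.0
objects = detect_objects(image, threshold=10.0)
```

In this example the opening discards the isolated pixel while restoring the extent of the larger source, so only the compact region is passed on for indexing.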

Experiments

In this section, we report a comprehensive set of experiments that validate our indexing method. We start with the first component of our system, examining the impact of the sampling rate on spectra processing and velocity field extraction. In a second set of experiments, we evaluate the performance of the multiscale representation, illustrating its effect on data reduction, a key factor for an effective indexing method.

Here, we present the results of

Conclusions

In this article, we presented a system for data cube indexing. The implementation of this system in ChiVO will allow for content-based searches in cubes provided by the virtual observatory and will minimize the time involved in data recovery from large-scale projects such as ALMA. The methods we presented are simple, effective, and incur low computational costs, an essential requirement for processing the enormous volumes of data produced by observatories. The code used to implement the various

Reproducible research

We release our system for software development and research purposes on the GitHub open-source platform. The system code can be cloned using git clone at:

The repository gives full access to the source code. Basic instructions and commands are also included, along with scripts that reproduce our results. The system is licensed under the terms of the GNU General Public License (GPL).

Acknowledgments

This research was made possible by funding from CONICYT-Chile, specifically through the project FONDEF D11I1060 and the project ICHAA 79130008. Mr. Mendoza was supported by Basal Project FB-0821.

References (38)

  • M. Araya et al., 2015. A brief survey on the virtual observatory. New Astron.
  • Araya, M., Solar, M., Mardones, D., Hochfarber, T., 2014. Exorcising the ghost in the machine: synthetic spectral data...
  • Atacama Large mm/submm Array, ALMA, 2016....
  • ASYDO, Astronomical Synthetic Data Observations, 2015....
  • Berkeley Illinois Maryland Association, BIMA, 2004....
  • B. Berriman et al., 2011. How will astronomy archives survive the data tsunami? ACM Queue.
  • Berry, D.S., 2007. CUPID: A clump identification and analysis package. Astronomical data analysis software and systems....
  • D.S. Berry, 2015. FellWalker - a clump identification algorithm. Astron. Comput.
  • Chilean Virtual Observatory, ChiVO, 2015....
  • M. Cushing et al., 2004. Spextool: a spectral extraction package for SpeX, a 0.8-5.5 μm cross-dispersed spectrograph. Publ. Astron. Soc. Pac.
  • Dame, T.M., 2011. Optimization of Moment Masking for CO Spectral Line, eprint...
  • R. Hanisch et al., 2001. Definition of the flexible image transport system (FITS). Astron. Astrophys.
  • R. Haralick et al., 1973. Textural features for image classification. IEEE Trans. Syst. Man Cybern.
  • International Virtual Observatory Alliance, IVOA, 2016....
  • O. Kochukhov et al., 2010. Least squares deconvolution of the stellar intensity and polarization spectra. Astron. Astrophys.
  • S. Mukhopadhyay et al., 2003. Multiscale morphological segmentation of gray-scale images. IEEE Trans. Image Process.
  • National Radio Astronomy Observatory, CASA, the Common Astronomy Software Applications package, 2016....
  • National Radio Astronomy Observatory, GBTIDL: Data reduction for the GBT using IDL, 2005....
  • National Radio Astronomy Observatory, Green Bank Site, GBT, 2016....