Exploiting geospatial and chronological characteristics in data streams to enable efficient storage and retrievals

https://doi.org/10.1016/j.future.2012.05.024

Abstract

We describe the design of a high-throughput storage system, Galileo, for data streams generated in observational settings. To cope with data volumes, the shared-nothing architecture in Galileo supports incremental assimilation of nodes, while accounting for heterogeneity in their capabilities. To achieve efficient storage and retrievals of data, Galileo accounts for the geospatial and chronological characteristics of such time-series observational data streams. Our benchmarks demonstrate that Galileo supports high-throughput storage and efficient retrievals of specific portions of large datasets while supporting different types of queries.

Highlights

  • Distributed file system designed for storing billions of scientific data files.
  • High-throughput storage of geospatial streams.
  • Fast retrieval and query support.
  • Distributed computation support.
  • Visualization capabilities.

Introduction

There has been a steady increase in the number and type of observational devices. Data from such devices must be stored for (1) processing that relies on access to historical data to make forecasts, and (2) visualizing how the observational data change over time for a given spatial area. Data produced by such observational devices can be thought of as time-series data streams; a device generates packets periodically or as part of configured change notifications. Data packets generated in these settings contain measurements from multiple, proximate locations. These measurements can be made by a single device (e.g., volumetric scans generated by radars) or by multiple devices (e.g., sensors send data to a base station that collates multiple observations to generate a single packet).

Observational data have spatio-temporal characteristics. Each measurement represents a feature of interest such as temperature, pressure, or humidity. The measurement is tied to a specific location and elevation, and has a timestamp associated with it. While individual packets within an observational stream may not be large (often on the order of kilobytes), the frequency of the reported measurements, combined with increases in the number and type of devices, leads to increasing data volumes.
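To make these characteristics concrete, the following minimal sketch shows how a single observation within such a packet might be modeled; the class and field names are illustrative assumptions, not the wire or on-disk format used by any particular system.

```java
// Illustrative sketch only: field names are assumptions, not an
// actual stream packet layout.
public class Observation {
    String feature;    // feature of interest, e.g. "temperature"
    double latitude;   // degrees
    double longitude;  // degrees
    double elevation;  // meters above sea level
    long timestamp;    // milliseconds since the Unix epoch
    double value;      // the reported measurement
}
```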

Our targeted usage scenario is in the atmospheric domain, where data from such measurements are used as inputs to weather forecasting models and visualization schemes. These usage patterns entail access to historical data to validate new models, identify correlations or trends, and visualize feature changes over time. We need to be able to access specific portions of the data efficiently to ensure faster completion of the aforementioned activities.

Research challenges in designing a storage framework for such observational data include the following.

  • 1.

    Support for a scale-out architecture. The system requires an extensible architecture that can assimilate nodes one at a time to meet increased data storage requirements.

  • 2.

    High throughput storage of data. Given the number of data sources, we must be able to store data streams arriving at high rates. We measure throughput in terms of the total number of stream packets stored by the system over a period of time.

  • 3.

    Efficient retrievals of specific portions of the data. Given the large data volumes involved, we must support fast sifting of stored data streams in response to queries that target a specific feature at a specific time for a given geospatial area. To accomplish this, we must account for the spatial and temporal characteristics of the data during the storage process and in turn use this metadata for efficient retrievals (a sketch of one such spatiotemporal key appears after this list).

  • 4.

    Fast detection of non-matching queries. Often, query parameters are adjusted based on results from past queries. To support fine-tuning of queries, we must be able to detect, accurately and efficiently, situations where no data match a specified query.

  • 5.

    Range query support. We must be able to support range queries over both the spatial and temporal dimensions while ensuring that support for such queries does not result in unacceptable overheads.

  • 6.

    Dynamic indexing strategies. The system should allow its indexing functionality to be adaptively reconfigured to better service different usage patterns and reduce latencies.

  • 7.

    Extensive data format support. There are vast amounts of data stored in established scientific storage formats. Our system must not require a particular input format, so that it remains useful for researchers who have already invested in an existing format, and it must be able to read and understand a variety of metadata without any loss of fidelity from conversion.

  • 8.

    Efficient processing of stored data. Our system should not only facilitate data retrieval, but also simplify launching distributed computations for processing data using the system’s indexing semantics as inputs.

  • 9.

    Integrated support for visualization. The system should provide functionality for visualizing data in a number of ways to facilitate analysis.

  • 10.

    Failure recovery. We must account for possible failures and data corruption at individual nodes. Recovery from failures must be fast and consistent.
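To ground requirement 3 above, the sketch below shows one way a spatiotemporal index key can be derived: pairing a Geohash prefix (spatial locality) with a coarse time bucket (temporal locality). The key layout and helper names are illustrative assumptions; they follow the general Geohash technique rather than Galileo's exact metadata format.

```java
// A minimal sketch of a combined spatiotemporal index key: a Geohash
// prefix for spatial locality plus an hourly time bucket for temporal
// locality. Illustrative only; not Galileo's exact key layout.
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SpatioTemporalKey {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Standard Geohash encoding: interleave longitude/latitude bits,
    // emitting one base32 character per 5 bits.
    static String geohash(double lat, double lon, int precision) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true;  // even-numbered bits encode longitude
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            double[] range = evenBit ? lonRange : latRange;
            double value = evenBit ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            ch <<= 1;
            if (value >= mid) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
            evenBit = !evenBit;
            if (++bit == 5) {
                hash.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    // Combine a truncated Geohash with an hourly UTC time bucket.
    static String key(double lat, double lon, long epochMillis) {
        String bucket = DateTimeFormatter.ofPattern("yyyyMMddHH")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochMilli(epochMillis));
        return geohash(lat, lon, 5) + "/" + bucket;
    }

    public static void main(String[] args) {
        // Fort Collins, CO at an arbitrary timestamp.
        System.out.println(key(40.5853, -105.0844, 1325376000000L));
    }
}
```

Truncating the Geohash to fewer characters widens the spatial cell covered by a key, which is what makes prefix-based range queries over an area natural.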

This paper describes the design of a demonstrably high-throughput geospatial data storage system, Galileo. The storage framework is distributed and incrementally scalable, with the ability to assimilate new storage nodes as they become available. The storage subsystem organizes the storage and dispersion of data streams to support fast, efficient range queries targeting specific features across the spatial and temporal dimensions. A dynamic indexing scheme helps the system respond to differing load conditions, and a flexible framework allows a number of observational data formats to be read and understood by the system. To withstand failures and recover from corruption of specific blocks, the system relies on replication. Most importantly, our benchmarks demonstrate the feasibility of designing high-throughput data storage from commodity nodes while accounting for differences in the capabilities of these nodes. Leveraging heterogeneity in the available nodes is particularly useful in cloud settings, where newer nodes tend to have better storage and processing capabilities.

In the following section, the architecture of Galileo is discussed, including an overview of how data is stored to disk, the network layout, and how data is positioned and replicated within the system. Next, an overview of the system’s scientific data format support infrastructure is detailed in Section 3, including information about the runtime plugin architecture and the implementation of NetCDF format support. In Section 4, the dataset and query system is described, followed by details of Galileo’s built-in distributed computation API in Section 5. Section 6 explores using the system for client-side data visualization. Section 7 presents benchmarks of our system’s capabilities, and Section 8 provides a brief survey of related technologies in the field. Section 9 reports conclusions from our research and discusses the future direction of the project.

Since the initial publication of Galileo: A Framework for Distributed Storage of High-Throughput Data Streams [1], we have added several key extensions for this special issue. Updates to the system architecture are detailed, including block versioning, compression support, and cross-group replication. We have also added support for launching distributed computations on the data stored in Galileo, streaming visualization, and converting other storage types to our metadata format through a plugin architecture. Finally, a number of new benchmarks have been added, including a comparison with Hadoop and timing information for visualization, computation launching, graph reorientation, and metadata conversion.

Section snippets

System architecture

Galileo runs as a computation on the Granules Runtime for Cloud Computing [2]. Granules is an ideal platform to build upon because it provides a basis for streaming communication between nodes in the system and for incoming data streams as well. As data enters the system, it can be sifted and pre-processed with the Granules runtime and then stored in a distributed manner across multiple machines with Galileo. When accessing data, users have the option of pushing their computations out to
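Once blocks carry spatiotemporal keys, each incoming block must be mapped to one of the participating machines. The sketch below assumes simple modular hashing purely for illustration; it is not Galileo's actual group-based placement logic.

```java
import java.util.List;

// A minimal placement sketch: map a block's index key to a storage node.
// Assumes simple modular hashing; illustrates dispersal only, not
// Galileo's actual group-based placement.
public class Placement {
    static int nodeFor(String spatioTemporalKey, List<String> nodes) {
        // Math.floorMod keeps the index non-negative for negative hashes.
        return Math.floorMod(spatioTemporalKey.hashCode(), nodes.size());
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-01", "node-02", "node-03");
        System.out.println(nodes.get(nodeFor("9xjq8/2012010100", nodes)));
    }
}
```

Hashing on a shared Geohash prefix, rather than the full key, would keep spatially proximate blocks on the same group of nodes and preserve locality for range queries.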

Scientific data format support

While the scientific community has invested considerable time and effort in collecting and storing vast amounts of data, there has also been a great deal of investment in the formats used for storing these volumes of data. The storage formats in question could include simple comma-separated plain text files, relational database management systems, proprietary storage formats, distributed storage, or multi-dimensional arrays stored as binary files. This diversity in storage formats is largely
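Conceptually, such format support can sit behind a small reader contract that each format plugin implements. The interface below is a hypothetical sketch rather than Galileo's published plugin API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

// Hypothetical sketch of a format-reader plugin contract. A concrete
// plugin (e.g., for NetCDF) would translate format-specific headers into
// a common metadata map without altering the underlying payload.
public interface FormatReader {
    // Whether this plugin recognizes the given file name or signature.
    boolean supports(String fileName);

    // Extract metadata (feature names, coordinates, timestamps) from the
    // input without lossy conversion of the original data.
    Map<String, Object> readMetadata(InputStream in) throws IOException;
}
```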

Information retrieval: datasets

Galileo’s information retrieval process is different from traditional databases or key-value stores. Instead of matching user-submitted queries against the data available in the system and returning the raw data, Galileo streams metadata of the matching blocks back to the requestor incrementally and our client-side API transparently collates these metadata blocks into a traversable dataset graph. This dataset is a subset of Galileo’s in-memory metadata graph, and describes the attributes of the
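The client-side collation described here can be pictured as folding each metadata block into a growing, inspectable result as it arrives. The types below are hypothetical and simplified (blocks modeled as strings rather than graph nodes); they illustrate the incremental pattern, not Galileo's client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: metadata blocks stream in from storage nodes and
// are folded incrementally into a client-side dataset, so partial results
// can be inspected before the full response arrives.
public class DatasetCollator {
    private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
    private final List<String> dataset = new ArrayList<>();

    // Called by the network layer as each metadata block arrives.
    public void onMetadataBlock(String block) {
        incoming.add(block);
    }

    // Drain whatever has arrived so far into the traversable dataset.
    public List<String> snapshot() {
        incoming.drainTo(dataset);
        return List.copyOf(dataset);
    }
}
```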

Distributed computation API

Galileo is tightly integrated with the Granules [2] cloud runtime, which supports expressing computations using the MapReduce paradigm or as directed, cyclic graphs. In this case, a computation simply refers to a distributed, executable task that is deployed and run across a cluster of machines. These computations could include operations such as data analysis, pre-processing, transformation, and visualization. Galileo runs within the Granules computation framework as well, underscoring the
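In spirit, launching such a computation amounts to shipping a task to the nodes holding the matching blocks and then combining their partial results. The interface below is a generic map-reduce-style illustration under that assumption; it does not reproduce the Granules API.

```java
import java.util.List;
import java.util.Map;

// Generic illustration of a map-reduce style task over stored blocks;
// conveys the shape of the idea, not the Granules API.
public interface BlockComputation<R> {
    // Runs on each storage node, against the blocks it holds locally.
    R map(List<byte[]> localBlocks);

    // Runs once over the per-node partial results, keyed by node name.
    R reduce(Map<String, R> partialResultsByNode);
}
```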

Client-side visualization

One common use case for scientific data stored in a distributed file system like Galileo is real-time visualization. These visualizations could include biological, medical, or meteorological data and can provide insights to researchers in these fields. Another related area is Geographic Information Systems (GIS), which can involve visualizing organization-specific data with an object-oriented approach. GIS systems are often powered by a database backend which is used to provide detailed

Benchmarks

To test the capabilities of Galileo’s storage system, we ran benchmarks on a 48-node Xeon-based cluster of servers with a gigabit Ethernet interconnect. Each server in the cluster was equipped with 12 GB of RAM and a 300 GB, 15,000 RPM hard disk formatted with the ext4 file system. The benchmarks were run on the OpenJDK Java Runtime Environment, version 1.6.0_22.

One billion (1,000,000,000) random data blocks were generated for the experiments and dispersed across the 48 machines, each

Related work

Hadoop [30] and its accompanying file system, HDFS [18], share some common objectives with Galileo. Hadoop is an implementation of the MapReduce framework, and HDFS can be used to store and retrieve results from computations orchestrated by Hadoop. A primary difference between HDFS and Galileo is the role of metadata in the two systems; HDFS is designed for more general-purpose storage needs, and cannot perform the indexing optimizations Galileo’s geospatial metadata makes possible. HDFS is also

Conclusions

A shared-nothing architecture allows incremental addition of nodes into the storage network with a proportional improvement in system throughput. Efficient evaluation of queries is made possible by the following:

  • 1.

    Accounting for spatio-temporal relationships in the distributed storage of observational data streams.

  • 2.

    Separating metadata from content.

  • 3.

    Maintaining an efficient representation of the metadata graph in memory.

  • 4.

    Distributed, concurrent evaluation of queries.

Continuous streaming of partial


References (39)

  • M. Malensek, S. Pallickara, S. Pallickara, Galileo: a framework for distributed storage of high-throughput data...
  • S. Pallickara et al., Granules: a lightweight, streaming runtime for cloud computing with support for map-reduce.
  • J. Dean et al., MapReduce: simplified data processing on large clusters, Communications of the ACM (2008).
  • S. Ghemawat et al., The Google file system.
  • D. Hastorun, M. Jampani, G. Kakulapati, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, Dynamo: Amazon's highly...
  • K. Ericson et al., Adaptive heterogeneous language support within a cloud runtime, Future Generation Computer Systems (2011).
  • K. Ericson et al., Analyzing electroencephalograms using cloud computing techniques.
  • K. Ericson et al., Handwriting recognition using a cloud runtime.
  • P. Cudré-Mauroux et al., A demonstration of SciDB: a science-oriented DBMS, Proceedings of the VLDB Endowment (2009).
  • P. Brown, Overview of SciDB: large scale array storage, processing and analysis.
  • R. Rew et al., NetCDF: an interface for scientific data access, IEEE Computer Graphics and Applications (1990).
  • D. Wells et al., FITS—a flexible image transport system, Astronomy and Astrophysics Supplement Series (1981).
  • W. Contributors, Geohash. Wikipedia.org,...
  • Q. Koziol, R. Matzke, HDF5—a new generation of HDF: reference manual and user guide, National Center for Supercomputing...
  • F. Chang et al., Bigtable: a distributed storage system for structured data, ACM Transactions on Computer Systems (TOCS) (2008).
  • I. Stoica et al., Chord: a scalable peer-to-peer lookup service for Internet applications, ACM SIGCOMM Computer Communication Review (2001).
  • A. Rowstron et al., Pastry: scalable, decentralized object location, and routing for large-scale peer-to-peer systems.
  • K. Shvachko et al., The Hadoop distributed file system.
  • A. Lakshman et al., Cassandra: a decentralized structured storage system, ACM SIGOPS Operating Systems Review (2010).

Matthew Malensek is a graduate student in the Department of Computer Science at Colorado State University. His research interests include distributed systems and cloud computing.

Sangmi Lee Pallickara is a research scientist in the Department of Computer Science at Colorado State University. She received her Master's and Ph.D. degrees in Computer Science from Syracuse University and Florida State University, respectively. Her research interests are in the area of large-scale scientific data management, data mining, scientific metadata, and data-intensive computing.

Shrideep Pallickara is an Assistant Professor in the Department of Computer Science at Colorado State University. He received his Master's and Ph.D. degrees from Syracuse University. His research interests are in the area of large-scale distributed systems, specifically cloud computing and streaming.
