Exploiting geospatial and chronological characteristics in data streams to enable efficient storage and retrievals
Highlights
- Distributed file system designed for storing billions of scientific data files.
- High-throughput storage of geospatial streams.
- Fast retrieval and query support.
- Distributed computation support.
- Visualization capabilities.
Introduction
There has been a steady increase in the number and type of observational devices. Data from such devices must be stored for (1) processing that relies on access to historical data to make forecasts, and (2) visualizing how the observational data change over time for a given spatial area. Data produced by such devices can be thought of as time-series data streams: a device generates packets periodically or as part of configured change notifications. Data packets generated in these settings contain measurements from multiple, proximate locations. These measurements can be made by a single device (e.g., volumetric scans generated by radars) or by multiple devices (e.g., sensors send data to a base station that collates multiple observations to generate a single packet).
Observational data have spatio-temporal characteristics. Each measurement represents a feature of interest such as temperature, pressure, or humidity. The measurement is tied to a specific location and elevation, and has a timestamp associated with it. While individual packets within an observational stream may not be large (often on the order of kilobytes), the frequency of the reported measurements combined with increases in the number and type of devices leads to growing data volumes.
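As a concrete illustration of such a measurement (the field names below are ours for exposition, not Galileo's actual schema), a single observation in one of these streams can be modeled as a small, self-describing record:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    """A single spatio-temporal measurement from an observational stream."""
    feature: str      # feature of interest, e.g. "temperature"
    value: float      # the measured value
    lat: float        # latitude of the measurement location
    lon: float        # longitude of the measurement location
    elevation: float  # elevation in meters
    timestamp: int    # epoch milliseconds when the reading was taken

# One temperature reading; a stream packet would carry many such records
# from multiple, proximate locations.
obs = Observation("temperature", 21.4, 40.57, -105.08, 1525.0, 1325376000000)
```

A packet of kilobyte size can hold dozens of these records, which is why per-packet size matters far less than the sustained packet arrival rate.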
Our targeted usage scenario is in the atmospheric domain where data from such measurements are used as inputs to weather forecasting models and visualization schemes. These usage patterns entail access to historical data to validate new models, identify correlations or trends, and visualize feature changes over time. We need to be able to access specific portions of the data efficiently to ensure faster completions of the aforementioned activities.
Research challenges in designing a storage framework for such observational data include the following.
1. Support for a scale-out architecture. An extensible architecture that can assimilate nodes one at a time to support increased data storage requirements.
2. High-throughput storage of data. Given the number of data sources, we must be able to store data streams arriving at high rates. We measure throughput in terms of the total number of stream packets stored by the system over a period of time.
3. Efficient retrieval of specific portions of the data. Given the large data volumes involved, we must support fast sifting of stored data streams in response to queries that target a specific feature at a specific time for a given geospatial area. To accomplish this, we must account for the spatial and temporal characteristics of the data during the storage process and, in turn, use this metadata for efficient retrievals.
4. Fast detection of non-matching queries. Query parameters are often adjusted based on results from past queries. To support fine-tuning of queries, we must detect accurately and efficiently when no data match a specified query.
5. Range query support. We must be able to support range queries over both the spatial and temporal dimensions while ensuring that support for such queries does not result in unacceptable overheads.
6. Dynamic indexing strategies. The system should allow its indexing functionality to be adaptively reconfigured to better service different usage patterns and reduce latencies.
7. Extensive data format support. There are vast amounts of data stored in established scientific storage formats. Our system must not require a particular input format, so that it remains useful to researchers who have already invested in an existing format, and it must be able to read and understand a variety of metadata without any loss of fidelity from conversion.
8. Efficient processing of stored data. Our system should not only facilitate data retrieval but also simplify launching distributed computations that process data using the system's indexing semantics as inputs.
9. Integrated support for visualization. The system should provide functionality for visualizing data in a number of ways to facilitate analysis.
10. Failure recovery. We must account for possible failures and data corruption at individual nodes. Recovery from failures must be fast and consistent.
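To make the retrieval challenge concrete: one common way to account for both the spatial and temporal dimensions at storage time is to derive a combined spatiotemporal key from each packet's metadata. The sketch below is our illustration, not Galileo's actual key format; it pairs a standard Geohash prefix (a coarse spatial bucket) with a day-granularity time bucket and the feature name, so that queries for a feature in a region and time window reduce to prefix lookups:

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=4):
    """Encode (lat, lon) as a standard Geohash string of the given precision."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    hashed, even, bits, ch = [], True, 0, 0
    while len(hashed) < precision:
        if even:  # even-numbered bits refine longitude
            mid = (lon_range[0] + lon_range[1]) / 2
            if lon >= mid:
                ch = (ch << 1) | 1
                lon_range[0] = mid
            else:
                ch <<= 1
                lon_range[1] = mid
        else:     # odd-numbered bits refine latitude
            mid = (lat_range[0] + lat_range[1]) / 2
            if lat >= mid:
                ch = (ch << 1) | 1
                lat_range[0] = mid
            else:
                ch <<= 1
                lat_range[1] = mid
        even = not even
        bits += 1
        if bits == 5:  # five bits per base-32 character
            hashed.append(_BASE32[ch])
            bits, ch = 0, 0
    return "".join(hashed)

def storage_key(lat, lon, epoch_ms, feature):
    """Compose a hypothetical spatiotemporal key: geohash / day bucket / feature."""
    day = epoch_ms // 86_400_000  # milliseconds per day
    return f"{geohash(lat, lon)}/{day}/{feature}"
```

Because nearby locations share Geohash prefixes and nearby times share day buckets, spatial and temporal range queries map naturally onto contiguous key ranges.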
This paper describes the design of a demonstrably high-throughput geospatial data storage system, Galileo. The storage framework is distributed and incrementally scalable, with the ability to assimilate new storage nodes as they become available. The storage subsystem organizes the storage and dispersion of data streams to support fast, efficient range queries targeting specific features across the spatial and temporal dimensions. A dynamic indexing scheme helps the system respond to differing load conditions, and a flexible framework allows a number of observational data formats to be read and understood by the system. To withstand failures and recover from data corruption of specific blocks, the system relies on replication. Most importantly, our benchmarks demonstrate the feasibility of designing high-throughput data storage from commodity nodes while accounting for differences in the capabilities of these nodes. Leveraging heterogeneity in the available nodes is particularly useful in cloud settings, where newer nodes tend to have better storage and processing capabilities.
In the following section, the architecture of Galileo will be discussed, including an overview of how data is stored to disk, the network layout, and how data is positioned and replicated within the system. Next, an overview of the system's scientific data format support infrastructure is detailed in Section 3, including information about the runtime plugin architecture and the implementation of NetCDF format support. In Section 4, the dataset and query system will be described, followed by details of Galileo's built-in distributed computation API in Section 5. Section 6 explores using the system for client-side data visualization. Section 7 presents benchmarks of our system's capabilities, and Section 8 provides a brief survey of related technologies in the field. Section 9 reports conclusions from our research and discusses the future direction of the project.
Since the initial publication of Galileo: A Framework for Distributed Storage of High-Throughput Data Streams [1], we have added several key extensions for this special issue. Updates to the system architecture are detailed, including block versioning, compression support, and cross-group replication. We also added support for launching distributed computations on the data stored in Galileo, streaming visualization, and converting other storage formats to our metadata format through a plugin architecture. Finally, a number of new benchmarks have been added, including a comparison with Hadoop and timing information for visualization, computation launching, graph reorientation, and metadata conversion.
Section snippets
System architecture
Galileo runs as a computation on the Granules Runtime for Cloud Computing [2]. Granules is an ideal platform to build upon because it provides a basis for streaming communication between nodes in the system and for incoming data streams as well. As data enters the system, it can be sifted and pre-processed with the Granules runtime and then stored in a distributed manner across multiple machines with Galileo. When accessing data, users have the option of pushing their computations out to
Scientific data format support
While the scientific community has invested considerable time and effort in collecting and storing vast amounts of data, there has also been a great deal of investment in the formats used for storing these volumes of data. The storage formats in question could include simple comma-separated plain text files, relational database management systems, proprietary storage formats, distributed storage, or multi-dimensional arrays stored as binary files. This diversity in storage formats is largely
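A plugin-based reader of the kind described above can be sketched as a small dispatch interface. The class and method names below are hypothetical, and the CSV plugin is a toy stand-in for a real format handler such as one for NetCDF; each plugin's job is to convert records from one on-disk format into a common metadata representation without loss of fidelity:

```python
from abc import ABC, abstractmethod

class FormatPlugin(ABC):
    """Hypothetical plugin interface: one implementation per storage format."""

    @abstractmethod
    def handles(self, filename: str) -> bool:
        """Return True if this plugin can read the given file."""

    @abstractmethod
    def extract_metadata(self, raw: bytes) -> dict:
        """Convert a raw record into the common metadata representation."""

class CsvPlugin(FormatPlugin):
    """Toy plugin for 'feature,value,lat,lon,timestamp' CSV lines."""

    def handles(self, filename):
        return filename.endswith(".csv")

    def extract_metadata(self, raw):
        feature, value, lat, lon, ts = raw.decode().strip().split(",")
        return {"feature": feature, "value": float(value),
                "lat": float(lat), "lon": float(lon), "timestamp": int(ts)}

def convert(plugins, filename, raw):
    """Dispatch to the first registered plugin that claims the file."""
    for plugin in plugins:
        if plugin.handles(filename):
            return plugin.extract_metadata(raw)
    raise ValueError(f"no plugin registered for {filename}")
```

New formats can then be supported at runtime by registering another `FormatPlugin` implementation, without changes to the storage system itself.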
Information retrieval: datasets
Galileo’s information retrieval process is different from traditional databases or key-value stores. Instead of matching user-submitted queries against the data available in the system and returning the raw data, Galileo streams metadata of the matching blocks back to the requestor incrementally and our client-side API transparently collates these metadata blocks into a traversable dataset graph. This dataset is a subset of Galileo’s in-memory metadata graph, and describes the attributes of the
Distributed computation API
Galileo is tightly integrated with the Granules [2] cloud runtime, which supports expressing computations using the MapReduce paradigm or as directed, cyclic graphs. In this case, a computation simply refers to a distributed, executable task that is deployed and run across a cluster of machines. These computations could include operations such as data analysis, pre-processing, transformation, and visualization. Galileo runs within the Granules computation framework as well, underscoring the
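As a point of reference for the MapReduce style of computation described here, the following single-process sketch (our simplification; Granules would fan the map tasks out across the cluster's nodes) computes a per-region mean over stored observation blocks:

```python
from collections import defaultdict

def run_mapreduce(blocks, map_fn, reduce_fn):
    """Minimal, single-process sketch of the MapReduce pattern: map each
    block to (key, value) pairs, group by key, then reduce each group."""
    groups = defaultdict(list)
    for block in blocks:
        for key, value in map_fn(block):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: mean temperature per (hypothetical) geohash region.
blocks = [
    [{"geohash": "9xjq", "temp": 20.0}, {"geohash": "9xjq", "temp": 22.0}],
    [{"geohash": "dp3w", "temp": 10.0}],
]

def map_fn(block):
    for obs in block:
        yield obs["geohash"], obs["temp"]

def reduce_fn(key, temps):
    return sum(temps) / len(temps)

means = run_mapreduce(blocks, map_fn, reduce_fn)
# means == {"9xjq": 21.0, "dp3w": 10.0}
```

In the distributed setting, the indexing metadata determines which blocks each node's map tasks receive, so computation is pushed to where the data already resides.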
Client-side visualization
One common use case for scientific data stored in a distributed file system like Galileo is real-time visualization. These visualizations could include biological, medical, or meteorological data and can provide insights to researchers in these fields. Another related area includes Geographic Information Systems (GIS) which can involve visualizing organization-specific data with an object-oriented approach. GIS systems are often powered by a database backend which is used to provide detailed
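A typical client-side step when visualizing such point data is binning the retrieved observations into a fixed-size raster for rendering. The sketch below is our illustration and is unrelated to Galileo's actual visualization API; each grid cell averages the values that fall inside it:

```python
def rasterize(observations, width, height, bounds):
    """Bin (lat, lon, value) points into a width x height grid; cells with
    no observations are left as None. bounds = (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bounds
    sums = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for lat, lon, value in observations:
        x = min(int((lon - min_lon) / (max_lon - min_lon) * width), width - 1)
        y = min(int((lat - min_lat) / (max_lat - min_lat) * height), height - 1)
        sums[y][x] += value
        counts[y][x] += 1
    return [[sums[y][x] / counts[y][x] if counts[y][x] else None
             for x in range(width)] for y in range(height)]

# Two observations in a 2x2 grid covering latitudes/longitudes [0, 2).
grid = rasterize([(0.5, 0.5, 10.0), (1.5, 1.5, 20.0)], 2, 2, (0.0, 0.0, 2.0, 2.0))
```

Streaming partial results into such a grid lets a client progressively refine the rendered image as metadata and data arrive, rather than waiting for the full result set.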
Benchmarks
To test the capabilities of Galileo’s storage system, we ran benchmarks on a 48-node Xeon-based cluster of servers with a gigabit Ethernet interconnect. Each server in the cluster was equipped with 12 GB of RAM and a 300 GB, 15,000 RPM hard disk formatted with the ext4 file system. The benchmarks were run on the OpenJDK Java Runtime Environment, version 1.6.0_22.
One billion (1,000,000,000) random data blocks were generated for the experiments and dispersed across the 48 machines, each
Related work
Hadoop [30] and its accompanying file system, HDFS [18], share some common objectives with Galileo. Hadoop is an implementation of the MapReduce framework, and HDFS can be used to store and retrieve results from computations orchestrated by Hadoop. A primary difference between HDFS and Galileo is the role of metadata in the two systems; HDFS is designed for more general-purpose storage needs and cannot perform the indexing optimizations that Galileo's geospatial metadata makes possible. HDFS is also
Conclusions
A shared-nothing architecture allows incremental addition of nodes into the storage network with a proportional improvement in system throughputs. Efficient evaluation of queries is possible by the following.
1. Accounting for spatio-temporal relationships in the distributed storage of observational data streams.
2. Separating metadata from content.
3. Maintaining an efficient representation of the metadata graph in memory.
4. Distributed, concurrent evaluation of queries.
Continuous streaming of partial
References (39)
- M. Malensek, S. Pallickara, S. Pallickara, Galileo: a framework for distributed storage of high-throughput data streams.
- et al., Granules: a lightweight, streaming runtime for cloud computing with support for map-reduce.
- et al., MapReduce: simplified data processing on large clusters, Communications of the ACM (2008).
- et al., The Google file system.
- D. Hastorun, M. Jampani, G. Kakulapati, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, Dynamo: Amazon's highly...
- et al., Adaptive heterogeneous language support within a cloud runtime, Future Generation Computer Systems (2011).
- et al., Analyzing electroencephalograms using cloud computing techniques.
- et al., Handwriting recognition using a cloud runtime.
- et al., A demonstration of SciDB: a science-oriented DBMS, Proceedings of the VLDB Endowment (2009).
- Overview of SciDB: large scale array storage, processing and analysis.
- NetCDF: an interface for scientific data access, IEEE Computer Graphics and Applications.
- FITS—a flexible image transport system, Astronomy and Astrophysics Supplement Series.
- Bigtable: a distributed storage system for structured data, ACM Transactions on Computer Systems (TOCS).
- Chord: a scalable peer-to-peer lookup service for Internet applications, ACM SIGCOMM Computer Communication Review.
- Pastry: scalable, decentralized object location, and routing for large-scale peer-to-peer systems.
- The Hadoop distributed file system.
- Cassandra: a decentralized structured storage system, ACM SIGOPS Operating Systems Review.
Matthew Malensek is a graduate student in the Department of Computer Science at Colorado State University. His research interests include distributed systems and cloud computing.
Sangmi Lee Pallickara is a research scientist in the Department of Computer Science at Colorado State University. She received her Masters and Ph.D. degrees in Computer Science from Syracuse University and Florida State University, respectively. Her research interests are in the area of large-scale scientific data management, data mining, scientific metadata, and data-intensive computing.
Shrideep Pallickara is an Assistant Professor in the Department of Computer Science at Colorado State University. He received his Masters and Ph.D. degrees from Syracuse University. His research interests are in the area of large-scale distributed systems, specifically cloud computing and streaming.