Distributed frameworks and parallel algorithms for processing large-scale geographic data
Introduction
Integrating parallel and distributed computer programs into a framework that can be easily used by applied scientists is a challenging problem. Such a framework has to enable easy access to both computationally complex operations and high-performance computing technologies, as well as provide a means for defining the appropriate data set for the operation. This paper presents an overview of our work over recent years on applying high-performance parallel and distributed computing to geographical information systems (GIS) and decision support systems that utilise geospatial data. Our main research interest is the development of metacomputing (or grid computing) systems that enable cooperative use of networked data storage and processing resources (including parallel resources) for large-scale data-intensive applications. Much of our work has addressed the issue of embedding high-performance parallel computing systems and applications in this kind of distributed computing framework. We make this transparent to the user by providing standard interfaces to distributed processing services that hide the details of the underlying implementation of the service.
We have focussed on GIS as an example application domain since it provides very large data sets that can benefit from parallel and distributed computing. For example, modern satellites produce on the order of 1 Terabyte of data per day, and there are many challenging problems in the storage, access and processing of such data. The enormous volume and variety of geospatial data, metadata, data repositories and application programs also raise issues of interoperability and standards for interfacing to data and processing services, all of which are general problems in grid computing. Unlike a number of other fields, the GIS community has made significant advances in defining such standards. GIS also has a wide range of users and applications in many important areas, including oil and mineral exploration, forestry and agriculture, environmental monitoring and planning, government services, emergency services and defence.
Parallel techniques can be applied to geospatial data in a number of ways. Spatial data such as satellite imagery or raster-based map data lends itself well to a geometric decomposition among processors, and many applications will give good parallel speedups as long as the image size is large enough. For some applications on very large data, parallel processing provides the potential for superlinear speedup, by avoiding the costly overhead of paging to disk that may occur on a mono-processor machine with much less memory than a parallel computer. However, with the enormous advances in processing speed and memory capacity of PCs in the past few years, the advantages of parallel computing for some applications are perhaps not as compelling as they once were.
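The geometric decomposition described above can be sketched as follows, assuming a simple 3×3 mean filter as the per-pixel operation; the function names and the use of Python's multiprocessing are illustrative only, not a description of our actual implementation:

```python
import numpy as np
from multiprocessing import Pool

def smooth_block(block):
    """Apply a 3x3 mean filter to one horizontal strip (illustrative kernel)."""
    padded = np.pad(block, 1, mode="edge")
    out = np.zeros(block.shape, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out += padded[1 + dr : 1 + dr + block.shape[0],
                          1 + dc : 1 + dc + block.shape[1]]
    return out / 9.0

def parallel_smooth(image, nworkers=4):
    # Geometric (row-strip) decomposition: each worker processes one
    # contiguous strip of the raster independently.
    strips = np.array_split(image, nworkers, axis=0)
    with Pool(nworkers) as pool:
        return np.vstack(pool.map(smooth_block, strips))
```

Note that in this sketch, pixels on strip boundaries are filtered without their true neighbours from the adjacent strip; a production implementation would exchange one-row halo regions between processors before filtering.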
Nevertheless, data sets available from some areas such as satellite imaging and aerial reconnaissance continue to increase in size, from improved spatial resolution, an increase in the number of frequency channels used (modern hyperspectral satellites may have hundreds of channels), and the ever-increasing sequence of images over time. There are still important opportunities to exploit parallel computing techniques in analyzing such large multi-dimensional data sets, as discussed in Section 3.
Analyzing this data might involve some interactive experiments in deciding upon spatial coordinate transformations, regions of interest, choice of spectral channels, and the image processing required. Subsequently a batch request might be made to a parallel computer (or perhaps a network of workstations or a computational grid) to extract an appropriate time sequence from the archive. On modern computer systems it is often not worthwhile spatially or geometrically decomposing the data for a single image across processors, however it may be very useful to allocate the processing of different images to different processors.
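Allocating the processing of whole images to different processors is a simple task farm. A minimal sketch, with a hypothetical per-scene statistic standing in for real image analysis, might look like:

```python
from concurrent.futures import ProcessPoolExecutor

def process_scene(scene):
    """Hypothetical per-image analysis: here, just the mean pixel value."""
    return scene["id"], sum(scene["pixels"]) / len(scene["pixels"])

def farm_scenes(scenes, nworkers=4):
    # Task-farm decomposition: each unit of work is a whole image, so no
    # boundary (halo) exchange between processors is required.
    with ProcessPoolExecutor(max_workers=nworkers) as ex:
        return dict(ex.map(process_scene, scenes))
```

Because the images are processed independently, this scheme scales trivially to a network of workstations or a computational grid, with each node pulling whole images from the archive.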
Not all spatial data is raster data; many useful data sets are stored as lists of objects such as spatial points for locations, vectors for roads or pathways, or other forms of data that can be referenced spatially. The nature of these lists means that they are less obviously allocated for processing among processors in a load-balanced fashion. Objects might be allocated in a round-robin fashion, which will generally provide a reasonable load balance but may entail an increased communications overhead unless the processing algorithms involve no spatial dependencies. An example that we have investigated is assimilating simple observational data into a weather forecasting system (see Section 4.1). The observed data from field stations, weather ships, buoys or other mobile devices can be treated at one level as an independent set of data that must be processed separately and which can therefore be randomly allocated among processors. However, if such data must be incorporated into a raster data set that is already allocated among processors spatially or according to some geometric scheme, then a load imbalance may occur. The use of irregularly spaced observational data generally requires interpolation onto a regular grid structure, such as the spatial grid used for weather forecasting simulations or the pixels in a satellite image. This can be very computationally demanding, and we have explored several different parallel implementations of spatial data interpolation, which are discussed in Section 4.
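Round-robin allocation combines naturally with interpolation schemes whose contributions are additive over observations. As a sketch, inverse-distance weighting (a common interpolation choice, not necessarily the scheme used in Section 4) can be parallelised by giving each processor every n-th observation and summing the partial weighted sums; the serial loop over shares below stands in for a parallel map, and the `eps` guard against division by zero is a hypothetical detail:

```python
import numpy as np

def idw_partial(obs, grid_x, grid_y, power=2.0, eps=1e-12):
    """Partial inverse-distance-weighted sums for one processor's observations."""
    num = np.zeros((len(grid_y), len(grid_x)))
    den = np.zeros_like(num)
    for x, y, value in obs:
        d2 = (grid_x[None, :] - x) ** 2 + (grid_y[:, None] - y) ** 2
        w = 1.0 / (d2 ** (power / 2.0) + eps)  # eps avoids division by zero
        num += w * value
        den += w
    return num, den

def parallel_idw(observations, grid_x, grid_y, nworkers=4):
    # Round-robin allocation: observation i goes to processor i mod nworkers.
    shares = [observations[i::nworkers] for i in range(nworkers)]
    # Serial stand-in for a parallel map over processors:
    partials = [idw_partial(s, grid_x, grid_y) for s in shares]
    num = sum(p[0] for p in partials)
    den = sum(p[1] for p in partials)
    return num / den
```

The key property is that the numerator and denominator sums are associative over observations, so each processor's partial grids can be combined with a single reduction, regardless of how the observations were distributed.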
Another type of GIS data is the metadata, which may or may not be spatial in nature. Metadata may be textual or other properties that are directly related to a spatial point or vector, but may also be associated with a spatial region or with some other subset of the bulk data. The processing implications for such data are similar to those for other list-based data with the added complication that metadata may relate to many other elements of list or raster data and associated parallel load balancing is therefore difficult to predict.
Data mining is a technique that is gaining increasing attention from businesses. The techniques generally involve carrying out correlation analyses on customer or product sales data, and are particularly valuable when applied to spatially organised data, where they can help in planning and decision support for sales organisations and even product development. Generally, customer or sales data will be organised primarily as lists of data objects with associated spatial properties. Processing this data mostly lends itself to independent decomposition into sets of tasks for processors, possibly after presorting objects by spatial coordinates. This is a promising area for development and exploitation of new and existing data mining algorithms [64].
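Presorting by spatial coordinates can be as simple as binning records into grid cells, so that whole cells rather than individual records become the units of work and spatially local correlations stay on one processor. The record schema here is hypothetical:

```python
from collections import defaultdict

def bin_by_cell(records, cell_size=1.0):
    """Pre-sort spatially referenced records into grid cells.

    Hypothetical schema: each record is a tuple (x, y, payload). Whole
    cells can then be assigned to processors as independent tasks.
    """
    cells = defaultdict(list)
    for x, y, payload in records:
        cells[(int(x // cell_size), int(y // cell_size))].append(payload)
    return cells
```

The resulting cells can then be farmed out to processors independently, though skewed customer distributions may still require a load-balancing step that merges small cells or splits dense ones.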
Many GIS applications are interactive, real-time or time-critical systems. For example, spatial data and spatial information systems play a key role in many time-critical decision support systems, including emergency services response and military command and control systems. Parallel computing can be used to reduce solution times for computationally intensive components of these systems, and to help gather and process the large amounts of data from many different sources that are typically used in such systems. For example, environmental studies may require the analysis of data from multiple satellites, aerial photography, vector data showing land ownership boundaries, and other spatially referenced data on land tenure, land use, soil and vegetation types and species distributions. This data is typically owned by many different organisations, stored on different servers in different physical locations. It is often not useful or feasible to take local copies of all the data sets, since they may be too large and/or dynamically updated.
There has been substantial progress in the past few years on the concept of a spatial data infrastructure that would enable users of decision support systems or GIS applications to easily search and access data from distributed online data archives. This work has mainly involved the development of standard data formats and metadata models, as well as more recent efforts by the U.S. National Imagery and Mapping Association (NIMA) [48] and the OpenGIS Consortium [52] in developing standard interfaces for locating and accessing geospatial data from digital libraries [50], [53]. One of our projects involves developing software to provide these standard interfaces for accessing distributed archives of geospatial imagery data [10], [11].
We have also been investigating frameworks for supporting distributed data processing, as well as distributed data access (see Section 2). We believe these should be addressed in concert. Particularly for large data sets, it may be much more efficient for data to be processed remotely, preferably on a high-performance compute server with a high-speed network connection to the data servers providing input to the processing services, than for a user to download multiple large files over a lower-bandwidth connection for processing on their less powerful workstation. However, the user should not be burdened with having to make the decisions on finding the “best” data servers (the required data set may be replicated at multiple sites) or compute servers (there may be many different servers offering the same processing services). This is a difficult problem that should be handled by grid computing middleware, including resource brokers and job schedulers.
The OpenGIS Consortium also aims to provide standard interfaces for geospatial data processing services, in addition to its existing standards for searching and accessing spatial data. However, this is a much more difficult problem and there has been little progress thus far. NIMA has been developing the Geospatial Imagery eXploitation Services (GIXS) standard [49]; however, this is restricted to the processing of raster image data and still has several shortcomings (see Section 5).
In a distributed computing framework with well-defined standard interfaces to data access and processing services, an application can invoke a remote processing service without necessarily knowing how it is implemented. This allows faster, parallel implementations of services to be provided transparently. A distributed framework could also provide concurrency across multiple servers, either in concurrent (or pipelined) processing of different services as part of a larger task, or repeating the same computation for multiple input data (e.g. satellite images at different times or places).
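The idea of hiding the implementation behind a standard service interface can be sketched as follows; the class and method names are hypothetical and are not drawn from any OpenGIS or GIXS interface:

```python
class ProcessingService:
    """Abstract interface: callers depend only on the operation and its
    arguments, not on how (or where) the service is implemented."""
    def invoke(self, operation, data):
        raise NotImplementedError

class LocalSerialService(ProcessingService):
    """A serial reference implementation; a parallel or remote variant
    could be substituted without any change to client code."""
    def invoke(self, operation, data):
        return [operation(x) for x in data]

class Registry:
    """Hypothetical broker: the framework can rebind a service name to a
    faster implementation transparently."""
    def __init__(self):
        self._impl = {}
    def bind(self, name, service):
        self._impl[name] = service
    def invoke(self, name, operation, data):
        return self._impl[name].invoke(operation, data)
```

Because clients invoke services only by name through the registry, the framework can rebind a name to a parallel or remote implementation without any client-side change, which is precisely the transparency the standard interfaces are intended to provide.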
In the past few years there has been significant progress in the area of metacomputing, or grid computing, which aims to provide coordinated access to computational resources distributed over wide-area networks. Our work is aiming to develop high-level metacomputing middleware that will make grid computing easier to use, particularly for GIS developers and users. We envision a system whereby environmental scientists or value-adders will be able to utilise a toolset as part of a metacomputing/grid system to prepare derived data products such as the rainfall/vegetation correlations described in Section 4.2. End users such as land managers will be able to access these products, which may be produced regularly in a batch procedure or even requested interactively from a low-bandwidth Web browser platform such as a personal computer. Web protocols will be used to specify a complex service request and the system will respond by scheduling parallel or high-performance computing (HPC) and data storage resources to prepare the desired product prior to delivery of the condensed result in a form suitable for decision support. HPC resources will be interconnected with high-performance networks, whereas the end user need not be. In summary the end user will be able to invoke actions at a distance on remote data.
In this paper we present some of our work on applying high-performance distributed and parallel processing to applications that use large geospatial data sets. The paper is structured as follows. Section 2 provides an overview of our work on metacomputing and grid computing, particularly the design and development of distributed computing frameworks for supporting GIS services and applications, and describes how this type of framework allows transparent access to parallel implementations of these services and applications. Some examples of parallel GIS applications are given in Sections 3 (satellite image processing) and 4 (parallel spatial data interpolation), along with some discussion of how they can be embedded in a distributed computing framework. Section 3 describes some image processing applications on large satellite images that can be parallelised using standard geometric decomposition techniques. These include image filtering, georectification and image classification. For the case of image classification, we give an example of a Java applet that could be used to provide a front-end user interface to a remote high-performance compute server, which may be a parallel machine or a computational grid. Section 4 presents two different approaches to the problem of developing parallel algorithms for spatial data interpolation, which involves mapping irregularly spaced data points onto a regular grid. We describe how parallel implementations of this computationally intensive application can be utilised within a larger application, such as weather forecasting or rainfall analysis and prediction, and how this can be done using the DISCWorld metacomputing/grid framework.
The remainder of the paper discusses software architectures for distributed computing frameworks to support distributed GIS applications. Section 5 outlines some work on implementing a distributed framework for geospatial image analysis that is based on the proposed GIXS standard from NIMA. We discuss several key issues, including mechanisms for specifying the components of the distributed application and their interaction, the exploitation of concurrency in the distributed framework, and an analysis of the proposed interface standards. We also describe an example application for target detection. In Section 6, we discuss our vision for the application of grid computing technologies to distributed geographical information systems and spatial data infrastructure. We present an overview of evolving grid protocols, services and technologies to support access and processing of large distributed data sets, and outline some of our work on developing prototypes of these virtual data grids for spatial data. We also discuss the important issue of standards for spatial data grids, which must integrate general standards for grid computing with standard interfaces for geospatial data being developed by the OpenGIS Consortium. We conclude with a summary of our work and a discussion of what we see as some of the future issues in developing high-performance parallel and distributed applications for processing large-scale geospatial data.
Distributed computing frameworks and computational grids
In this section we describe some general issues relating to grid frameworks, focusing first on the parallelism and concurrency needs of a middleware system, then on specific GIS issues, and finally on embedded parallelism.
A grid computing or metacomputing environment allows applications to utilise multiple computational resources that may be distributed over a wide-area network [19]. Each resource is typically a high-performance computer or a cluster of workstations, however the term encompasses other
Satellite image processing applications
In this section we describe some image processing and analysis applications that we have developed for use as on-demand services that can provide input for GIS applications for various value-adders and end users. All these applications require significant computation, and can be computed in parallel to provide fast results, which is particularly useful for interactive or real-time systems. These applications were implemented for satellite image data collected from the Japanese GMS-5 satellite,
Parallel spatial data interpolation
GIS applications commonly use a regular 2D or 3D grid, such as pixels in a satellite image, a latitude–longitude coordinate grid on the surface of the earth, or the 3D grid of points modelling the atmosphere and the earth’s surface for weather forecasting. In some situations, data required as input for an application, or collected for ground truthing a simulation, may not be co-located with grid coordinates used in the computation, and may be highly irregular in its spatial distribution. In
Distributed workflow-based defence applications
In this section we describe a spatial data processing application prototype we developed with the Australian Defence Science and Technology Organisation (DSTO) to demonstrate a distributed computing framework embedding specialist processing components. The application involves the detection of small targets using a set of heuristics and algorithmic components used in the defence community. Our goal was to implement a workflow system that would allow processing of archived or remote imagery
Data grids
Many computational grid projects require remote access to large amounts of data, which has led to the term “data grid” [1] being used to describe grid environments and services that are developed to support such applications. Many scientific data repositories are so large that it is not feasible for all the researchers that use them to have local copies of the data. For example, the latest earth observation satellites produce Terabytes of data per day [47]. Such large data sets will be
Summary and future issues
In this paper we have summarised projects that have spanned nearly a decade. During this time we have seen a transition in the maturing and deployment of parallel computing. In the early 1990s, massively parallel computing was still a relatively arcane branch of computing. It gained acceptance through the 1990s, and by the end of the 20th century we might justifiably claim that massive parallelism had been accepted into the mainstream of computing. Certainly there are now plenty of tools and
Acknowledgments
We thank S.J. Del Fabbro, C.J. Patten, K.E. Kerry Falkner, J.F. Hercus, K. Hutchens, K.D. Mason, J.A. Mathew, A.J. Silis and D. Webb who helped implement the application prototypes described in this paper, and K.J. Maciunas, F.A. Vaughan and A.L. Wendelborn for their input in developing the grid computing concepts. Thanks also to K.P. Bryceson for useful discussions on data interpolation and mapping. The Distributed and High Performance Computing Group is a collaborative venture between the
References (68)
- et al., Interfacing to distributed active data archives, Future Generation Computer Systems (1999)
- et al., DISCWorld: an environment for service-based metacomputing, Future Generation Computer Systems (FGCS) (1999)
- The data grid: towards an architecture for the distributed management and analysis of large scientific datasets, Journal of Network and Computer Applications (2001)
- S.R.M. Barres, T. Kauranne, Spectral and multigrid spherical Helmholtz equation solvers on distributed memory parallel...
- R.S. Bell, A. Dickinson, The Meteorological Office Unified Model for data assimilation, climate modelling and NWP and...
- S. Border, The use of indicator Kriging as a biased estimator to discriminate between ore and waste, Applications of...
- K. Bryceson, M. Bryant, The GIS/rainfall connection, in: GIS User, No. 4, August 1993, pp....
- K.P. Bryceson, P. Wilson, M. Bryant, Daily rainfall estimation from satellite data using rules-based classification...
- H. Casanova, J. Dongarra, NetSolve: a network server for solving computational science problems, in: Proc....
- et al., ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers
- Implementation of a geospatial imagery digital library using Java and CORBA
- A virtual data grid for LIGO, Lecture Notes in Computer Science
- Parallel Computing Works!
1. Tel.: +61-8-8303-4949; fax: +61-8-8303-4366.