Distributed frameworks and parallel algorithms for processing large-scale geographic data
Introduction
Integrating parallel and distributed computer programs into a framework that can be easily used by applied scientists is a challenging problem. Such a framework has to enable easy access to both computationally complex operations and high-performance computing technologies, as well as provide a means for defining the appropriate data set for the operation. This paper presents an overview of our work over recent years on applying high-performance parallel and distributed computing to geographical information systems (GIS) and decision support systems that utilise geospatial data. Our main research interest is the development of metacomputing (or grid computing) systems that enable cooperative use of networked data storage and processing resources (including parallel resources) for large-scale data-intensive applications. Much of our work has addressed the issue of embedding high-performance parallel computing systems and applications in this kind of distributed computing framework. We make this transparent to the user by providing standard interfaces to distributed processing services that hide the details of the underlying implementation of the service.
We have focussed on GIS as an example application domain since it provides very large data sets that can benefit from parallel and distributed computing. For example, modern satellites produce on the order of 1 Terabyte of data per day, and there are many challenging problems in the storage, access and processing of such data. The enormous volume and variety of geospatial data, metadata, data repositories and application programs also raise issues of interoperability and standards for interfacing to data and processing services, all of which are general problems in grid computing. Unlike a number of other fields, the GIS community has made significant advances in defining such standards. GIS also has a wide range of users and applications in many important areas, including oil and mineral exploration, forestry and agriculture, environmental monitoring and planning, government services, emergency services and defence.
Parallel techniques can be applied to geospatial data in a number of ways. Spatial data such as satellite imagery or raster-based map data lends itself well to a geometric decomposition among processors, and many applications will give good parallel speedups as long as the image size is large enough. For some applications on very large data, parallel processing provides the potential for superlinear speedup, by avoiding the costly overhead of paging to disk that may occur on a mono-processor machine with much less memory than a parallel computer. However, with the enormous advances in processing speed and memory capacity of PCs in the past few years, the advantages of parallel computing for some applications are perhaps not as compelling as they once were.
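The geometric decomposition described above can be sketched as follows, assuming a simple 3×3 mean filter as the per-pixel operation; the function names and the use of Python's multiprocessing are illustrative only, not a description of our actual implementation:

```python
import numpy as np
from multiprocessing import Pool

def smooth_block(block):
    """Apply a 3x3 mean filter to one horizontal strip (illustrative kernel)."""
    padded = np.pad(block, 1, mode="edge")
    out = np.zeros(block.shape, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out += padded[1 + dr : 1 + dr + block.shape[0],
                          1 + dc : 1 + dc + block.shape[1]]
    return out / 9.0

def parallel_smooth(image, nworkers=4):
    # Geometric (row-strip) decomposition: each worker processes one
    # contiguous strip of the raster independently.
    strips = np.array_split(image, nworkers, axis=0)
    with Pool(nworkers) as pool:
        return np.vstack(pool.map(smooth_block, strips))
```

Note that in this sketch, pixels on strip boundaries are filtered without their true neighbours from the adjacent strip; a production implementation would exchange one-row halo regions between processors before filtering.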
Nevertheless, data sets available from some areas such as satellite imaging and aerial reconnaissance continue to increase in size, from improved spatial resolution, an increase in the number of frequency channels used (modern hyperspectral satellites may have hundreds of channels), and the ever-increasing sequence of images over time. There are still important opportunities to exploit parallel computing techniques in analyzing such large multi-dimensional data sets, as discussed in Section 3.
Analyzing this data might involve some interactive experiments in deciding upon spatial coordinate transformations, regions of interest, choice of spectral channels, and the image processing required. Subsequently a batch request might be made to a parallel computer (or perhaps a network of workstations or a computational grid) to extract an appropriate time sequence from the archive. On modern computer systems it is often not worthwhile spatially or geometrically decomposing the data for a single image across processors, however it may be very useful to allocate the processing of different images to different processors.
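Allocating the processing of whole images to different processors is a simple task farm. A minimal sketch, with a hypothetical per-scene statistic standing in for real image analysis, might look like:

```python
from concurrent.futures import ProcessPoolExecutor

def process_scene(scene):
    """Hypothetical per-image analysis: here, just the mean pixel value."""
    return scene["id"], sum(scene["pixels"]) / len(scene["pixels"])

def farm_scenes(scenes, nworkers=4):
    # Task-farm decomposition: each unit of work is a whole image, so no
    # boundary (halo) exchange between processors is required.
    with ProcessPoolExecutor(max_workers=nworkers) as ex:
        return dict(ex.map(process_scene, scenes))
```

Because the images are processed independently, this scheme scales trivially to a network of workstations or a computational grid, with each node pulling whole images from the archive.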
Not all spatial data is raster data; many useful data sets are stored as lists of objects such as spatial points for locations, vectors for roads or pathways, or other forms of data that can be referenced spatially. The nature of these lists means that they are less obviously allocated for processing among processors in a load-balanced fashion. Objects might be allocated in a round-robin fashion, which will generally provide a reasonable load balance but may entail an increased communications overhead unless the processing algorithms involve no spatial dependencies. An example that we have investigated is assimilating simple observational data into a weather forecasting system (see Section 4.1). The observed data from field stations, weather ships, buoys or other mobile devices can be treated at one level as an independent set of data that must be processed separately and which can therefore be randomly allocated among processors. However, if such data must be incorporated into a raster data set that is already allocated among processors spatially or according to some geometric scheme, then a load imbalance may occur. The use of irregularly spaced observational data generally requires interpolation onto a regular grid structure, such as the spatial grid used for weather forecasting simulations or the pixels in a satellite image. This can be very computationally demanding, and we have explored several different parallel implementations of spatial data interpolation, which are discussed in Section 4.
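Round-robin allocation combines naturally with interpolation schemes whose contributions are additive over observations. As a sketch, inverse-distance weighting (a common interpolation choice, not necessarily the scheme used in Section 4) can be parallelised by giving each processor every n-th observation and summing the partial weighted sums; the serial loop over shares below stands in for a parallel map, and the `eps` guard against division by zero is a hypothetical detail:

```python
import numpy as np

def idw_partial(obs, grid_x, grid_y, power=2.0, eps=1e-12):
    """Partial inverse-distance-weighted sums for one processor's observations."""
    num = np.zeros((len(grid_y), len(grid_x)))
    den = np.zeros_like(num)
    for x, y, value in obs:
        d2 = (grid_x[None, :] - x) ** 2 + (grid_y[:, None] - y) ** 2
        w = 1.0 / (d2 ** (power / 2.0) + eps)  # eps avoids division by zero
        num += w * value
        den += w
    return num, den

def parallel_idw(observations, grid_x, grid_y, nworkers=4):
    # Round-robin allocation: observation i goes to processor i mod nworkers.
    shares = [observations[i::nworkers] for i in range(nworkers)]
    # Serial stand-in for a parallel map over processors:
    partials = [idw_partial(s, grid_x, grid_y) for s in shares]
    num = sum(p[0] for p in partials)
    den = sum(p[1] for p in partials)
    return num / den
```

The key property is that the numerator and denominator sums are associative over observations, so each processor's partial grids can be combined with a single reduction, regardless of how the observations were distributed.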
Another type of GIS data is the metadata, which may or may not be spatial in nature. Metadata may be textual or other properties that are directly related to a spatial point or vector, but may also be associated with a spatial region or with some other subset of the bulk data. The processing implications for such data are similar to those for other list-based data with the added complication that metadata may relate to many other elements of list or raster data and associated parallel load balancing is therefore difficult to predict.
Data mining is a technique that is gaining increasing attention from businesses. The techniques generally involve carrying out correlation analyses on customer or product sales data, and are particularly valuable when applied to spatially organised data, where they can help in planning and decision support for sales organisations and even product development. Generally, customer or sales data will be organised primarily as lists of data objects with associated spatial properties. Processing this data mostly lends itself to independent decomposition into sets of tasks for processors, possibly after presorting objects by spatial coordinates. This is a promising area for development and exploitation of new and existing data mining algorithms [64].
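Presorting by spatial coordinates can be as simple as binning records into grid cells, so that whole cells rather than individual records become the units of work and spatially local correlations stay on one processor. The record schema here is hypothetical:

```python
from collections import defaultdict

def bin_by_cell(records, cell_size=1.0):
    """Pre-sort spatially referenced records into grid cells.

    Hypothetical schema: each record is a tuple (x, y, payload). Whole
    cells can then be assigned to processors as independent tasks.
    """
    cells = defaultdict(list)
    for x, y, payload in records:
        cells[(int(x // cell_size), int(y // cell_size))].append(payload)
    return cells
```

The resulting cells can then be farmed out to processors independently, though skewed customer distributions may still require a load-balancing step that merges small cells or splits dense ones.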
Many GIS applications are interactive, real-time or time-critical systems. For example, spatial data and spatial information systems play a key role in many time-critical decision support systems, including emergency services response and military command and control systems. Parallel computing can be used to reduce solution times for computationally intensive components of these systems, and to help gather and process the large amounts of data from many different sources that are typically used in such systems. For example, environmental studies may require the analysis of data from multiple satellites, aerial photography, vector data showing land ownership boundaries, and other spatially referenced data on land tenure, land use, soil and vegetation types and species distributions. This data is typically owned by many different organisations, stored on different servers in different physical locations. It is often not useful or feasible to take local copies of all the data sets, since they may be too large and/or dynamically updated.
There has been substantial progress in the past few years on the concept of a spatial data infrastructure that would enable users of decision support systems or GIS applications to easily search and access data from distributed online data archives. This work has mainly involved the development of standard data formats and metadata models, as well as more recent efforts by the U.S. National Imagery and Mapping Association (NIMA) [48] and the OpenGIS Consortium [52] in developing standard interfaces for locating and accessing geospatial data from digital libraries [50], [53]. One of our projects involves developing software to provide these standard interfaces for accessing distributed archives of geospatial imagery data [10], [11].
We have also been investigating frameworks for supporting distributed data processing, as well as distributed data access (see Section 2). We believe these should be addressed in concert. Particularly for large data sets, it may be much more efficient for data to be processed remotely, preferably on a high-performance compute server with a high-speed network connection to the data servers providing input to the processing services, than for a user to download multiple large files over a lower-bandwidth connection for processing on their less powerful workstation. However, the user should not be burdened with having to make the decisions on finding the “best” data servers (the required data set may be replicated at multiple sites) or compute servers (there may be many different servers offering the same processing services). This is a difficult problem that should be handled by grid computing middleware, including resource brokers and job schedulers.
The OpenGIS Consortium also aims to provide standard interfaces for geospatial data processing services, in addition to its existing standards for searching and accessing spatial data. However, this is a much more difficult problem and there has been little progress thus far. NIMA has been developing the Geospatial Imagery eXploitation Services (GIXS) standard [49]; however, this is restricted to the processing of raster image data and still has several shortcomings (see Section 5).
In a distributed computing framework with well-defined standard interfaces to data access and processing services, an application can invoke a remote processing service without necessarily knowing how it is implemented. This allows faster, parallel implementations of services to be provided transparently. A distributed framework could also provide concurrency across multiple servers, either in concurrent (or pipelined) processing of different services as part of a larger task, or repeating the same computation for multiple input data (e.g. satellite images at different times or places).
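The idea of hiding the implementation behind a standard service interface can be sketched as follows; the class and method names are hypothetical and are not drawn from any OpenGIS or GIXS interface:

```python
class ProcessingService:
    """Abstract interface: callers depend only on the operation and its
    arguments, not on how (or where) the service is implemented."""
    def invoke(self, operation, data):
        raise NotImplementedError

class LocalSerialService(ProcessingService):
    """A serial reference implementation; a parallel or remote variant
    could be substituted without any change to client code."""
    def invoke(self, operation, data):
        return [operation(x) for x in data]

class Registry:
    """Hypothetical broker: the framework can rebind a service name to a
    faster implementation transparently."""
    def __init__(self):
        self._impl = {}
    def bind(self, name, service):
        self._impl[name] = service
    def invoke(self, name, operation, data):
        return self._impl[name].invoke(operation, data)
```

Because clients invoke services only by name through the registry, the framework can rebind a name to a parallel or remote implementation without any client-side change, which is precisely the transparency the standard interfaces are intended to provide.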
In the past few years there has been significant progress in the area of metacomputing, or grid computing, which aims to provide coordinated access to computational resources distributed over wide-area networks. Our work is aiming to develop high-level metacomputing middleware that will make grid computing easier to use, particularly for GIS developers and users. We envision a system whereby environmental scientists or value-adders will be able to utilise a toolset as part of a metacomputing/grid system to prepare derived data products such as the rainfall/vegetation correlations described in Section 4.2. End users such as land managers will be able to access these products, which may be produced regularly in a batch procedure or even requested interactively from a low-bandwidth Web browser platform such as a personal computer. Web protocols will be used to specify a complex service request and the system will respond by scheduling parallel or high-performance computing (HPC) and data storage resources to prepare the desired product prior to delivery of the condensed result in a form suitable for decision support. HPC resources will be interconnected with high-performance networks, whereas the end user need not be. In summary the end user will be able to invoke actions at a distance on remote data.
In this paper we present some of our work on applying high-performance distributed and parallel processing to applications that use large geospatial data sets. The paper is structured as follows. Section 2 provides an overview of our work on metacomputing and grid computing, particularly the design and development of distributed computing frameworks for supporting GIS services and applications, and describes how this type of framework allows transparent access to parallel implementations of these services and applications. Some examples of parallel GIS applications are given in Sections 3 (satellite image processing) and 4 (parallel spatial data interpolation), along with some discussion of how they can be embedded in a distributed computing framework. Section 3 describes some image processing applications on large satellite images that can be parallelised using standard geometric decomposition techniques. These include image filtering, georectification and image classification. For the case of image classification, we give an example of a Java applet that could be used to provide a front-end user interface to a remote high-performance compute server, which may be a parallel machine or a computational grid. Section 4 presents two different approaches to the problem of developing parallel algorithms for spatial data interpolation, which involves mapping irregularly spaced data points onto a regular grid. We describe how parallel implementations of this computationally intensive application can be utilised within a larger application, such as weather forecasting or rainfall analysis and prediction, and how this can be done using the DISCWorld metacomputing/grid framework.
The remainder of the paper discusses software architectures for distributed computing frameworks to support distributed GIS applications. Section 5 outlines some work on implementing a distributed framework for geospatial image analysis that is based on the proposed GIXS standard from NIMA. We discuss several key issues, including mechanisms for specifying the components of the distributed application and their interaction, the exploitation of concurrency in the distributed framework, and an analysis of the proposed interface standards. We also describe an example application for target detection. In Section 6, we discuss our vision for the application of grid computing technologies to distributed geographical information systems and spatial data infrastructure. We present an overview of evolving grid protocols, services and technologies to support access and processing of large distributed data sets, and outline some of our work on developing prototypes of these virtual data grids for spatial data. We also discuss the important issue of standards for spatial data grids, which must integrate general standards for grid computing with standard interfaces for geospatial data being developed by the OpenGIS Consortium. We conclude with a summary of our work and a discussion of what we see as some of the future issues in developing high-performance parallel and distributed applications for processing large-scale geospatial data.
Distributed computing frameworks and computational grids
In this section we describe some general issues relating to grid frameworks, focusing first on the parallelism and concurrency needs of a middleware system, then on specific GIS issues, and finally on embedded parallelism.
A grid computing or metacomputing environment allows applications to utilise multiple computational resources that may be distributed over a wide-area network [19]. Each resource is typically a high-performance computer or a cluster of workstations, however the term encompasses other
Satellite image processing applications
In this section we describe some image processing and analysis applications that we have developed for use as on-demand services that can provide input for GIS applications for various value-adders and end users. All these applications require significant computation, and can be computed in parallel to provide fast results, which is particularly useful for interactive or real-time systems. These applications were implemented for satellite image data collected from the Japanese GMS-5 satellite,
Parallel spatial data interpolation
GIS applications commonly use a regular 2D or 3D grid, such as pixels in a satellite image, a latitude–longitude coordinate grid on the surface of the earth, or the 3D grid of points modelling the atmosphere and the earth’s surface for weather forecasting. In some situations, data required as input for an application, or collected for ground truthing a simulation, may not be co-located with grid coordinates used in the computation, and may be highly irregular in its spatial distribution. In
Distributed workflow-based defence applications
In this section we describe a spatial data processing application prototype we developed with the Australian Defence Science and Technology Organisation (DSTO) to demonstrate a distributed computing framework embedding specialist processing components. The application involves the detection of small targets using a set of heuristics and algorithmic components used in the defence community. Our goal was to implement a workflow system that would allow processing of archived or remote imagery
Data grids
Many computational grid projects require remote access to large amounts of data, which has led to the term “data grid” [1] being used to describe grid environments and services that are developed to support such applications. Many scientific data repositories are so large that it is not feasible for all the researchers that use them to have local copies of the data. For example, the latest earth observation satellites produce Terabytes of data per day [47]. Such large data sets will be
Summary and future issues
In this paper we have summarised projects that have spanned nearly a decade. During this time we have seen a transition in the maturing and deployment of parallel computing. In the early 1990s, massively parallel computing was still a relatively arcane branch of computing. It gained acceptance through the 1990s, and by the end of the 20th century we might justifiably claim that massive parallelism had been accepted into the mainstream of computing. Certainly there are now plenty of tools and
Acknowledgments
We thank S.J. Del Fabbro, C.J. Patten, K.E. Kerry Falkner, J.F. Hercus, K. Hutchens, K.D. Mason, J.A. Mathew, A.J. Silis and D. Webb who helped implement the application prototypes described in this paper, and K.J. Maciunas, F.A. Vaughan and A.L. Wendelborn for their input in developing the grid computing concepts. Thanks also to K.P. Bryceson for useful discussions on data interpolation and mapping. The Distributed and High Performance Computing Group is a collaborative venture between the
References (68)
- et al., Interfacing to distributed active data archives, Future Generation Computer Systems (1999)
- et al., DISCWorld: an environment for service-based metacomputing, Future Generation Computer Systems (FGCS) (1999)
- The data grid: towards an architecture for the distributed management and analysis of large scientific datasets, Journal of Network and Computer Applications (2001)
- S.R.M. Barres, T. Kauranne, Spectral and multigrid spherical Helmholtz equation solvers on distributed memory parallel...
- R.S. Bell, A. Dickinson, The Meteorological Office Unified Model for data assimilation, climate modelling and NWP and...
- S. Border, The use of indicator Kriging as a biased estimator to discriminate between ore and waste, Applications of...
- K. Bryceson, M. Bryant, The GIS/rainfall connection, in: GIS User, No. 4, August 1993, pp....
- K.P. Bryceson, P. Wilson, M. Bryant, Daily rainfall estimation from satellite data using rules-based classification...
- H. Casanova, J. Dongarra, NetSolve: a network server for solving computational science problems, in: Proc....
- et al., ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers
- Implementation of a geospatial imagery digital library using Java and CORBA
- A virtual data grid for LIGO, Lecture Notes in Computer Science
- Parallel Computing Works!
1. Tel.: +61-8-8303-4949; fax: +61-8-8303-4366.