Computers & Geosciences

Volume 37, Issue 2, February 2011, Pages 165-176
Optimizing grid computing configuration and scheduling for geospatial analysis: An example with interpolating DEM

https://doi.org/10.1016/j.cageo.2010.05.015

Abstract

Many geographic analyses are time-consuming and do not scale well when large datasets are involved. For example, the interpolation of DEMs (digital elevation models) over large geographic areas can become a problem in practical applications, especially web applications such as terrain visualization, where a fast response is required and computational demands exceed the capacity of a traditional single processing unit performing serial processing. Therefore, high performance and parallel computing approaches, such as grid computing, were investigated to speed up geographic analysis algorithms such as DEM interpolation. The keys to grid computing are configuring an optimized grid computing platform for the geospatial analysis and optimally scheduling the geospatial tasks within that platform; however, little research has focused on either. Using DEM interpolation as an example, we report our systematic research on configuring and scheduling a high performance grid computing platform to improve the performance of geographic analyses, through a study of how the numbers of cores, processors, and grid nodes, different network connections, and concurrent requests impact the speedup of geospatial analyses. Condor, a grid middleware, is used to schedule the DEM interpolation tasks for the different grid configurations. A raster-based DEM of Kansas is used for a case study, and an inverse distance weighting (IDW) algorithm is used in the interpolation experiments.

Introduction

Many geographic problems pose significant computational challenges in response to multiple emerging needs (Yang and Raskin, 2009):

  • Large volumes of distributed data. For example, satellites collect terabytes to petabytes of geospatial data from space on a daily basis. In situ sensors and social activities are also accumulating data at a comparable pace.

  • Complex spatial analysis methods. These computationally intensive methods extend across a broad spectrum of spatial and temporal scales, and are now gaining widespread acceptance.

  • Rapid response times. Concurrent user accesses require web-based applications with fast access and rapid response times.

The interaction between the above factors further contributes to the challenge of processing geospatial data.

Fortunately, research in recent years has shown that grid computing can effectively address these computing demands in a distributed fashion (Armstrong et al., 2005, Yang et al., 2005). Grid computing is the integration of high-speed internet, high-performance computers, large-scale databases, sensors, remote devices, etc., to provide computing resource support for data- or computing-intensive applications (Foster and Karonis, 1998). With the continual decline in prices for computer hardware and networks, it has become practical for most laboratories with limited funding to deploy a grid computing platform. However, there is no reported research on how to utilize publicly available computing resources to configure a grid platform with the best performance. Geospatial analysis applications have special requirements that cannot be matched by generic grid computing platforms, because most geospatial analysis algorithms are not designed to leverage multiple CPUs and grid computing middleware has generally not been developed for geospatial applications. Therefore, there is an urgent need to investigate how geospatial analyses can leverage grid computing to improve performance, for example to address regional- to global-level high resolution data processing requirements (Liu et al., 2006). One of the most important issues is how to organize and configure a good grid computing platform for geospatial analyses and applications, and how to schedule the computing resources in the computing pool (Rahman et al., 2010). This paper uses digital elevation model (DEM) interpolation as an example to investigate how to configure and schedule a better grid computing platform to improve the performance of geographic analyses. It offers insights into a computing solution for geospatial analyses and provides guidance for developing middleware that better schedules jobs.

This paper addresses these challenges and designs a set of experiments to study the impact of different configurations on grid computing. The results can be adapted by GIS experts to improve overall grid platform performance for other geospatial applications, such as model simulation, by configuring an optimized computing pool. The research also intends to provide insights for IT experts developing middleware to improve grid computing support for scientific problems. We aim to answer the following questions for geospatial applications:

  • What is the comparable performance of homogeneous and heterogeneous grid nodes?

  • What are the best network configurations for a grid computing pool?

  • How should grid nodes be selected: multiprocessors with multi-core chips (two or more multi-core processors), a single multi-core processor (two or more cores on the same chip), multiprocessors with single-core chips, or a single processor?

  • What are the potential bottlenecks in multi-core technology and how to possibly avoid/solve them?

  • What is the impact of concurrent requests on the performance of a grid computing pool?

The experiments utilize different grid computing configurations to support DEM interpolation. During this process, the DEM domain is decomposed into subdomains in a way that balances the workload across the grid computing platform: a uniform grid decomposition partitions the entire DEM into several subdomains of equal size. A DEM of the state of Kansas, whose nearly rectangular boundary is well suited to grid decomposition, is selected as the study data. Since the purpose of this paper is not to evaluate the accuracy of interpolation results but to demonstrate the impact of different grid computing environment configurations on the DEM interpolation process, and a different interpolation method would not change the findings, the popular and easy-to-use IDW interpolation algorithm is applied to interpolate the DEM datasets. The Condor middleware is utilized to schedule the DEM interpolation tasks.
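The uniform decomposition and IDW steps described above can be sketched in Python. This is a minimal illustration under our own assumptions, not the authors' code: the function names and the NumPy-based vectorization are ours, and IDW here uses all sample points rather than a search neighborhood.

```python
import numpy as np

def decompose(bounds, nx, ny):
    """Uniform grid decomposition: split a rectangular domain
    (xmin, ymin, xmax, ymax) into nx * ny equal-sized subdomains."""
    xmin, ymin, xmax, ymax = bounds
    xs = np.linspace(xmin, xmax, nx + 1)
    ys = np.linspace(ymin, ymax, ny + 1)
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(ny) for i in range(nx)]

def idw(sample_xy, sample_z, grid_xy, power=2.0):
    """Inverse distance weighting: each grid point's value is the
    weighted mean of sample values, with weights 1 / distance**power."""
    # Pairwise distances: (n_grid, n_samples)
    d = np.linalg.norm(grid_xy[:, None, :] - sample_xy[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # avoid division by zero at sample locations
    w = 1.0 / d ** power
    return (w @ sample_z) / w.sum(axis=1)
```

In a grid computing setting, each tuple returned by `decompose` would define one independent interpolation job, so the tiles can be interpolated in parallel and mosaicked afterwards.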

Section 2 introduces grid computing architecture, the benefits and challenges of high performance technology, grid computing platforms, and methodologies used for fast interpolation of high resolution DEMs. Section 3 introduces the DEM data and data processing used in this paper. Section 4 reports and discusses our research and results on the comparative performance of grid computing environments with different numbers of CPU cores and CPUs, different network connections among computers, and concurrent requests. Section 5 concludes, suggests solutions for time-consuming geographic analyses using grid computing, and discusses future research directions.

Grid computing architecture

Grid computing supports distributed user requests and achieves optimal resource scheduling by implementing distributed collaborative mechanisms. A grid computing platform includes three layers: (a) a resource layer, which includes a variety of data resources, computing resources, devices, and other resources connected through computer networks; (b) to achieve resource sharing in a distributed network environment, intelligent management mechanisms are required for resource discovery and dynamic

Data and data processing

Kansas was selected as a study region due to its near-rectangular boundary, which makes it well suited to grid decomposition. The Kansas DEM was downloaded from the National Map Seamless Server (NMSS). NMSS provides 1/9 arc-sec high resolution data, 1/3 arc-sec USGS DEMs, 1 arc-sec USGS DEMs, 2 arc-sec USGS DEMs, and 3 arc-sec USGS DEMs in the ArcGrid format; the 2 arc-sec DEMs are used only in Alaska and the 3 arc-sec DEMs are used only to fill in values over some large

Grid computing platform

The Joint Center of Intelligent Spatial Computing (CISC) at George Mason University (GMU) hosts a grid-based computing pool as illustrated in Fig. 4. In this computing pool, Condor is used as a middleware to dispatch and execute the jobs.
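Dispatching one interpolation job per subdomain through Condor can be described with a submit file along the following lines. This is a hedged sketch rather than the authors' actual configuration: the executable name and its argument are hypothetical, while `universe`, `$(Process)`, and `queue` are standard Condor submit-description constructs.

```
universe   = vanilla
executable = idw_interpolate        # hypothetical worker program
arguments  = --tile $(Process)      # $(Process) indexes the subdomain
output     = tile_$(Process).out
error      = tile_$(Process).err
log        = idw.log
queue 16                            # one job per subdomain, e.g. a 4 x 4 decomposition
```

Condor then matches each queued job to an idle node in the pool, which is how the experiments can compare scheduling behavior across different pool configurations.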

The efficiency of the grid platform was tested in different configurations, from homogeneous and heterogeneous grid nodes and multi-core architectures to the computer networks. For the homogeneous grid vs. heterogeneous grid and communication latency experiments,

Conclusion and discussion

This paper reports our research on how the configuration and scheduling of grid computing will impact the efficiency of a grid platform for spatial analysis using the DEM interpolation as an example. The study uses the publicly available 7.5 minute USGS DEM data of Kansas, and the inverse distance weighting (IDW) algorithm for DEM interpolation. Different domain decompositions of the Kansas DEM are used to test the grid computing performance. Five sets of experiments are conducted to investigate

Acknowledgements

The research reported is supported by FGDC CAP program grants (08HQPA0002 and G09AC00103) and a 2007 NASA grant (NNX07AD99G).

References (31)

  • Chai, L., Gao, Q., Dhabaleswar, K., 2007. Understanding the impact of multi-core architecture in cluster computing: a...
  • Cramer, B.E., et al., 1999. An evaluation of domain decomposition strategies for parallel spatial interpolation of surfaces. Geographical Analysis.
  • Denning, P.J., 1968. Thrashing: its causes and prevention. In: Proceedings of the Fall Joint Computer Conference, Part...
  • Dong, Y., Fu, B., Yoshiki, N., 2008. DEM generation methods and applications in revealing of topographic changes caused...
  • Eklundh, L., et al., 1995. Rapid generation of digital elevation models from topographic maps. International Journal of Geographical Information Science.