ABSTRACT
With recent advances in data collection technologies such as remote sensing and global positioning systems, the amount of spatial data being produced has been increasing at a staggering rate. Simultaneously, computing is shifting from single-core to multi-core processors. To effectively use the computational power afforded by this new generation of processors for data-intensive geospatial applications, parallel computing techniques need to be employed. Parallel computing, however, raises new challenges associated with handling the input and output of spatial data in parallel. This paper describes a Parallel Input/Output System (PIOS) that addresses the challenges of handling large amounts of diverse spatial data. PIOS is based on a hierarchical structure that uses a scalable file partitioning strategy and combines data and metadata to enable efficient parallel handling of terabyte-scale data sets. A spatially explicit agent-based model is developed as a case study. Computational experiments were conducted on a supercomputer supported by the National Science Foundation. PIOS achieved a tenfold speedup in parallel input/output time and was shown to scale efficiently to over one thousand processing cores and to handle multiple terabytes of data.
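The abstract does not spell out the file partitioning strategy, so the following is only an illustrative sketch of one common approach, not the PIOS design itself. It assumes a row-major raster stored in a single binary file and splits it into contiguous row blocks, one per process, so that each process can compute its own byte range independently. The names `Partition` and `partition_rows` and all parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    rank: int       # hypothetical process index
    first_row: int  # first raster row owned by this rank
    n_rows: int     # number of rows owned by this rank
    offset: int     # byte offset of the block within the file
    n_bytes: int    # byte length of the block


def partition_rows(n_rows, n_cols, cell_bytes, n_ranks, header_bytes=0):
    """Split a row-major raster into contiguous row blocks, one per rank.

    Assumption: rows are stored back to back after an optional header,
    so a block of rows maps to one contiguous byte range.
    """
    base, extra = divmod(n_rows, n_ranks)
    row_bytes = n_cols * cell_bytes
    parts = []
    first = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional row each,
        # so block sizes differ by at most one row.
        rows = base + (1 if rank < extra else 0)
        parts.append(
            Partition(
                rank=rank,
                first_row=first,
                n_rows=rows,
                offset=header_bytes + first * row_bytes,
                n_bytes=rows * row_bytes,
            )
        )
        first += rows
    return parts


if __name__ == "__main__":
    # Example: a 10 x 4 raster of 4-byte cells shared by 3 processes.
    for part in partition_rows(10, 4, 4, 3):
        print(part)
```

Because each byte range depends only on the rank and the raster dimensions, every process can seek to its own block without coordinating with the others. A real system would also have to handle metadata, non-raster data, and file-system striping, none of which this sketch covers.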
Index Terms
- A parallel input-output system for resolving spatial data challenges: an agent-based model case study