ABSTRACT
As I/O-intensive MPI programs become increasingly common, many efforts have been made to improve their I/O performance, on both the software and the architecture sides. On the software side, researchers can optimize processes' access patterns, either individually (e.g., by using large and sequential requests in each process) or collectively (e.g., by using collective I/O). On the architecture side, files are striped over multiple I/O nodes for high aggregate I/O throughput. However, a key weakness, the access interference at each I/O node, remains unaddressed by these efforts. When requests from multiple processes are served simultaneously by multiple I/O nodes, each I/O node has to serve requests from different processes concurrently. Usually an I/O node stores its data on hard disks, and different processes access different regions of a data set. When a burst of requests arrives from multiple processes, requests from different processes to a disk compete for its single disk head, and disk efficiency can be significantly reduced by frequent disk-head seeks.
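To make the interference concrete, the minimal sketch below shows the kind of independent per-process access pattern described above; the file name, request size, and request count are illustrative assumptions, not taken from the paper. Each MPI process issues large, sequential reads to its own region of a shared file, yet an I/O node holding stripes touched by many processes must still interleave their requests and seek between distant disk regions.

/* Minimal sketch (file name and sizes are illustrative): each MPI
 * process issues independent reads to its own region of a shared file
 * that the parallel file system stripes across I/O nodes. Each process
 * is sequential from its own point of view, but an I/O node receiving
 * requests from many processes at once must seek between their
 * regions on its disks. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK (4 * 1024 * 1024)   /* 4 MB per request, illustrative */
#define NREQ  16                  /* requests per process, illustrative */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    char *buf = malloc(BLOCK);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* "shared.dat" is a placeholder for a file striped over the I/O nodes. */
    MPI_File_open(MPI_COMM_WORLD, "shared.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Each process reads a large, contiguous region of its own,
     * using independent (non-collective) requests. */
    for (int i = 0; i < NREQ; i++) {
        MPI_Offset off = ((MPI_Offset)rank * NREQ + i) * BLOCK;
        MPI_File_read_at(fh, off, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}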
In this paper, we propose a scheme, InterferenceRemoval, to eliminate I/O interference by exploiting the optimized access patterns and the potentially high throughput provided by multiple I/O nodes. It identifies the segments of files that could be involved in interfering accesses and replicates them to their respectively designated I/O nodes. When interference is detected at an I/O node, some I/O requests are redirected to the replicas on other I/O nodes, so that each I/O node serves requests from only one or a small number of processes. InterferenceRemoval has been implemented in the MPI library, for high portability, on top of the Lustre parallel file system. Our experiments with representative benchmarks, such as NPB BTIO and mpi-tile-io, show that it can significantly improve the I/O performance of MPI programs. For example, the I/O throughput of mpi-tile-io is increased by 105% compared to not using collective I/O, and by 23% compared to using collective I/O.
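InterferenceRemoval itself is implemented inside the MPI library; the sketch below is only a conceptual illustration of the redirection idea, not the paper's code. The helpers replica_path_for_rank() and interference_detected(), the replica file naming, and the use of MPI_COMM_SELF are all assumptions made for this example.

/* Conceptual sketch only (not the paper's implementation): when an
 * interference condition is detected, a process's requests can be
 * redirected to a replica of the relevant file segments placed on an
 * I/O node designated for that process. replica_path_for_rank() and
 * interference_detected() are hypothetical helpers for illustration. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical: a per-rank replica file, pre-created on a designated
 * I/O node (e.g., via file-system striping hints). */
static void replica_path_for_rank(int rank, char *path, size_t len)
{
    snprintf(path, len, "shared.dat.replica.%d", rank);
}

/* Hypothetical stand-in for the scheme's run-time interference
 * detection at the I/O nodes. */
static int interference_detected(void) { return 1; }

int redirected_read(MPI_File primary, int rank, MPI_Offset off,
                    void *buf, int count)
{
    if (interference_detected()) {
        char path[64];
        MPI_File rep;
        int rc;
        replica_path_for_rank(rank, path, sizeof(path));
        /* Read the same logical segment from the replica, so this rank's
         * requests are served by one designated I/O node instead of
         * competing on the nodes that hold the primary copy. */
        MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &rep);
        rc = MPI_File_read_at(rep, off, buf, count, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&rep);
        return rc;
    }
    return MPI_File_read_at(primary, off, buf, count, MPI_BYTE,
                            MPI_STATUS_IGNORE);
}

In the actual scheme, only the file segments identified as interference-prone are replicated, and redirection is triggered by interference detected at the I/O nodes rather than by an always-true predicate as in this sketch.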
REFERENCES
- M. Bhadkamkar, J. Guerra, L. Useche, S. Burnett, J. Liptak, R. Rangaswami, and V. Hristidis, "BORG: Block-reORGanization for Self-optimizing Storage Systems", In Proceedings of the 7th USENIX Conference on File and Storage Technologies, San Francisco, CA, 2009.
- A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp, "Efficient Structured Data Access in Parallel File System", In Proceedings of the IEEE International Conference on Cluster Computing, Hong Kong, China, 2003.
- A. Ching, A. Choudhary, K. Coloma, and W. Liao, "Noncontiguous I/O Accesses Through MPI-IO", In Proceedings of the IEEE International Symposium on Cluster, Cloud, and Grid Computing, Tokyo, Japan, 2003.
- Cluster File Systems, Inc., "Lustre: A Scalable, Robust, Highly-Available Cluster File System", http://www.lustre.org/. Online document, 2010.
- H. Huang, W. Hung, and K. Shin, "FS2: Dynamic Data Replication in Free Disk Space for Improving Disk Performance and Energy Consumption", In Proceedings of the ACM Symposium on Operating Systems Principles, Brighton, UK, 2005.
- W. Hsu, A. Smith, and H. Young, "The Automatic Improvement of Locality in Storage Systems", ACM Transactions on Computer Systems, Volume 23, Issue 4, Nov. 2006, Pages 424--473.
- W. Hsu, A. Smith, and H. Young, "The Automatic Improvement of Locality in Storage Systems", Technical Report CSD-03-1264, UC Berkeley, Jul. 2003.
- Interleaved or Random (IOR) Benchmarks, http://www.cs.dartmouth.edu/pario/examples.html. Online document, 2008.
- S. Iyer and P. Druschel, "Anticipatory Scheduling: A Disk Scheduling Framework to Overcome Deceptive Idleness in Synchronous I/O", In Proceedings of the ACM Symposium on Operating Systems Principles, Banff, Canada, 2001.
- D. Kotz, "Disk-directed I/O for MIMD Multiprocessors", ACM Transactions on Computer Systems, Volume 15, Issue 1, Feb. 1997, Pages 41--74.
- S. Liang, S. Jiang, and X. Zhang, "STEP: Sequentiality and Thrashing Detection Based Prefetching to Improve Performance of Networked Storage Servers", In Proceedings of the International Conference on Distributed Computing Systems, Toronto, Canada, 2007.
- Mpi-tile-io Benchmark, http://www-unix.mcs.anl.gov/thakur/pio-benchmarks.html. Online document, 2009.
- M. Kandemir, S. Son, and M. Karakoy, "Improving I/O Performance of Applications through Compiler-Directed Code Restructuring", In Proceedings of the 6th USENIX Conference on File and Storage Technologies, San Jose, CA, 2008.
- MPICH2, Argonne National Laboratory, http://www.mcs.anl.gov/research/projects/mpich2/. Online document, 2009.
- NAS Parallel Benchmarks, NASA Ames Research Center, http://www.nas.nasa.gov/Software/NPB/. Online document, 2009.
- PVFS, http://www.pvfs.org. Online document, 2010.
- P. Pacheco, "Parallel Programming with MPI", Morgan Kaufmann Publishers, Pages 137--178, 1997.
- K. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett, "Server-directed Collective I/O in Panda", In Proceedings of Supercomputing, San Diego, CA, 1995.
- F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters", In Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.
- R. Thakur, W. Gropp, and E. Lusk, "Data Sieving and Collective I/O in ROMIO", In Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, 1999.
- S3aSim I/O Benchmark, http://www-unix.mcs.anl.gov/thakur/s3asim.html. Online document, 2009.
- The DiskSim Simulation Environment (v4.0), Parallel Data Lab, http://www.pdl.cmu.edu/DiskSim/. Online document, 2009.
- Y. Wang and D. Kaeli, "Profile-Guided I/O Partitioning", In Proceedings of the International Conference on Supercomputing, San Francisco, CA, 2003.
- C. Wang, Z. Zhang, X. Ma, S. Vazhkudai, and F. Mueller, "Improving the Availability of Supercomputer Job Input Data Using Temporal Replication", In Proceedings of the International Supercomputing Conference, Hamburg, Germany, 2009.
- X. Zhang, S. Jiang, and K. Davis, "Making Resonance a Common Case: A High-performance Implementation of Collective I/O on Parallel File Systems", In Proceedings of the IEEE International Parallel & Distributed Processing Symposium, Rome, Italy, 2009.