ABSTRACT
Many-Task Computing (MTC) is a new application category that encompasses increasingly popular applications in biology, economics, and statistics. The high inter-task parallelism and data-intensive processing demands of these applications pose new challenges to existing supercomputer hardware-software stacks. These challenges include resource provisioning; task dispatching, dependency resolution, and load balancing; data management; and resilience.
This paper examines the characteristics of MTC applications that create these challenges, and identifies related gaps in the middleware that supports these applications on extreme-scale systems. Based on this analysis, we propose AME, an Anyscale MTC Engine, which addresses the scalability aspects of these gaps. We describe the AME framework and present performance results for both synthetic benchmarks and real applications. Our results show that AME's dispatching performance scales linearly, reaching 14,120 tasks/second on 16,384 cores with high efficiency. The overhead of the intermediate data management scheme does not increase significantly as the system grows to 16,384 cores. For the Montage astronomy application running on 2,048 cores, AME eliminates 73% of the file transfer between compute nodes and the global filesystem. Our results indicate that AME scales well on today's petascale machines and is a strong candidate for exascale machines.
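As a rough reading of the dispatch figure above, the short Python sketch below converts the reported rate into a per-core rate and into the shortest average task length at which that rate could keep every core busy. It assumes one task per core at a time and a steady stream of ready tasks; both are simplifying assumptions made here, not statements from the paper, and the helper name `min_task_seconds` is hypothetical.

```python
def min_task_seconds(dispatch_rate: float, cores: int) -> float:
    """Shortest average task duration (in seconds) at which a dispatcher
    issuing `dispatch_rate` tasks/second can keep `cores` cores busy.

    Assumption (not from the paper): each core runs one task at a time.
    """
    if dispatch_rate <= 0 or cores <= 0:
        raise ValueError("dispatch_rate and cores must be positive")
    return cores / dispatch_rate


# Figures quoted in the abstract.
REPORTED_RATE = 14_120   # tasks per second
REPORTED_CORES = 16_384  # cores

per_core_rate = REPORTED_RATE / REPORTED_CORES
print(f"per-core dispatch rate: {per_core_rate:.2f} tasks/s")
print(f"minimum average task length: "
      f"{min_task_seconds(REPORTED_RATE, REPORTED_CORES):.2f} s")
```

Under these assumptions the reported rate corresponds to a little under one task per second per core, so tasks averaging roughly 1.16 seconds or longer would not be limited by dispatch throughput at this scale.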
Index Terms
- AME: an anyscale many-task computing engine