ABSTRACT
Nested data-parallelism (NDP) is a declarative style for programming irregular parallel applications. NDP languages provide language features favoring the NDP style, efficient compilation of NDP programs, and various common NDP operations like parallel maps, filters, and sum-like reductions. In this paper, we describe the implementation of NDP in Parallel ML (PML), part of the Manticore project. Managing the parallel decomposition of work is one of the main challenges of implementing NDP. If the decomposition creates too many small chunks of work, performance will be eroded by too much parallel overhead. If, on the other hand, there are too few large chunks of work, there will be too much sequential processing and processors will sit idle.
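The granularity tradeoff described above can be illustrated with a simple chunked parallel sum. This is a Python sketch, not Manticore's implementation; the fixed `chunk_size` parameter is the illustrative knob — too small and the task count (and scheduling overhead) explodes, too large and idle processors go unfed on irregular work:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(xs, chunk_size):
    """Sum xs by cutting it into fixed-size chunks and reducing each
    chunk as a separate task.  Returns (total, number_of_tasks) so the
    overhead side of the tradeoff is visible."""
    chunks = [xs[i:i + chunk_size] for i in range(0, len(xs), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials), len(chunks)

total, tasks = parallel_sum(list(range(1000)), chunk_size=100)
# 1000 elements at chunk_size=100 yields 10 tasks
```

Static chunking like this is exactly what dynamic schemes such as Lazy Binary Splitting avoid: the decomposition is decided at run time, based on whether other workers are actually idle.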
Recently the technique of Lazy Binary Splitting was proposed for dynamic parallel decomposition of work on flat arrays, with promising results. We adapt Lazy Binary Splitting to parallel processing of binary trees, which we use to represent parallel arrays in PML. We call our technique Lazy Tree Splitting (LTS). One of its main advantages is its performance robustness: per-program tuning is not required to achieve good performance across varying platforms. We describe LTS-based implementations of standard NDP operations, and we present experimental data demonstrating the scalability of LTS across a range of benchmarks.
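The core splitting decision can be sketched as follows. This is a much-simplified Python illustration, not PML code: the tuple-based rope representation and the `hungry` predicate are stand-ins for Manticore's ropes and its test for idle workers, and a real scheduler would also poll for hunger while processing leaf elements rather than only at internal nodes:

```python
def seq_map(f, node):
    """Map f over a rope -- either ('leaf', elements) or
    ('cat', left, right) -- entirely sequentially."""
    if node[0] == 'leaf':
        return [f(x) for x in node[1]]
    return seq_map(f, node[1]) + seq_map(f, node[2])

def map_lts(f, rope, hungry):
    """Lazy splitting: if no worker is hungry, process the whole
    subtree sequentially with zero splitting overhead; otherwise
    split the node so its halves become separately stealable tasks
    (here both halves are just processed recursively)."""
    if rope[0] == 'leaf' or not hungry():
        return seq_map(f, rope)
    # In a real scheduler the right half would be pushed to a
    # work-stealing deque for another worker to claim.
    return map_lts(f, rope[1], hungry) + map_lts(f, rope[2], hungry)
```

Either extreme of the `hungry` predicate produces the same result; what changes is how many stealable tasks are exposed, which is precisely the robustness property the paper claims for LTS.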