Abstract
Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore and, subsequently, many-core architectures. This work identifies key architecture and software optimizations for attaining high performance from tiled many-core architectures (TMAs), an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. TMAs suit these embedded applications because many of them incorporate low-power design features. We discuss performance optimizations on a single tile (processor core) as well as parallel performance optimizations for TMAs, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication. We elaborate on compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate performance and performance-per-watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.






Acknowledgements
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Space and Naval Warfare Systems Command (SPAWAR N66001-11-1-4103), the Office of Naval Research (ONR R16480), and the National Science Foundation (NSF) (CNS-0953447 and CNS-0905308). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSERC, the SPAWAR, the ONR, and the NSF. Furthermore, the views expressed are those of the author(s) and do not reflect the official policy or position of the Department of Defense or the US Government. We would like to acknowledge Dr. Alan D. George, Director of the NSF Center of High-Performance Reconfigurable Computing (CHREC) at the University of Florida, Gainesville, Florida, USA, for providing access to CHREC resources and Tilera’s TILE64 and TILEPro64 for this work as well as discussions on high-performance computing with the leading author of this article.
Appendix: Matrix multiplication algorithms’ code snippets for Tilera’s TILEPro64
This appendix provides code snippets of our matrix multiplication algorithms for Tilera’s TILEPro64. The snippets are presented selectively to convey how the algorithms work, and some portions of the code are omitted for conciseness.
A.1 Serial non-blocked matrix multiplication algorithm
A.1.1 SerialNonBlockedMM.h

A.1.2 SerialNonBlockedMM.c
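The idea behind a serial non-blocked multiplication can be illustrated with a minimal sketch. This is not the original listing: the function name `matmul_nonblocked`, the `double` element type, and the row-major square layout are all assumptions made for illustration.

```c
#include <stddef.h>

/* Illustrative sketch only: the name, signature, and element type are
 * assumptions, not taken from SerialNonBlockedMM.c.
 * Computes C = A * B for n x n row-major matrices with the plain
 * triple loop and no cache blocking. */
void matmul_nonblocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            /* Dot product of row i of A with column j of B. */
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The inner loop walks B with stride n, so for large n nearly every access to B touches a different cache line. That access pattern is what cache blocking is meant to avoid.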

A.2 Serial blocked matrix multiplication algorithm
A.2.1 SerialBlockedMM.h

A.2.2 SerialBlockedMM.c
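Cache blocking, as named in the abstract, can be sketched as follows. This is a hedged illustration rather than the original code: `matmul_blocked`, the block-size parameter `bs`, and the `double` element type are assumptions, and a suitable `bs` depends on the cache sizes of the target tile.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: names and signature are assumptions, not
 * taken from SerialBlockedMM.c. Requires bs > 0. */
static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Computes C = A * B for n x n row-major matrices, processing the
 * operands in bs x bs sub-blocks so that each sub-block is reused
 * while it is still likely to be cache resident. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    memset(c, 0, n * n * sizeof *c);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_sz(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_sz(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_sz(jj + bs, n);
                /* Multiply-accumulate one pair of sub-blocks. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The `min_sz` clamps let `n` be any size, not only a multiple of `bs`. The i-k-j order inside a block keeps the innermost accesses to B and C at unit stride.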

A.3 Parallel blocked matrix multiplication algorithm
A.3.1 ParallelBlockedMM.h

A.3.2 ParallelBlockedMM.c
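One way to decompose the blocked algorithm across cores is to give each worker its own band of block rows. The sketch below is not the original TILEPro64 code: it swaps in OpenMP as a portable stand-in for whatever tile-level threading the original uses, and `matmul_parallel_blocked` and its parameters are assumed names.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: OpenMP is a stand-in chosen for
 * portability, and the names are assumptions, not taken from
 * ParallelBlockedMM.c. Requires bs > 0.
 * Each iteration of the outer loop owns a distinct band of rows of C,
 * so workers never write to the same element. */
void matmul_parallel_blocked(int n, int bs,
                             const double *a, const double *b, double *c)
{
    memset(c, 0, (size_t)n * (size_t)n * sizeof *c);

    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < n; ii += bs) {
        int i_end = ii + bs < n ? ii + bs : n;
        for (int kk = 0; kk < n; kk += bs) {
            int k_end = kk + bs < n ? kk + bs : n;
            for (int jj = 0; jj < n; jj += bs) {
                int j_end = jj + bs < n ? jj + bs : n;
                for (int i = ii; i < i_end; i++) {
                    for (int k = kk; k < k_end; k++) {
                        double aik = a[(size_t)i * n + k];
                        for (int j = jj; j < j_end; j++)
                            c[(size_t)i * n + j] += aik * b[(size_t)k * n + j];
                    }
                }
            }
        }
    }
}
```

When compiled without OpenMP support the pragma is ignored and the function runs serially with the same result. With it enabled (for example `-fopenmp` on GCC), the row bands are distributed across threads.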

A.4 Parallel blocked Cannon’s algorithm for matrix multiplication
A.4.1 ParallelBlockedCannonMM.h

A.4.2 ParallelBlockedCannonMM.c
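The block schedule of Cannon’s algorithm can be sketched by simulating a p x p grid of tiles in a single process. This is an assumption-laden illustration, not the original code: `matmul_cannon` and `block_mac` are hypothetical names, the matrix order is assumed divisible by p, and the neighbor-to-neighbor block transfers of a real tiled implementation are replaced here by index arithmetic.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: names are assumptions, not taken from
 * ParallelBlockedCannonMM.c.
 * C_blk += A_blk * B_blk for nb x nb blocks stored inside n x n
 * row-major matrices (leading dimension n). */
static void block_mac(size_t n, size_t nb,
                      const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < nb; i++)
        for (size_t k = 0; k < nb; k++)
            for (size_t j = 0; j < nb; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Simulates Cannon's algorithm on a p x p grid of tiles.
 * After the initial skew and `step` circular shifts, tile (r, col)
 * holds A block (r, k) and B block (k, col) with k = (r + col + step) mod p.
 * Returns 0 on success, -1 if p is 0 or n is not divisible by p. */
int matmul_cannon(size_t n, size_t p,
                  const double *a, const double *b, double *c)
{
    if (p == 0 || n % p != 0)
        return -1;
    size_t nb = n / p;
    memset(c, 0, n * n * sizeof *c);

    for (size_t step = 0; step < p; step++) {
        /* On real hardware all p * p tiles execute this body concurrently. */
        for (size_t r = 0; r < p; r++) {
            for (size_t col = 0; col < p; col++) {
                size_t k = (r + col + step) % p;
                block_mac(n, nb,
                          a + (r * nb) * n + k * nb,
                          b + (k * nb) * n + col * nb,
                          c + (r * nb) * n + col * nb);
            }
        }
    }
    return 0;
}
```

Over the p steps each tile sees every value of k exactly once, so each block of C accumulates the full sum of block products. In each step a tile needs only the blocks held by its left and lower neighbors, which is why the algorithm maps naturally onto a mesh of tiles.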

Cite this article
Munir, A., Koushanfar, F., Gordon-Ross, A. et al. High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study. J Supercomput 66, 431–487 (2013). https://doi.org/10.1007/s11227-013-0916-9
DOI: https://doi.org/10.1007/s11227-013-0916-9