High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Munir, Arslan; Koushanfar, Farinaz; Gordon-Ross, Ann; Ranka, Sanjay

doi:10.1007/s11227-013-0916-9

Published: 05 April 2013

The Journal of Supercomputing, Volume 66, pages 431–487 (2013)

Arslan Munir¹, Farinaz Koushanfar¹, Ann Gordon-Ross²,³ & Sanjay Ranka⁴

Abstract

Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore and, subsequently, many-core architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs), an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. TMAs are suitable for these embedded applications because many TMAs incorporate low-power design features. We discuss performance optimizations on a single tile (processor core) as well as parallel performance optimizations for TMAs, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication. We elaborate on compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate performance and performance-per-watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Space and Naval Warfare Systems Command (SPAWAR N66001-11-1-4103), the Office of Naval Research (ONR R16480), and the National Science Foundation (NSF) (CNS-0953447 and CNS-0905308). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSERC, the SPAWAR, the ONR, or the NSF. Furthermore, the views expressed are those of the author(s) and do not reflect the official policy or position of the Department of Defense or the US Government. We would like to acknowledge Dr. Alan D. George, Director of the NSF Center for High-Performance Reconfigurable Computing (CHREC) at the University of Florida, Gainesville, Florida, USA, for providing access to CHREC resources and Tilera’s TILE64 and TILEPro64 for this work, as well as for discussions on high-performance computing with the lead author of this article.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
Arslan Munir & Farinaz Koushanfar
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
Ann Gordon-Ross
NSF Center for High-Performance Reconfigurable Computing (CHREC), University of Florida, Gainesville, FL, USA
Ann Gordon-Ross
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Sanjay Ranka

Corresponding author

Correspondence to Arslan Munir.

Appendix: Matrix multiplication algorithms’ code snippets for Tilera’s TILEPro64

This appendix provides code snippets of our matrix multiplication algorithms for Tilera’s TILEPro64. The snippets are presented selectively to convey how the algorithms work; some portions of the code are omitted for conciseness.

A.1 Serial non-blocked matrix multiplication algorithm

A.1.1 SerialNonBlockedMM.h

A.1.2 SerialNonBlockedMM.c
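
The following is not the authors' listing. It is a minimal sketch of what a serial non-blocked matrix multiplication typically looks like, written in portable C rather than against the Tilera toolchain. The function name `mm_serial_nonblocked`, the `int` element type, and the row-major square-matrix layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Naive triple-loop product C = A * B for n x n row-major matrices.
 * Hypothetical helper for illustration; not taken from SerialNonBlockedMM.c.
 */
void mm_serial_nonblocked(size_t n, const int *a, const int *b, int *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int sum = 0;
            /* The inner loop walks a row of A and a column of B. */
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The column-wise walk over B in the inner loop has poor spatial locality for large n, which is the behavior that cache blocking is meant to address.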

A.2 Serial blocked matrix multiplication algorithm

A.2.1 SerialBlockedMM.h

A.2.2 SerialBlockedMM.c
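
The following is not the authors' listing. It is a hedged sketch of cache blocking (loop tiling) applied to matrix multiplication, in portable C. The function name `mm_serial_blocked`, the block-size parameter `bs`, the `int` element type, and the row-major layout are assumptions for illustration; a suitable block size depends on the cache of the target tile and is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/*
 * Blocked product C = A * B for n x n row-major matrices.
 * The iteration space is cut into bs x bs sub-blocks so that the
 * sub-blocks being combined can stay resident in cache.
 * Hypothetical helper for illustration; not taken from SerialBlockedMM.c.
 */
void mm_serial_blocked(size_t n, size_t bs, const int *a, const int *b, int *c)
{
    assert(bs > 0);
    memset(c, 0, n * n * sizeof *c);

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_sz(ii + bs, n);      /* handles n not divisible by bs */
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_sz(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_sz(jj + bs, n);
                /* Multiply block A(ii,kk) by block B(kk,jj) into C(ii,jj). */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        int aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The i-k-j ordering inside each block keeps the innermost accesses to B and C contiguous in memory; other orderings are possible and this choice is only one reasonable option.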

A.3 Parallel blocked matrix multiplication algorithm

A.3.1 ParallelBlockedMM.h

A.3.2 ParallelBlockedMM.c
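
The following is not the authors' listing and does not use any Tilera API. It is a sketch of one common way to decompose a blocked matrix multiplication across workers: each worker owns a contiguous band of rows of C and runs the blocked kernel on that band only. The function name `mm_parallel_blocked_worker`, the row-band decomposition, the `worker`/`num_workers` parameters, and the `int` element type are assumptions for illustration; how workers are launched and pinned to tiles is left to whatever runtime is in use.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/*
 * Work done by one worker out of num_workers for C = A * B (n x n, row-major).
 * Rows of C are split as evenly as possible; the first (n % num_workers)
 * workers receive one extra row. Workers write disjoint rows of C, so no
 * synchronization on C is needed between them.
 * Hypothetical helper for illustration; not taken from ParallelBlockedMM.c.
 */
void mm_parallel_blocked_worker(size_t n, size_t bs,
                                size_t worker, size_t num_workers,
                                const int *a, const int *b, int *c)
{
    assert(bs > 0 && num_workers > 0 && worker < num_workers);

    size_t base = n / num_workers;
    size_t extra = n % num_workers;
    size_t row_begin = worker * base + min_sz(worker, extra);
    size_t row_end = row_begin + base + (worker < extra ? 1 : 0);

    /* Clear only the rows this worker owns. */
    for (size_t i = row_begin; i < row_end; i++)
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0;

    /* Blocked kernel restricted to the owned row band. */
    for (size_t kk = 0; kk < n; kk += bs) {
        size_t k_end = min_sz(kk + bs, n);
        for (size_t jj = 0; jj < n; jj += bs) {
            size_t j_end = min_sz(jj + bs, n);
            for (size_t i = row_begin; i < row_end; i++) {
                for (size_t k = kk; k < k_end; k++) {
                    int aik = a[i * n + k];
                    for (size_t j = jj; j < j_end; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
    }
}
```

Calling the function once per worker index, whether sequentially or from separate threads or processes that share A, B, and C, yields the full product.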

A.4 Parallel blocked Cannon’s algorithm for matrix multiplication

A.4.1 ParallelBlockedCannonMM.h

A.4.2 ParallelBlockedCannonMM.c
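
The following is not the authors' listing and does not use any Tilera API. It is a single-process simulation of the block schedule of Cannon’s algorithm on a logical q x q grid of tiles, intended only to show which blocks each tile multiplies at each step. The function names, the requirement that n be divisible by q, and the `int` element type are assumptions for illustration. No data are actually moved between tiles here: the index `k` records which block of A and B a tile would hold after the initial skew and the subsequent shifts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C(r,col) += A(r,k) * B(k,col), where each index names an nb x nb block. */
static void block_mac(size_t n, size_t nb, size_t r, size_t k, size_t col,
                      const int *a, const int *b, int *c)
{
    for (size_t i = r * nb; i < (r + 1) * nb; i++) {
        for (size_t kk = k * nb; kk < (k + 1) * nb; kk++) {
            int aik = a[i * n + kk];
            for (size_t j = col * nb; j < (col + 1) * nb; j++)
                c[i * n + j] += aik * b[kk * n + j];
        }
    }
}

/*
 * Simulated Cannon schedule for C = A * B (n x n, row-major) on a q x q grid.
 * Step 0 corresponds to the initial skew: tile (r,col) holds A(r,(r+col)%q)
 * and B((r+col)%q,col). Each later step corresponds to shifting A blocks one
 * tile left and B blocks one tile up, which advances k by one (mod q).
 * Hypothetical helper for illustration; not taken from ParallelBlockedCannonMM.c.
 */
void mm_cannon_simulated(size_t n, size_t q, const int *a, const int *b, int *c)
{
    assert(q > 0 && n % q == 0);
    size_t nb = n / q;

    memset(c, 0, n * n * sizeof *c);

    for (size_t step = 0; step < q; step++) {
        /* On real hardware every tile (r,col) would run this body concurrently. */
        for (size_t r = 0; r < q; r++) {
            for (size_t col = 0; col < q; col++) {
                size_t k = (r + col + step) % q;
                block_mac(n, nb, r, k, col, a, b, c);
            }
        }
    }
}
```

Over the q steps, k takes every value from 0 to q-1 exactly once for each tile, so each tile accumulates the complete sum for its block of C while only ever exchanging blocks with its immediate neighbors.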

About this article

Cite this article

Munir, A., Koushanfar, F., Gordon-Ross, A. et al. High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study. J Supercomput 66, 431–487 (2013). https://doi.org/10.1007/s11227-013-0916-9

Published: 05 April 2013
Issue Date: October 2013
DOI: https://doi.org/10.1007/s11227-013-0916-9
