He-P2012: Performance and Energy Exploration of Architecturally Heterogeneous Many-Cores

Journal of Signal Processing Systems

Abstract

The end of Dennardian scaling in advanced technologies has prompted new architectural templates that aim to overcome the so-called utilization wall and to provide Moore’s Law-like performance and energy scaling in embedded SoCs. One of the most promising templates, architectural heterogeneity, is hindered by high cost due to design space explosion and the lack of effective exploration tools. Our work provides three contributions towards a scalable and effective methodology for design space exploration in embedded MC-SoCs. First, we present the He-P2012 architecture, which augments the state-of-the-art STMicroelectronics P2012 platform with heterogeneous shared-L1 coprocessors called HW processing elements (HWPEs). Second, we propose a novel methodology for the semi-automatic definition and instantiation of shared-memory HWPEs from C source code, supporting both simple and structured data types. Third, we demonstrate that integrating HWPEs can provide significant performance and energy efficiency benefits on a set of benchmarks originally developed for the homogeneous P2012, achieving up to 123x speedup on the accelerated code region (~98% of the Amdahl’s law limit) while saving 2/3 of the energy.
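As a rough illustration of the Amdahl's law figure quoted above, the following Python sketch works backwards from the two numbers in the abstract (123x and ~98%). It is not taken from the paper: the function names are hypothetical, and it assumes the "limit" is the ideal bound 1 / (1 - p) for an accelerable fraction p, which is only one possible reading of the abstract. Under that reading, the implied limit is about 125.5x.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Speedup when a fraction p of the runtime is accelerated by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p: float) -> float:
    """Upper bound on the speedup as s grows without bound (requires p < 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


# Figures quoted in the abstract (both are rounded, so results are approximate).
measured_speedup = 123.0
fraction_of_limit = 0.98

# Limit implied by the two quoted figures.
implied_limit = measured_speedup / fraction_of_limit

# Assumption (not stated in the source): the limit is the ideal bound
# 1 / (1 - p), so the accelerable fraction is p = 1 - 1 / limit.
implied_p = 1.0 - 1.0 / implied_limit

if __name__ == "__main__":
    print(f"implied Amdahl limit: {implied_limit:.1f}x")
    print(f"implied accelerable fraction: {implied_p:.4f}")
```

The point of the sketch is only that a speedup this close to the bound leaves very little non-accelerated time in the measured region; the paper's own definition of the limit may differ.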

Notes

  1. Dehyadegari et al. [16] discuss the architectural tradeoffs between shared-memory and private-memory HWPEs in more detail.

References

  1. Khronos OpenCL website. http://www.khronos.org/opencl/.

  2. NVidia Fermi Compute Architecture Whitepaper. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.

  3. OpenCV website. http://opencv.org/.

  4. OpenMP website. http://www.openmp.org.

  5. Backus, J. (1978). Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs. Communications of the ACM, 21(8), 613–641. doi:10.1145/359576.359579.

  6. Benini, L., Flamand, E., Fuin, D., & Melpignano, D. (2012). P2012: building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator. In 2012 design, automation & test in europe conference & exhibition (DATE) (pp. 983–987). IEEE. doi:10.1109/DATE.2012.6176639. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6176639.

  7. Bosi, B., Bois, G., & Savaria, Y. (1999). Reconfigurable pipelined 2-D convolvers for fast digital signal processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(3), 299–308. doi:10.1109/92.784091. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=784091.

  8. Burgio, P., Marongiu, A., Heller, D., Chavet, C., Coussy, P., & Benini, L. (2012). OpenMP-based synergistic parallelization and HW acceleration for on-chip shared-memory clusters. In 15th euromicro conference on digital system design: architectures, methods & tools, Turkey (2012). doi:10.1109/DSD.2012.97. http://hal.archives-ouvertes.fr/hal-00721366/ (pp. 751–758).

  9. Clermidy, F., Lemaire, R., Popon, X., Ktenas, D., & Thonnart, Y. (2009). An open and reconfigurable platform for 4G telecommunication: concepts and application. In 2009 12th euromicro conference on digital system design, architectures, methods and tools. doi:10.1109/DSD.2009.200. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5350210 (pp. 449–456).

  10. Cong, J., Ghodrat, M.A., Gill, M., Grigorian, B., & Reinman, G. (2012). Architecture support for accelerator-rich CMPs. In Proceedings of the 49th annual design automation conference - DAC 2012. doi:10.1145/2228360.2228512 (p. 843).

  11. Conti, F., Marongiu, A., & Benini, L. (2013). Synthesis-friendly techniques for tightly-coupled integration of hardware accelerators into shared-memory multi-core clusters. In 2013 international conference on hardware/software codesign and system synthesis (CODES+ISSS). IEEE. doi:10.1109/CODES-ISSS.2013.6658992. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6658992 (pp. 1–10).

  12. Conti, F., Pilkington, C., Marongiu, A., & Benini, L. (2014). He-P2012: architectural heterogeneity exploration on a scalable many-core platform. In Proceedings of 25th IEEE conference on application-specific architectures and processors.

  13. Conti, F., Pullini, A., & Benini, L. (2014). Brain-inspired classroom occupancy monitoring on a low-power mobile platform. In CVPR 2014 workshops.

  14. Conti, F., Rossi, D., Pullini, A., Loi, I., & Benini, L. (2014). Energy-efficient vision on the PULP platform for ultra-low power parallel computing. In Proceedings of 2014 IEEE Workshop on Signal Processing Systems (SiPS). http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6986099.

  15. De Micheli, G., Ernst, R., & Wolf, W. (Eds.) (2002). Readings in hardware/software co-design. Norwell: Kluwer.

  16. Dehyadegari, M., Marongiu, A., Kakoee, M., Mohammadi, S., Yazdani, N., & Benini, L. (2014). Architecture support for tightly-coupled multi-core clusters with shared-memory HW accelerators. IEEE Transactions on Computers, 64(8), 2132–2144. doi:10.1109/TC.2014.2360522. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6915684.

  17. Dehyadegari, M., Marongiu, A., Kakoee, M.R., Benini, L., Mohammadi, S., & Yazdani, N. (2012). A tightly-coupled multi-core cluster with shared-memory HW accelerators. In 2012 international conference on embedded computer systems (SAMOS). doi:10.1109/SAMOS.2012.6404162. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6404162 (pp. 96–103).

  18. Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2011). Dark silicon and the end of multicore scaling. In Proceedings of the 38th annual international symposium on computer architecture - ISCA ’11. doi:10.1145/2000064.2000108 (p. 365).

  19. Farabet, C., Martini, B., Corda, B., Akselrod, P., Culurciello, E., & LeCun, Y. (2011). NeuFlow: a runtime reconfigurable dataflow processor for vision. In CVPR 2011 Workshops. IEEE. doi:10.1109/CVPRW.2011.5981829. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5981829 (pp. 109–116).

  20. Gokhale, V., Jin, J., Dundar, A., Martini, B., & Culurciello, E. (2014). A 240 G-ops/s mobile coprocessor for deep neural networks. In CVPR 2014 Workshops.

  21. Goulding-Hotta, N., Sampson, J., Venkatesh, G., Garcia, S., Auricchio, J., Huang, P.C., Arora, M., Nath, S., Bhatt, V., Babb, J., Swanson, S., & Taylor, M. (2011). The GreenDroid mobile application processor: an architecture for silicon’s dark future. IEEE Micro, 31(2), 86–95. doi:10.1109/MM.2011.18. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5719585.

  22. Ienne, P., & Leupers, R. (Eds.) (2007). Customizable embedded processors. Burlington: Morgan Kaufmann.

  23. Jalier, C., Lattard, D., Jerraya, A.A., Sassatelli, G., Benoit, P., & Torres, L. (2010). Heterogeneous vs homogeneous MPSoC approaches for a mobile LTE modem. In 2010 design, automation & test in europe conference & exhibition (DATE 2010). doi:10.1109/DATE.2010.5457213. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5457213 (pp. 184–189).

  24. Kalray: Kalray MPPA 256. http://www.kalray.eu/products/mppa-manycore/mppa-256/.

  25. Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., & Weinberger, K.Q. (Eds.), Advances in neural information processing systems. Curran Associates, Inc., (Vol. 25 pp. 1097–1105).

  26. Magno, M., Tombari, F., Brunelli, D., Di Stefano, L., & Benini, L. (2009). Multimodal abandoned/removed object detection for low power video surveillance systems. In 2009 6th IEEE international conference on advanced video and signal based surveillance. doi:10.1109/AVSS.2009.72. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5280160 (pp. 188–193).

  27. Marongiu, A., Capotondi, A., Tagliavini, G., & Benini, L. (2013). Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP. In Proceedings of the 1st international workshop on many-core embedded systems - MES ’13. doi:10.1145/2489068.2489069 (pp. 1–8). New York: ACM Press.

  28. Movidius: Innovation in mobile computational imaging.

  29. Park, S., Maashri, A.A., Irick, K.M., Chandrashekhar, A., Cotter, M., Chandramoorthy, N., Debole, M., & Narayanan, V. (2012). System-On-Chip for biologically inspired vision applications. IPSJ Transactions on System LSI Design Methodology, 5, 71–95. doi:10.2197/ipsjtsldm.5.71. http://www.cse.psu.edu/nic5090/HMAP/sldm.pdf.

  30. Patterson, D. (2014). Compute cores. AMD Whitepaper.

  31. Paulin, P.G., & Pilkington, C. (2002). StepNP: a system-level exploration platform for network processors.

  32. Paulin, P.G., Pilkington, C., Langevin, M., Bensoudane, E., Lyonnard, D., Benny, O., Lavigueur, B., Lo, D., Beltrame, G., Gagné, V., & Nicolescu, G. (2006). Parallel Programming models for a multiprocessor SoC platform applied to networking and multimedia, 14(7), 667–680.

  33. Pell, O., Bower, J., Dimond, R., Mencer, O., & Flynn, M.J. (2013). Finite-Difference wave propagation modeling on special-purpose dataflow machines. IEEE Transactions on Parallel and Distributed Systems, 24(5), 906–915. doi:10.1109/TPDS.2012.198. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6226384.

  34. Pilato, C., Ferrandi, F., & Sciuto, D. (2011). A design methodology to implement memory accesses in high-level synthesis. In 2011 proceedings of the 9th international conference hardware/software codesign and system synthesis (CODES+ISSS). http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6062318 (pp. 49–58).

  35. Plurality: The HyperCore Architecture White Paper (2010).

  36. Rahimi, A., Loi, I., Kakoee, M.R., & Benini, L. (2011). A fully-synthesizable single-cycle interconnection network for Shared-L1 processor clusters. In 2011 design, automation & test in Europe. (pp. 1–6). doi:10.1109/DATE.2011.5763085. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5763085.

  37. Ramacher, U. (2007). Software-Defined radio prospects for multistandard mobile phones. Computer, 40(10), 62–69. doi:10.1109/MC.2007.362. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4343691.

  38. Rosten, E., & Drummond, T. (2005). Fusing points and lines for high performance tracking. In 10th IEEE international conference on computer vision (ICCV’05) volume 1, (Vol. 2 pp. 1508–1515). doi:10.1109/ICCV.2005.104. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1544896.

  39. Rosten, E., & Drummond, T. (2006). Machine learning for high-speed corner detection. Computer Vision – ECCV 2006, (pp. 1–14). doi:10.1007/11744023_34.

  40. Sabarad, J., Kestur, S., Dantara, D., Narayanan, V., & Khosla, D. (2012). A reconfigurable accelerator for neuromorphic object recognition. In 17th Asia and south pacific design automation conference (pp. 813–818). IEEE. doi:10.1109/ASPDAC.2012.6165067. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6165067.

  41. Shagrithaya, K., Kępa, K., & Athanas, P. (2013). Enabling development of OpenCL applications on FPGA platforms, (pp. 26–30).

  42. Silicon Hive: Silicon Hive website.

  43. Synopsys: Processor Designer Datasheet.

  44. Taylor, M.B. (2012). Is dark silicon useful? In Proceedings of the 49th annual design automation conference on - DAC ’12 (p. 1131). New York: ACM Press. doi:10.1145/2228360.2228567. http://dl.acm.org/citation.cfm?id=2228567.

  45. Tensilica: Xtensa Architecture and Performance White Paper (October) (2005).

  46. Venkatesh, G., Sampson, J., Goulding, N., Garcia, S., Bryksin, V., Lugo-Martinez, J., Swanson, S., & Taylor, M.B. (2010). Conservation cores. In Proceedings of the 15th edition of ASPLOS on architectural support for programming languages and operating systems - ASPLOS ’10 (p. 205). New York: ACM Press. doi:10.1145/1736020.1736044. http://dl.acm.org/citation.cfm?id=1736044.

  47. Villarreal, J., Park, A., Najjar, W., & Halstead, R. (2010). Designing modular hardware accelerators in C with ROCCC 2.0. In 2010 18th IEEE annual international symposium on field-programmable custom computing machines (pp. 127–134). doi:10.1109/FCCM.2010.28. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5474060.

  48. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition, 2001. CVPR 2001, 1, I–511–I–518. doi:10.1109/CVPR.2001.990517. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=990517.

  49. Wang, P.H., Collins, J.D., Chinya, G.N., Jiang, H., Tian, X., Girkar, M., Yang, N.Y., Lueh, G.Y., & Wang, H. (2007). EXOCHI. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation - PLDI ’07 (p. 156). New York: ACM Press. doi:10.1145/1250734.1250753. http://dl.acm.org/citation.cfm?id=1250753.

Acknowledgments

This work was supported by the EU Projects P-SOCRATES (FP7-611016) and PHIDIAS (FP7-318013).

Author information

Corresponding author

Correspondence to Francesco Conti.

About this article

Cite this article

Conti, F., Marongiu, A., Pilkington, C. et al. He-P2012: Performance and Energy Exploration of Architecturally Heterogeneous Many-Cores. J Sign Process Syst 85, 325–340 (2016). https://doi.org/10.1007/s11265-015-1056-7

  • DOI: https://doi.org/10.1007/s11265-015-1056-7
