A Novel Object-Oriented Software Cache for Scratchpad-Based Multi-Core Clusters

Journal of Signal Processing Systems

Abstract

A widely adopted design paradigm for many-core accelerators features processing elements grouped in clusters. For reasons of area, power and design simplicity, processors in the same cluster are often not equipped with data caches but rather share a tightly coupled data memory (TCDM). Even though a TCDM is more energy and area efficient than a cache, it requires a higher programming effort because memory transfers need to be explicitly managed (often with DMA-based copies from off-chip memory to the TCDM). In this context, software caches can be used to automatically transfer data between the local TCDM and the external memory, simplifying the task of the programmer. Despite their ease of use, software caches may incur non-negligible overheads due to repeated invocations of the cache runtime. Many-core systems, however, are often used today for applications in which the unit of computation is not the single memory word but rather a more complex data object, opening room for optimization in software cache runtimes. A good example is computer vision, where the computation involves multi-byte objects (e.g., feature descriptors). In this paper we present a software cache implementation for the STMicroelectronics STHORM acceleration fabric, with special focus on object-oriented caching techniques aimed at reducing as much as possible the global overhead introduced by the proposed software cache. Our software cache is validated by a set of experiments and by three case studies of object-oriented software caching for computer vision applications.
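To make the overhead argument concrete, the following is a minimal sketch of a read-only, direct-mapped software cache, written as a Python simulation rather than as code for the STHORM fabric. It is not the implementation presented in the paper: the class `SoftwareCache`, the helpers `sum_word_level` and `sum_object_level`, the constants `LINE_WORDS` and `NUM_LINES`, and the assumption that each object occupies exactly one cache line are all hypothetical and chosen only for illustration. A Python list stands in for external memory, and a list slice stands in for the DMA transfer.

```python
"""Toy model of a read-only, direct-mapped software cache (illustrative only)."""

LINE_WORDS = 8   # words per cache line; assumed equal to the object size
NUM_LINES = 4    # cache lines held in the simulated TCDM (assumed)


class SoftwareCache:
    def __init__(self, external_memory):
        self.ext = external_memory        # stands in for off-chip memory
        self.tags = [None] * NUM_LINES    # line number currently held by each slot
        self.data = [[0] * LINE_WORDS for _ in range(NUM_LINES)]  # simulated TCDM
        self.lookups = 0                  # number of cache runtime invocations
        self.misses = 0                   # number of simulated DMA refills

    def _line(self, addr):
        """Return the TCDM line holding word address addr, refilling on a miss."""
        self.lookups += 1
        line_no = addr // LINE_WORDS
        slot = line_no % NUM_LINES        # direct-mapped placement
        if self.tags[slot] != line_no:
            base = line_no * LINE_WORDS
            # The slice copy models a DMA transfer from external memory to TCDM.
            self.data[slot] = list(self.ext[base:base + LINE_WORDS])
            self.tags[slot] = line_no
            self.misses += 1
        return self.data[slot]

    def read_word(self, addr):
        """Word-level access: every word read goes through the cache runtime."""
        return self._line(addr)[addr % LINE_WORDS]

    def read_object(self, index):
        """Object-level access: one lookup yields the whole object."""
        return self._line(index * LINE_WORDS)


def sum_word_level(cache, index):
    """Sum the words of object `index`, one runtime lookup per word."""
    base = index * LINE_WORDS
    return sum(cache.read_word(base + i) for i in range(LINE_WORDS))


def sum_object_level(cache, index):
    """Sum the words of object `index` after a single runtime lookup."""
    obj = cache.read_object(index)
    return sum(obj)
```

Under these assumptions both helpers return the same sum and trigger the same single refill, but the word-level version invokes the cache runtime once per word (eight lookups here), while the object-level version invokes it once per object. That difference in lookup count is the kind of overhead the abstract refers to.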

Acknowledgements

This work has been supported by STMicroelectronics, by the EU FP7 Project vIrtical (GA n. 288574) and by the EU FP7 ERC Project MULTITHERMAN (GA n. 291125).

Author information

Corresponding author

Correspondence to Christian Pinto.

About this article

Cite this article

Pinto, C., Benini, L. A Novel Object-Oriented Software Cache for Scratchpad-Based Multi-Core Clusters. J Sign Process Syst 77, 77–93 (2014). https://doi.org/10.1007/s11265-014-0881-4

  • DOI: https://doi.org/10.1007/s11265-014-0881-4
