DynaCo: Dynamic Coherence Management for Tiled Manycore Architectures

International Journal of Parallel Programming

Abstract

Embedded system applications, with their inherently limited parallelism, rarely exploit all available processing resources in large DSM-based manycore architectures. From a cache coherence perspective, this provides an opportunity to move away from global coherence spanning all tiles, which does not scale well. Therefore, we favor a region-based cache coherence (RBCC) approach that enables coherence among a selectable cluster of tiles in accordance with application requirements. We present the design and hardware implementation of a flexibly configurable coherency region manager (CRM) that enables RBCC. We introduce two novel features that enhance RBCC, namely runtime coherency region re-configuration and RBCC-malloc(), which dynamically tailor coherence to the application working sets that are actually shared. Further, we propose, implement and evaluate additional CRM functions, such as a non-intrusive barrier synchronization mechanism and a false sharing resolution strategy, for our DSM-based manycore architecture. We have synthesized the CRM on an FPGA prototype for a 64-core system and observe a 38% reduction in BRAM utilization compared to a global coherence directory for regions with up to 32 cores. Experiments using a video streaming application reveal a speed-up of up to 42% compared to an alternative message-passing-based implementation. We also evaluate the benefits of runtime coherency region re-configuration using two scenarios and present a formal analysis of when a re-configuration is beneficial.
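To make the idea of region-based coherence concrete, the following is a minimal software sketch, not the CRM hardware described in the article. It assumes a hypothetical `RegionManager` class that holds the set of tiles currently in one coherency region and forwards invalidations only to sharers inside that region. All class, method, and parameter names are illustrative assumptions, and details such as flushing caches on re-configuration are deliberately left out.

```python
class RegionManager:
    """Toy model of a coherency region (hypothetical; not the paper's CRM design)."""

    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self.region = frozenset()  # tiles currently kept coherent with each other

    def configure(self, tiles):
        """(Re-)configure the region at runtime to the given tile IDs."""
        tiles = frozenset(tiles)
        if not all(0 <= t < self.num_tiles for t in tiles):
            raise ValueError("tile id out of range")
        self.region = tiles

    def invalidation_targets(self, writer, sharers):
        """Return the tiles that would be sent an invalidation when `writer`
        writes a cache line held by `sharers`. Tiles outside the region are
        skipped, which is the point of restricting coherence to a region."""
        return sorted(t for t in sharers if t != writer and t in self.region)


# Usage: a 16-tile system with a 4-tile region, later shrunk to 2 tiles.
crm = RegionManager(num_tiles=16)
crm.configure({0, 1, 2, 3})
targets_before = crm.invalidation_targets(writer=0, sharers={0, 1, 3})
crm.configure({0, 1})
targets_after = crm.invalidation_targets(writer=0, sharers={0, 1, 3})
```

In this sketch, shrinking the region reduces the set of tiles that take part in coherence traffic; a real system would also have to handle lines still cached by tiles that leave the region.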


Notes

  1. In our system, coherence messages and their acknowledgements are not re-ordered.

  2. Multiple coherence barriers can be supported by increasing the number of barrier and shadow registers per tile.

  3. For some applications, this can additionally contain state transfers.

Acknowledgements

This work was partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project Number 146371743-TRR 89: Invasive Computing.

Author information

Corresponding author

Correspondence to Akshay Srivatsa.

Ethics declarations

Conflict of interest

The authors would like to thank Sai Varun Brahmadevara, Li-Yu Peng and Miguel Montoya Rendon for their contributions as master and internship students at the Chair of Integrated Systems, TUM. We would also like to thank Sebastian Maier at the Computer Science 4 department, FAU, Erlangen-Nuremberg for his OS support.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Srivatsa, A., Mansour, M., Rheindt, S. et al. DynaCo: Dynamic Coherence Management for Tiled Manycore Architectures. Int J Parallel Prog 49, 570–599 (2021). https://doi.org/10.1007/s10766-020-00688-6


  • DOI: https://doi.org/10.1007/s10766-020-00688-6
