
Locality: The 3rd Wall and the Need for Innovation in Parallel Architectures

Conference paper

Architecture of Computing Systems (ARCS 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12800)

Abstract

In the past we have seen two major “walls” (memory and power) whose vanquishing required significant advances in architecture. This paper discusses evidence of a third wall dealing with data locality, which is prevalent in data-intensive applications where computation is dominated by memory access and movement, not flops. Such applications exhibit large, often persistent data sets with little reuse during computation and no predictable regularity, show significantly different scaling characteristics, and increasingly involve streaming. Further, as we move to highly parallel algorithms (such as those running in the cloud), these issues will only get worse. Solving such problems will take a new set of innovations in architecture. In addition to data on the new wall, this paper looks at one possible technique, the concept of migrating threads, and gives evidence of its potential value based on several benchmarks that have scaling difficulties on conventional architectures.
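To make the contrast concrete, the sketch below is a toy traffic model, not the paper's methodology: it compares the bytes moved when a thread on a conventional machine pulls remote cache lines against the bytes moved when a lightweight thread migrates to where the data lives, as in the Emu/Lucata style of architecture the paper evaluates. Every constant (nodelet count, cache-line size, thread-context size, lines touched per visit) is an illustrative assumption.

    # Toy traffic model: "move data to the thread" vs. "move the
    # thread to the data" for a random pointer-chasing workload.
    # All constants are illustrative assumptions, not measurements.
    import random

    NODELETS        = 64       # memory partitions ("nodelets")
    STEPS           = 100_000  # length of the random pointer chase
    CACHE_LINE      = 64       # bytes per remote cache-line fill (assumed)
    THREAD_CTX      = 200      # bytes per migrating thread context (assumed)
    LINES_PER_VISIT = 8        # distinct lines touched at each node (assumed)

    random.seed(1)
    path = [random.randrange(NODELETS) for _ in range(STEPS)]

    # Conventional model: the thread stays on nodelet 0 and, with no
    # reuse to help the cache, every visit to a remote nodelet pulls
    # all of that visit's lines across the network.
    pull_bytes = sum(LINES_PER_VISIT * CACHE_LINE for n in path if n != 0)

    # Migrating-thread model: the thread hops to the nodelet holding
    # the data, pays one one-way context move, then reads locally.
    here, hops = 0, 0
    for n in path:
        if n != here:
            hops += 1
            here = n
    migrate_bytes = hops * THREAD_CTX

    print(f"pull model:    {pull_bytes / 1e6:6.2f} MB moved")
    print(f"migrate model: {migrate_bytes / 1e6:6.2f} MB moved")

Under these assumptions migration wins whenever a visit touches more than THREAD_CTX / CACHE_LINE lines before moving on; it also replaces request/response round trips with one-way moves, a latency benefit the byte count above does not even credit.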

Notes

  1. https://crnch.gatech.edu/rogues-emu.

Acknowledgements

This work was supported in part by NSF grant CCF-1642280, and in part by the University of Notre Dame. We would also like to acknowledge the CRNCH Center at Georgia Tech for allowing us to use the Emu system there.

Author information

Correspondence to Peter M. Kogge.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kogge, P.M., Page, B.A. (2021). Locality: The 3rd Wall and the Need for Innovation in Parallel Architectures. In: Hochberger, C., Bauer, L., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2021. Lecture Notes in Computer Science, vol 12800. Springer, Cham. https://doi.org/10.1007/978-3-030-81682-7_1

  • DOI: https://doi.org/10.1007/978-3-030-81682-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81681-0

  • Online ISBN: 978-3-030-81682-7

  • eBook Packages: Computer Science, Computer Science (R0)
