Abstract
In the past we have seen two major “walls” (memory and power) whose vanquishing required significant advances in architecture. This paper discusses evidence of a third wall dealing with data locality, which is prevalent in data-intensive applications where computation is dominated by memory access and movement, not flops. Such applications exhibit large, often persistent data sets with little reuse during computation, no predictable regularity, and significantly different scaling characteristics, and streaming is becoming increasingly important to them. Further, as we move to highly parallel algorithms (such as those running in the cloud), these issues will only worsen. Solving such problems will require a new set of architectural innovations. In addition to presenting data on this new wall, the paper examines one possible technique, the concept of migrating threads, and gives evidence of its potential value based on several benchmarks that have scaling difficulties on conventional architectures.
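The trade-off behind migrating threads can be sketched with a back-of-the-envelope traffic model: a conventional thread pulls cache lines across the network to itself, while a migrating thread moves its own context to where the data lives. The constants and function names below are illustrative assumptions for this sketch, not figures or APIs from the paper.

```python
# Illustrative cost model (assumed constants, not measured values):
CACHE_LINE = 64     # bytes pulled per remote load on a conventional machine
THREAD_STATE = 256  # bytes moved per hop when a thread context migrates

def remote_fetch_traffic(n_accesses, reuse_per_line=1):
    """Network bytes when data moves to the thread: one line per access group."""
    return (n_accesses // reuse_per_line) * CACHE_LINE

def migrating_thread_traffic(n_hops):
    """Network bytes when the thread moves to the data: one context per hop."""
    return n_hops * THREAD_STATE

# A low-locality traversal touching 1,000 remote nodes, 8 accesses per node:
hops = 1000
accesses_per_node = 8
print(remote_fetch_traffic(hops * accesses_per_node))  # 512000 bytes of lines pulled
print(migrating_thread_traffic(hops))                  # 256000 bytes of thread state moved
```

With little or no reuse per cache line, moving the (small) thread state once per node can cost less than repeatedly hauling lines across the machine, which is the intuition the paper's benchmarks probe.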
Acknowledgements
This work was supported in part by NSF grant CCF-1642280, and in part by the University of Notre Dame. We would also like to acknowledge the CRNCH Center at Georgia Tech for allowing us to use the Emu system there.
© 2021 Springer Nature Switzerland AG
Kogge, P.M., Page, B.A. (2021). Locality: The 3rd Wall and the Need for Innovation in Parallel Architectures. In: Hochberger, C., Bauer, L., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2021. Lecture Notes in Computer Science(), vol 12800. Springer, Cham. https://doi.org/10.1007/978-3-030-81682-7_1
DOI: https://doi.org/10.1007/978-3-030-81682-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81681-0
Online ISBN: 978-3-030-81682-7