Abstract
In the past we have seen two major “walls” (memory and power) whose vanquishing required significant advances in architecture. This paper discusses evidence of a third wall dealing with data locality, which is prevalent in data-intensive applications where computation is dominated by memory access and movement, not flops. Such applications exhibit large, often persistent data sets with little reuse during computation, no predictable regularity, and significantly different scaling characteristics, and streaming is becoming increasingly important to them. Further, as we move to highly parallel algorithms (such as those running in the cloud), these issues will only worsen. Solving such problems will require a new set of architectural innovations. In addition to presenting data on this new wall, the paper examines one possible technique, the concept of migrating threads, and gives evidence of its potential value based on several benchmarks that have scaling difficulties on conventional architectures.
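The trade-off behind migrating threads can be sketched with a back-of-the-envelope traffic model: a conventional thread pulls cache lines across the network to itself, while a migrating thread moves its own context to where the data lives. The constants and function names below are illustrative assumptions for this sketch, not figures or APIs from the paper.

```python
# Illustrative cost model (assumed constants, not measured values):
CACHE_LINE = 64     # bytes pulled per remote load on a conventional machine
THREAD_STATE = 256  # bytes moved per hop when a thread context migrates

def remote_fetch_traffic(n_accesses, reuse_per_line=1):
    """Network bytes when data moves to the thread: one line per access group."""
    return (n_accesses // reuse_per_line) * CACHE_LINE

def migrating_thread_traffic(n_hops):
    """Network bytes when the thread moves to the data: one context per hop."""
    return n_hops * THREAD_STATE

# A low-locality traversal touching 1,000 remote nodes, 8 accesses per node:
hops = 1000
accesses_per_node = 8
print(remote_fetch_traffic(hops * accesses_per_node))  # 512000 bytes of lines pulled
print(migrating_thread_traffic(hops))                  # 256000 bytes of thread state moved
```

With little or no reuse per cache line, moving the (small) thread state once per node can cost less than repeatedly hauling lines across the machine, which is the intuition the paper's benchmarks probe.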
Acknowledgements
This work was supported in part by NSF grant CCF-1642280, and in part by the University of Notre Dame. We would also like to acknowledge the CRNCH Center at Georgia Tech for allowing us to use the Emu system there.
© 2021 Springer Nature Switzerland AG
Kogge, P.M., Page, B.A. (2021). Locality: The 3rd Wall and the Need for Innovation in Parallel Architectures. In: Hochberger, C., Bauer, L., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2021. Lecture Notes in Computer Science(), vol 12800. Springer, Cham. https://doi.org/10.1007/978-3-030-81682-7_1
DOI: https://doi.org/10.1007/978-3-030-81682-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81681-0
Online ISBN: 978-3-030-81682-7