Improving Address Translation in Multi-GPUs via Sharing and Spilling aware TLB Design

Authors:

Xulong Tang

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 1154 - 1168

https://doi.org/10.1145/3466752.3480083

Published: 17 October 2021

Abstract

In recent years, ever-growing application complexity and input dataset sizes have driven the popularity of multi-GPU systems as a desirable computing platform for many application domains. While employing multiple GPUs intuitively exposes substantial parallelism for application acceleration, the delivered performance rarely scales with the number of GPUs. One of the major challenges behind this is address translation efficiency. Many prior works focus on CPUs or single-GPU execution scenarios, while address translation in multi-GPU systems has received little attention. In this paper, we conduct a comprehensive investigation of address translation efficiency in both the “single-application-multi-GPU” and “multi-application-multi-GPU” execution paradigms. Based on our observations, we propose a new TLB hierarchy design, called least-TLB, that is tailored for multi-GPU systems and effectively improves TLB performance with minimal hardware overhead. Experimental results on 9 single-application workloads and 10 multi-application workloads indicate that the proposed least-TLB improves performance, on average, by 23.5% and 16.3%, respectively.

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

1322 pages

ISBN:9781450385572

DOI:10.1145/3466752

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '21

Sponsor:

SIGMICRO

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 18 - 22, 2021

Virtual Event, Greece

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 15
Total Downloads: 1,168
Downloads (Last 12 months): 334
Downloads (Last 6 weeks): 40

Reflects downloads up to 18 Feb 2025

Citations

Cited By

G. Ko, J. Lee, H. Kal, H. Lee, and W. Ro. 2025. REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems. Journal of Systems Architecture 160 (103339). Online publication date: Mar-2025. https://doi.org/10.1016/j.sysarc.2025.103339
K. Guo, D. Li, B. Luo, Y. Shen, K. Peng, N. Luo, S. Dai, C. Liang, J. Song, H. Yang, X. Zhang, Z. Mi, E. Witchel, A. Arpaci-Dusseau, C. Rossbach, and K. Keeton. 2024. VPRI: Efficient I/O Page Fault Handling via Software-Hardware Co-Design for IaaS Clouds. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 541–557. Online publication date: 4-Nov-2024. https://dl.acm.org/doi/10.1145/3694715.3695957
N. Akbarzadeh, S. Darabi, A. Gheibi-Fetrat, A. Mirzaei, M. Sadrosadati, and H. Sarbazi-Azad. 2024. H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8, 1 (1–28). Online publication date: 21-Feb-2024. https://dl.acm.org/doi/10.1145/3639038
Y. Wang, S. Perarnau, and A. Chien. 2024. UpDown: Combining Scalable Address Translation with Locality Control. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis. 1014–1024. Online publication date: 17-Nov-2024. https://dl.acm.org/doi/10.1109/SCW63240.2024.00141
B. Li, Y. Wang, T. Wang, L. Eeckhout, J. Yang, A. Jaleel, and X. Tang. 2024. STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). 309–323. Online publication date: 2-Nov-2024. https://doi.org/10.1109/MICRO61859.2024.00031
J. Park, O. Kwon, Y. Lee, S. Kim, G. Byeon, J. Yoon, P. Nair, and S. Hong. 2024. A Case for Speculative Address Translation with Rapid Validation for GPUs. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). 278–292. Online publication date: 2-Nov-2024. https://doi.org/10.1109/MICRO61859.2024.00029
Y. Feng, S. Na, H. Kim, and H. Jeon. 2024. Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 834–847. Online publication date: 29-Jun-2024. https://doi.org/10.1109/ISCA59077.2024.00065
Y. Wang, B. Li, A. Jaleel, J. Yang, and X. Tang. 2024. GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 1080–1094. Online publication date: 2-Mar-2024. https://doi.org/10.1109/HPCA57654.2024.00085
W. Huang, Y. Du, and M. Liu. 2023. GPU Performance Acceleration via Intra-Group Sharing TLB. Proceedings of the 52nd International Conference on Parallel Processing. 705–714. Online publication date: 7-Aug-2023. https://dl.acm.org/doi/10.1145/3605573.3605593
S. Hwang, S. Lee, J. Kim, H. Kim, and J. Huh. 2023. mNPUsim: Evaluating the Effect of Sharing Resources in Multi-core NPUs. 2023 IEEE International Symposium on Workload Characterization (IISWC). 167–179. Online publication date: 1-Oct-2023. https://doi.org/10.1109/IISWC59245.2023.00018
