DOI: 10.1145/3123939.3123975
research-article
Public Access

Mosaic: a GPU memory manager with application-transparent support for multiple page sizes

Published: 14 October 2017

ABSTRACT

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.
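How painful a TLB miss is depends on how much memory the TLB can cover before missing. A back-of-the-envelope reach calculation (the entry count below is an illustrative assumption, not a figure from the paper) shows how strongly page size drives that coverage:

```python
# Illustrative TLB-reach arithmetic. The 64-entry TLB is a hypothetical
# example size, not a number taken from the paper.
def tlb_reach(entries: int, page_size: int) -> int:
    """Total memory covered by a fully populated TLB, in bytes."""
    return entries * page_size

KB, MB = 1024, 1024 * 1024
entries = 64  # hypothetical TLB size

print(tlb_reach(entries, 4 * KB) // KB)   # 4KB base pages  -> 256 (KB of reach)
print(tlb_reach(entries, 2 * MB) // MB)   # 2MB large pages -> 128 (MB of reach)
```

With identical hardware budgets, moving from 4KB to 2MB pages multiplies reach by 512, which is why larger pages cut TLB misses so effectively.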

In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages.
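The migration-free coalescing idea can be made concrete with a small sketch (the page sizes and the dictionary page-table representation are illustrative assumptions, not Mosaic's actual data structures): if the 512 base pages backing a virtual large page already sit in one aligned, contiguous physical region, they can be promoted to a 2MB mapping with no data movement.

```python
# Sketch of the coalescing-eligibility test implied by the text: base pages
# that are contiguous in both virtual and physical memory can be merged into
# a large page in place. Sizes and the mapping format are illustrative.
BASE_PAGE = 4 * 1024
LARGE_PAGE = 2 * 1024 * 1024
PAGES_PER_LARGE = LARGE_PAGE // BASE_PAGE  # 512

def can_coalesce_in_place(page_table: dict, vbase: int) -> bool:
    """True if the base pages forming the large page at virtual address
    `vbase` map to one contiguous, large-page-aligned physical frame."""
    if vbase % LARGE_PAGE:
        return False  # virtual address must be large-page aligned
    first = page_table.get(vbase)
    if first is None or first % LARGE_PAGE:
        return False  # physical frame must also be aligned
    return all(
        page_table.get(vbase + i * BASE_PAGE) == first + i * BASE_PAGE
        for i in range(PAGES_PER_LARGE)
    )

# Example: 512 base pages allocated en masse into one aligned 2MB frame.
pt = {i * BASE_PAGE: LARGE_PAGE + i * BASE_PAGE for i in range(PAGES_PER_LARGE)}
print(can_coalesce_in_place(pt, 0))  # True: coalescing needs no migration
```

Under a conventional allocator the `all(...)` check usually fails and coalescing must first migrate base pages; the en masse allocation behavior noted above is what lets an allocator make the check pass by construction.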

We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.
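The allocation strategy described above can be sketched as a toy model (the frame count, data structures, and 512-slot frame layout are illustrative assumptions, not Mosaic's implementation): each large-page frame belongs to exactly one protection domain, and consecutive requests from the same application fill a frame's base-page slots contiguously.

```python
# Toy model of a contiguity-preserving, single-domain-per-frame allocator.
# Returns (frame, slot) pairs; 512 slots per 2MB frame as in the sketch above.
class FrameAllocator:
    SLOTS = 512  # base pages per large-page frame (2MB / 4KB)

    def __init__(self, num_frames: int):
        self.free_frames = list(range(num_frames))
        self.owner = {}      # frame -> protection domain (app id)
        self.next_slot = {}  # frame -> next free base-page slot

    def alloc_base_pages(self, app: str, count: int):
        """Allocate `count` virtually contiguous base pages for `app`,
        packing them contiguously into frames owned only by `app`."""
        out = []
        while count:
            # Reuse a partially filled frame already owned by this app.
            frame = next((f for f, a in self.owner.items()
                          if a == app and self.next_slot[f] < self.SLOTS), None)
            if frame is None:  # otherwise claim a fresh frame for this domain
                frame = self.free_frames.pop(0)
                self.owner[frame] = app
                self.next_slot[frame] = 0
            take = min(count, self.SLOTS - self.next_slot[frame])
            start = self.next_slot[frame]
            out += [(frame, start + i) for i in range(take)]
            self.next_slot[frame] += take
            count -= take
        return out

alloc = FrameAllocator(num_frames=4)
print(alloc.alloc_base_pages("appA", 3))  # [(0, 0), (0, 1), (0, 2)]
print(alloc.alloc_base_pages("appB", 1))  # [(1, 0)]: never shares appA's frame
```

Because every frame holds pages from a single domain and slots fill in order, any fully (or partially, with the in-place mechanism) populated frame is immediately eligible for large-page promotion without migrating data.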


Published in

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017, 850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939
Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions, 22%
