Abstract
As the real-world datasets processed on CPUs grow, address translation has become a significant performance bottleneck. To translate virtual addresses into physical addresses, modern operating systems perform several levels of page table walks (PTWs) in memory. Translation look-aside buffers (TLBs) act as caches that keep recently used translation information. However, as datasets increase in size, both the TLB miss rate and the overhead of PTWs worsen, causing severe performance bottlenecks. Using a diverse set of workloads, we show that PTW overhead consumes an average of 20% of application execution time.
In this paper, we propose CoPTA, a technique that speculates the memory address translation upon a TLB miss to hide the PTW latency. Specifically, we show that the operating system tends to map contiguous virtual memory pages to contiguous physical pages. Using a real machine, we show that the Linux kernel can automatically defragment physical memory and create larger chunks for contiguous mapping, particularly when transparent huge page support is enabled. Based on this observation, we devise a speculation mechanism that, upon a miss, finds nearby entries present in the TLB and predicts the translation of the missed address assuming contiguous address allocation. This allows CoPTA to speculatively execute instructions without waiting for the PTW to complete. We run the PTW in parallel, compare the speculated and the translated physical addresses, and flush the pipeline upon a wrong speculation, using techniques similar to those used for handling branch mispredictions.
We comprehensively evaluate our proposal using benchmarks from three suites: SPEC CPU 2006 for server-grade applications, GraphBIG for graph applications, and the NAS benchmark suite for scientific applications. Using a trace-based simulation, we show an average address prediction accuracy of 82% across these workloads, resulting in a 16% performance improvement.
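The speculation idea described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design evaluated in the paper: the dictionary-based TLB, the 4 KiB page size, the search window of four neighboring pages, and the names `predict_ppn` and `translate` are all hypothetical choices made for illustration. On a miss, the model looks for a nearby virtual page already in the TLB, assumes the mapping is contiguous, and then checks the guess against the result of the page table walk.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def predict_ppn(tlb, vpn, window=4):
    """Guess the physical page number (PPN) for a missed virtual page number.

    Searches the closest neighbors first and assumes the missed page is
    physically contiguous with the neighbor that is found.
    """
    for delta in range(1, window + 1):
        for neighbor in (vpn - delta, vpn + delta):
            ppn = tlb.get(neighbor)
            if ppn is not None:
                guess = ppn + (vpn - neighbor)
                if guess >= 0:
                    return guess
    return None


def translate(tlb, page_table, vaddr, window=4):
    """Translate vaddr and report how the speculation fared.

    The page_table lookup stands in for the page table walk, which real
    hardware would run in parallel with speculative execution.
    """
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK
    if vpn in tlb:
        return (tlb[vpn] << PAGE_SHIFT) | offset, "hit"

    guess = predict_ppn(tlb, vpn, window)
    actual = page_table[vpn]  # result of the (modeled) page table walk
    tlb[vpn] = actual         # fill the TLB with the verified translation

    if guess is None:
        outcome = "no-prediction"
    elif guess == actual:
        outcome = "correct"     # speculative work can be kept
    else:
        outcome = "mispredict"  # hardware would flush the pipeline here
    return (actual << PAGE_SHIFT) | offset, outcome


# Virtual pages 0x10 and 0x11 are physically contiguous; 0x12 is not.
page_table = {0x10: 0x80, 0x11: 0x81, 0x12: 0x95}
tlb = {0x10: 0x80}
first = translate(tlb, page_table, 0x11234)
second = translate(tlb, page_table, 0x12000)
```

In this toy run, the first access is predicted correctly from the neighboring entry, while the second is a misprediction because page 0x12 breaks the contiguous pattern. The model always returns the verified address, mirroring the requirement that a wrong speculation must never become architecturally visible.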
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, Y. et al. (2020). CoPTA: Contiguous Pattern Speculating TLB Architecture. In: Orailoglu, A., Jung, M., Reichenbach, M. (eds) Embedded Computer Systems: Architectures, Modeling, and Simulation. SAMOS 2020. Lecture Notes in Computer Science, vol 12471. Springer, Cham. https://doi.org/10.1007/978-3-030-60939-9_5
DOI: https://doi.org/10.1007/978-3-030-60939-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60938-2
Online ISBN: 978-3-030-60939-9
eBook Packages: Computer Science, Computer Science (R0)