Abstract
Hardware transactional memory emerged to make parallel programming more accessible. Its main performance pitfall is that, on an abort, speculatively executed instructions are squashed and re-executed, and repeated conflicts ultimately force serialization. A significant fraction of aborts is due to conflicts (concurrent reads and writes to the same memory location performed by different threads). Our proposal aims to reduce conflict aborts by shortening the window of time during which transactional regions can suffer conflicts. We achieve this with software prefetch instructions inserted automatically at compile time. These prefetches are intended to bring the data each transaction needs from main memory into the cache before the transaction starts to execute, converting what would otherwise be long-latency cache misses into hits during the transaction. Our results show that the approach decreases the number of aborts by 30% on average and improves performance by up to 19% and 10% on two of the eight evaluated benchmarks. We provide insights into when the technique is beneficial given certain characteristics of the transactional regions, weigh the advantages and disadvantages of the approach, and discuss potential solutions to some of its limitations.
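The idea of prefetching ahead of a transaction can be illustrated with a minimal hand-written sketch. It is not the compiler pass described in the article, which inserts the prefetches automatically. All names here (`tx_begin`, `tx_end`, `account_t`, `transfer`) are hypothetical, and the transaction is emulated with a spinlock so that the example runs on machines without HTM; on real hardware these two calls would presumably map to primitives such as Intel RTM's `_xbegin`/`_xend` or Arm TME's `__tstart`/`__tcommit`, together with a fallback path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the HTM begin/commit primitives.
 * A spinlock emulates the transaction so the sketch runs without HTM. */
static atomic_flag tx_lock = ATOMIC_FLAG_INIT;

static void tx_begin(void)
{
    while (atomic_flag_test_and_set_explicit(&tx_lock, memory_order_acquire)) {
        /* spin until the emulated transaction can start */
    }
}

static void tx_end(void)
{
    atomic_flag_clear_explicit(&tx_lock, memory_order_release);
}

/* Hypothetical shared data touched inside the transaction. */
typedef struct {
    long balance;
} account_t;

static void transfer(account_t *accounts, size_t from, size_t to, long amount)
{
    assert(accounts != NULL);

    /* Prefetch phase, executed before the transaction begins.
     * __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin:
     * rw = 1 requests the line for writing, locality = 3 asks to keep it
     * in all cache levels. It is only a hint and never faults. */
    __builtin_prefetch(&accounts[from], 1, 3);
    __builtin_prefetch(&accounts[to], 1, 3);

    /* Transactional region: ideally both lines are already cached,
     * so the region is shorter and less exposed to conflicts. */
    tx_begin();
    accounts[from].balance -= amount;
    accounts[to].balance += amount;
    tx_end();
}
```

In this sketch the prefetched addresses are computed from the function arguments alone, so the prefetch phase does not read shared data. Whether the prefetched lines are still present when the transaction runs depends on the hardware and on what other threads do in the meantime.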
Notes
If a function is called in two different transactions, we create one AP version for each call context. AP versions are transaction-specific because the selection of instructions for each AP depends on how the memory updates performed within the function affect its callers.
Funding
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 819134), the Spanish MCIU and AEI, as well as the European Commission FEDER funds, under grant RTI2018-098156-B-C53, and the Swedish VR grant number 2016-05086.
Cite this article
Shimchenko, M., Titos-Gil, R., Fernández-Pascual, R. et al. Analysing software prefetching opportunities in hardware transactional memory. J Supercomput 78, 919–944 (2022). https://doi.org/10.1007/s11227-021-03897-z
DOI: https://doi.org/10.1007/s11227-021-03897-z