Embedded-TM: Energy and complexity-effective hardware transactional memory for embedded multicore systems

https://doi.org/10.1016/j.jpdc.2010.02.003

Abstract

We investigate how transactional memory can be adapted for embedded systems. We consider energy consumption and complexity to be driving concerns in the design of these systems and therefore adapt simple hardware transactional memory (HTM) schemes in our architectural design. We propose several different cache structures and contention management schemes to support HTM and evaluate them in terms of energy, performance, and complexity. We find that ignoring energy considerations can lead to poor design choices, particularly for resource-constrained embedded platforms. We conclude that with the right balance of energy efficiency and simplicity, HTM will become an attractive choice for future embedded system designs.

Introduction

High-end embedded systems are increasingly coming to resemble their general-purpose counterparts. Embedded systems such as smart phones, game consoles, “net-tops”, GPS-enabled automotive systems, and home entertainment centers are becoming ubiquitous. In the same way that smart phones are gradually usurping many of the functions of laptops, specialized high-end embedded systems will eventually displace many general-purpose systems. In time, such devices will affect every aspect of modern life, and their energy consumption profiles will have a broad economic impact.

Like their general-purpose counterparts, and for many of the same energy-related reasons, embedded systems are turning to multicore architectures. This switch has profound implications for software, which must now manage concurrent activities. It is well established that traditional synchronization mechanisms such as locks have substantial drawbacks. Transactional memory [21] has emerged as a promising alternative.

Here, we investigate how transactional memory can be adapted for embedded systems. We then describe different designs for our Embedded-TM architecture. Transactional memory for embedded systems makes different demands than transactional memory for general-purpose systems. The principal difference is the central importance of energy consumption: although embedded systems are becoming more sophisticated, they are and will continue to be energy-constrained, either because they run on batteries, or simply because energy consumption is increasingly a concern for systems at all levels. We are the first to use energy consumption as a guide for designing transactional memory mechanisms. While most of the earlier work on transactional memory has neglected the question of energy consumption, we claim that it should be one of the driving concerns in transactional memory design. Taking energy into account requires revisiting and revising many widely accepted assumptions, as well as widely accepted architectures.

Because embedded systems are energy-constrained, there is an overriding need for simplicity: many techniques suited for general-purpose systems, such as out-of-order execution or hardware multithreading, are too complex and power-hungry for today’s embedded systems. Any realistic transactional memory design for embedded systems must make do by combining simple components. We are willing to propose minor changes to existing standards, but not (what we consider) radical changes.

The need for energy efficiency and simplicity makes software transactional memory (STM) unattractive. (Klein et al. [23] provide an analysis of the energy costs of a typical STM system.) For most embedded applications, it is unacceptable, in terms of both performance and energy consumption, to place a software “barrier” at each memory access. Indeed, embedded applications often run without an operating system. By contrast, we will see that a simple hardware transactional memory (HTM) can both enhance performance and conserve energy.
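
As a purely illustrative comparison, the sketch below contrasts the two approaches in C; the stm_read/stm_write barriers and the htm_begin/htm_end boundaries are placeholder stubs introduced for this example, not the interface of Embedded-TM or of any existing library.

```c
/* Purely illustrative comparison of per-access STM barriers with an HTM
 * region instrumented only at its boundaries. The four helpers below are
 * placeholder stubs, not the interface of Embedded-TM or of any library. */

static int  stm_read(int *addr)             { return *addr; }  /* stands in for a software read barrier  */
static void stm_write(int *addr, int value) { *addr = value; } /* stands in for a software write barrier */
static void htm_begin(void) { /* hardware would checkpoint state here  */ }
static void htm_end(void)   { /* hardware would attempt to commit here */ }

void transfer_stm(int *from, int *to, int amount)
{
    /* Every shared access pays the cycle and energy cost of a barrier call. */
    stm_write(from, stm_read(from) - amount);
    stm_write(to,   stm_read(to)   + amount);
}

void transfer_htm(int *from, int *to, int amount)
{
    /* Only the boundaries are instrumented; the loads and stores inside run
     * as ordinary cache accesses tracked by the transactional hardware. */
    htm_begin();
    *from -= amount;
    *to   += amount;
    htm_end();
}
```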

While hardware transactional memory makes fewer resource demands than software transactional memory, limitations on cache size and associativity bound transactions’ sizes and durations. While proposals exist for “unbounded” transactional memory [2], [33] that allow transactions to survive certain kinds of resource exhaustion, these schemes are much too complex to be considered for embedded systems. For most embedded systems, however, applications’ resource requirements are well understood, and transactions that exceed those expectations are likely to be rare. Nevertheless, it is important to understand how to structure caches for HTM in embedded systems to maximize transaction sizes without compromising performance or increasing energy consumption. We will describe several such designs.

We evaluate HTM designs using three criteria: energy, performance, and complexity. Sometimes these criteria reinforce one another, and sometimes not. Here, we investigate a sequence of HTM designs, starting from a simple baseline, and moving on to a sequence of redesigns, each intended to address a specific problem limiting energy efficiency and performance. Structuring the presentation of Embedded-TM as a sequence of redesigns makes it possible to quantify the contribution of each incremental improvement.

We take as the baseline Embedded-TM an HTM based on a simple cache architecture [13], [21] in which non-transactional data is stored in a large L1 cache, and a smaller, fully associative transactional cache stores the data accessed within a transaction. The principal drawback of this architecture is that the transactional cache consumes too much energy. As a first line of defense, we consider how to conserve energy by powering down the cache without adversely affecting performance.
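
The following C sketch models such a small, fully associative transactional cache; the sizes, field names, and lookup loop are illustrative assumptions rather than the parameters of the evaluated configuration, but they make explicit why fully associative lookup is costly in energy.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal model of a small, fully associative transactional cache sitting
 * alongside the L1. Sizes and field names are illustrative assumptions,
 * not the parameters of the evaluated Embedded-TM configuration. */

#define TC_LINES      16   /* every entry is a "way": no index bits        */
#define TC_LINE_BYTES 32

typedef struct {
    bool     valid;
    bool     written;             /* holds a speculative store             */
    uint32_t tag;                 /* full tag; a FA cache has no set index  */
    uint8_t  data[TC_LINE_BYTES];
} tc_line_t;

typedef struct {
    tc_line_t line[TC_LINES];
} tx_cache_t;

/* Fully associative lookup: every valid line's tag is compared against the
 * request. In hardware this means TC_LINES parallel tag comparators, which
 * is why the structure must stay small and be powered down when idle. */
tc_line_t *tc_lookup(tx_cache_t *tc, uint32_t addr)
{
    uint32_t tag = addr / TC_LINE_BYTES;
    for (int i = 0; i < TC_LINES; i++)
        if (tc->line[i].valid && tc->line[i].tag == tag)
            return &tc->line[i];
    return NULL;
}

/* Abort: all speculative state is discarded by clearing the valid bits. */
void tc_abort(tx_cache_t *tc)
{
    for (int i = 0; i < TC_LINES; i++)
        tc->line[i].valid = false;
}
```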

Another drawback of the baseline Embedded-TM architecture is the limited size of the transactional cache. Any transaction whose data set cannot fit in that cache cannot complete, and must continue in a less-efficient serial mode described below. To alleviate this problem, we consider an alternative design in which both transactional and non-transactional data are kept together in the L1 cache. The L1 cache is substantially larger than the transactional cache, and the unified design eliminates the need to maintain coherence across two same-level caches.
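
A single additional state bit per L1 line is enough to model this unified organization. The sketch below, again with illustrative field names and sizes, marks lines touched inside a transaction and shows how commit and abort clear that speculative state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a unified L1 in which transactional and non-transactional data
 * share the same sets: one extra bit per line marks data touched inside the
 * running transaction. Field names and sizes are illustrative assumptions. */

#define L1_SETS 256
#define L1_WAYS 4

typedef struct {
    bool     valid;
    bool     dirty;
    bool     transactional;   /* accessed by the running transaction          */
    bool     tx_written;      /* written speculatively inside the transaction */
    uint32_t tag;
} l1_line_t;

typedef struct {
    l1_line_t way[L1_SETS][L1_WAYS];
} l1_cache_t;

/* Commit: speculative data becomes ordinary dirty data and the per-line
 * transactional marks are cleared (a flash-clear in hardware). */
void l1_commit(l1_cache_t *c)
{
    for (int s = 0; s < L1_SETS; s++)
        for (int w = 0; w < L1_WAYS; w++) {
            l1_line_t *ln = &c->way[s][w];
            if (ln->transactional && ln->tx_written)
                ln->dirty = true;
            ln->transactional = ln->tx_written = false;
        }
}

/* Abort: speculatively written lines are invalidated so the pre-transaction
 * values are re-fetched; read-only transactional lines simply drop the mark. */
void l1_abort(l1_cache_t *c)
{
    for (int s = 0; s < L1_SETS; s++)
        for (int w = 0; w < L1_WAYS; w++) {
            l1_line_t *ln = &c->way[s][w];
            if (ln->tx_written)
                ln->valid = false;
            ln->transactional = ln->tx_written = false;
        }
}
```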

While this design supports larger transactions, it is still limited by resource constraints. To keep energy consumption down, the L1 must have limited associativity, so a transaction unlucky enough to overflow a cache set must run in serial mode. We address this problem by introducing a small victim cache to catch transactional entries evicted from the main cache [15]. Although we are back to a two-cache architecture, the victim cache is needed only when the main cache overflows, so it can be small, and powered down for longer durations.
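
The eviction path can be sketched as follows, under assumed sizes and a hypothetical fall_back_to_serial_mode() hook: a transactionally marked line displaced from the L1 is parked in the victim cache, and only when the victim cache itself is full does the transaction give up and restart serially.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative eviction path for the transactional victim cache. The sizes,
 * the line type, and the serial-mode hook are assumptions made for this
 * sketch, not the evaluated Embedded-TM parameters. */

#define VC_LINES 8

typedef struct {
    bool     valid;
    bool     transactional;
    uint32_t tag;
} line_t;

typedef struct {
    line_t entry[VC_LINES];
    bool   powered_on;            /* stays off until the first overflow */
} victim_cache_t;

static bool serial_mode_requested = false;
static void fall_back_to_serial_mode(void) { serial_mode_requested = true; } /* hypothetical hook */

/* Called when L1 replacement selects 'evicted'. Non-transactional lines take
 * the normal write-back path; transactional ones must be retained on chip. */
void on_l1_eviction(victim_cache_t *vc, line_t *evicted)
{
    if (!evicted->transactional)
        return;                           /* ordinary write-back */

    if (!vc->powered_on)
        vc->powered_on = true;            /* wake the victim cache on demand */

    for (int i = 0; i < VC_LINES; i++) {
        if (!vc->entry[i].valid) {
            vc->entry[i] = *evicted;      /* keep the speculative line on chip */
            evicted->valid = false;
            return;
        }
    }
    /* Victim cache full: the transaction's footprint exceeds all on-chip
     * transactional storage, so it restarts in the serial fallback mode. */
    fall_back_to_serial_mode();
}
```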

As often happens, alleviating one problem exposes another. Transactions can also be prevented from making progress by data conflicts that occur when two transactions access the same memory location, and at least one access is a write. The first approach we consider, called eager conflict resolution, works well when transactions have few data conflicts, but less well when transactions have many data conflicts.
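
Under eager conflict resolution, the conflict is handled the moment a coherence request from another core hits a line in the local transaction's footprint. The sketch below, with illustrative types and a simple policy in which the local transaction is aborted, shows the conflict rule and the immediate resolution.

```c
#include <stdbool.h>

/* Sketch of eager conflict detection at snoop time. The types and the simple
 * policy of aborting the local transaction are illustrative; the point is
 * that resolution happens immediately, when the remote request is observed. */

typedef enum { REQ_READ, REQ_WRITE } req_type_t;

typedef struct {
    bool transactional;   /* line is in the local transaction's footprint */
    bool tx_written;      /* local transaction wrote it speculatively     */
} line_state_t;

/* A conflict exists when two transactions touch the same line and at least
 * one of the accesses is a write. */
static bool conflicts(const line_state_t *ln, req_type_t remote)
{
    if (!ln->transactional)
        return false;
    return ln->tx_written || remote == REQ_WRITE;
}

/* Eager policy: resolve on the spot, here by aborting the local transaction
 * so the remote request can proceed. Returns true if an abort was triggered. */
bool snoop_eager(line_state_t *ln, req_type_t remote, void (*abort_local)(void))
{
    if (conflicts(ln, remote)) {
        abort_local();                    /* decision taken immediately */
        return true;
    }
    return false;
}
```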

We examine two approaches to the problem of high-conflict transactions: a brute-force approach in which a transaction that fails to make progress is eventually restarted in serial mode, and a more complicated approach in which conflicts are resolved in a “lazy” manner. Lazy conflict resolution postpones the decision on which transactions to abort until a later point, when more information about the detected conflicts is available, thus potentially increasing concurrency.
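
A lazy scheme, by contrast with the eager one sketched above, only records the conflicts it observes and chooses winners later; the bitmask bookkeeping and committer-wins rule in the sketch below are assumptions made for illustration, not the exact Embedded-TM policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of lazy conflict resolution: conflicts observed during execution are
 * only recorded, and the winner is chosen when a transaction tries to commit.
 * The bitmask bookkeeping and committer-wins rule are assumptions made for
 * illustration, not the exact Embedded-TM policy. */

#define MAX_CORES 8

typedef struct {
    int      id;                 /* core / transaction identifier        */
    bool     running;
    uint32_t conflicts_with;     /* bit i set: conflict seen with core i */
} tx_state_t;

/* During execution: just remember the conflict and let both keep running. */
void record_conflict(tx_state_t *local, int remote_core)
{
    local->conflicts_with |= (uint32_t)1 << remote_core;
}

/* At commit: the committer wins, and every transaction it conflicted with is
 * aborted now, when the full set of detected conflicts is known. */
void lazy_commit(tx_state_t *local, tx_state_t table[MAX_CORES])
{
    for (int i = 0; i < MAX_CORES; i++) {
        if ((local->conflicts_with >> i) & 1u) {
            if (table[i].running && table[i].id != local->id)
                table[i].running = false;   /* stands in for aborting that core */
        }
    }
    local->running = false;                 /* committed */
    local->conflicts_with = 0;
}
```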

Each approach is successful in some circumstances. The brute-force approach is attractive for its simplicity and wide range of effectiveness, and it works moderately well most of the time. The lazy mode algorithm incurs a higher overhead cost, sometimes penalizing low-conflict transactions, but is effective for high-conflict transactions.

We use a cycle-accurate simulator to investigate how well each of these Embedded-TM designs works on a range of benchmarks, as well as how simple TM designs compare to locking. Confirming prior observations [13], we find that even simple TM designs outperform locking with respect to both energy and performance. Each of the successive designs we consider improves the energy-performance product of most benchmarks. This improvement is workload-dependent in the sense that it is possible to find some configuration of some benchmark where any particular design does not improve on its predecessor, but overall each successive redesign is an improvement.

These results confirm that ignoring energy considerations can lead to poor design choices, particularly for resource-constrained embedded platforms. As architectures progress, and the demands of embedded platforms evolve, the further design and evaluation of energy-efficient cache architectures for HTM remains a promising direction for further work.

Section snippets

Architecture

We developed and tested our Embedded-TM designs using the MPARM simulation framework [3], [26], a cycle-accurate, multi-processor simulator written in SystemC. MPARM models any simple instruction set architecture with a complex memory hierarchy (supporting, for example, caches, scratch pad memories, and multiple types of interconnects). MPARM also includes cycle-accurate power models for many of its simulated devices. The power models reflect a 0.13 μm technology provided by STMicroelectronics…

Experimental results

In this section we evaluate our proposed Embedded-TM design using a mix of applications. We first describe the benchmarks used in our experiments as well as our experimental setup, followed by a detailed discussion of our results.

Related work

There are many mechanisms for synchronizing access to shared memory. Today, the two most prominent are locks and transactions. While most of the literature evaluates these proposals with respect to performance and ease of use, we focus here on a third criterion important for embedded devices: energy efficiency.

Prior work includes techniques for increasing the efficiency of lock-based synchronization for real-time embedded systems. Tumeo et al. [37] proposed new techniques for efficient…

Conclusions

Like general-purpose systems, today’s embedded systems are adopting multicore architectures. In the medium term, advances in technology will provide increased parallelism, but not increased single-thread performance. System designers and software engineers can no longer rely on increasing clock speed to enable ever more ambitious applications. Instead, they must learn to make effective use of increasing parallelism. Transactional memory is an attractive way to structure concurrent programs.

References (40)

  • AMBA, ARM Ltd. The advanced microcontroller bus architecture (AMBA) homepage....
  • C.S. Ananian, K. Asanovic, B.C. Kuszmaul, C.E. Leiserson, S. Lie, Unbounded transactional memory, in: International...
  • F. Angiolini, J. Ceng, R. Leupers, F. Ferrari, C. Ferri, L. Benini, An integrated open framework for heterogeneous...
  • R.I. Bahar, G. Albera, S. Manne, Power and performance tradeoffs using various caching strategies, in: International...
  • R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, P. Marwedel, Scratchpad memory: Design alternative for cache...
  • C. Blundell, J. Devietti, E.C. Lewis, M. Martin, Making the fast case common and the uncommon case simple in unbounded...
  • J. Bobba, K.E. Moore, H. Volos, L. Yen, M.D. Hill, M.M. Swift, D.A. Wood, Performance pathologies in hardware...
  • L. Ceze, J. Tuck, C. Cascaval, J. Torrellas, Bulk disambiguation of speculative threads in multiprocessors, in:...
  • H. Cho, B. Ravindran, E.D. Jensen, Lock-free synchronization for dynamic embedded real-time systems, in: Design...
  • STMicroelectronics-Cortex, STMicroelectronics Cortex-M3 CPU....
  • P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, D. Nussbaum, Hybrid transactional memory, in: International...
  • A. Efthymiou, J.D. Garside, An adaptive serial-parallel cam architecture for low-power cache blocks, in: International...
  • C. Ferri, R.I. Bahar, T. Moreshet, A. Viescas, M. Herlihy, Energy efficient synchronization techniques for embedded...
  • C. Ferri et al., A hardware/software framework for supporting transactional memory in a MPSoC environment, ACM SIGARCH Computer Architecture News (2007)
  • C. Ferri, S. Wood, T. Moreshet, R.I. Bahar, M. Herlihy, Energy and throughput efficient transactional memory for...
  • Freescale-QE, Freescale low-power QE family processor....
  • J. Goodacre et al., Parallelism and the ARM instruction set architecture, Computer (2005)
  • M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, R.B. Brown, MiBench: A free, commercially...
  • L. Hammond et al., Programming with transactional coherence and consistency (TCC), ACM SIGOPS Operating Systems Review (2004)
  • M. Herlihy, E. Koskinen, Transactional boosting: A methodology for highly-concurrent transactional objects, in:...

    Cesare Ferri is an Electrical Engineering Ph.D. student at Brown University. His research interests concern the exploration of multiprocessing techniques for low-power embedded systems, and the development of design methods to improve the yield of 3D integrated circuits. He received his B.S. degree from the University of Bologna, Italy, in 2005.

    Samantha Wood is currently pursuing her A.B. in Computer Science at Bryn Mawr College in Pennsylvania. She was granted a Distributed Research Experience for Undergraduates (DREU) award by the Computing Research Association for the summer of 2009. During that time, she worked at Brown University, where she was mentored by Prof. Bahar.

    Tali Moreshet is an assistant professor in the Department of Engineering at Swarthmore College. Her research interests are in computer architecture and energy-efficient multiprocessor, many-core, and embedded systems. Her research is funded by NSF. Tali Moreshet earned a B.Sc. in Computer Science from the Technion, Israel Institute of Technology, and an M.Sc. (2003) and Ph.D. (2006) in Computer Engineering from Brown University.

    R. Iris Bahar received the B.S. and M.S. degrees in computer engineering from the University of Illinois, Urbana-Champaign, in 1986 and 1987, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Colorado, Boulder, in 1995. From 1987 to 1992, she was with Digital Equipment Corporation. Since 1996, she has been with the Division of Engineering, Brown University, Providence, RI, where she is currently an Associate Professor. Her research interests include computer architecture; computer-aided design for synthesis, verification, and low-power applications; and design, test, and reliability issues for nanoscale systems.

    Maurice Herlihy received an A.B. degree in Mathematics from Harvard University and a Ph.D. degree in Computer Science from MIT. He has been an Assistant Professor in the Computer Science Department at Carnegie Mellon University, a member of the research staff at Digital Equipment Corporation’s Cambridge (MA) Research Lab, and a consultant for Sun Microsystems. He is now a Professor of Computer Science at Brown University. His 1991 paper “Wait-Free Synchronization” won the 2003 Dijkstra Prize in Distributed Computing, and he shared the 2004 Goedel Prize for his 1999 paper “The Topological Structure of Asynchronous Computation”. He is a Fellow of the ACM.

    This work is supported in part by NSF grants CCF-0903295, CCF-0903384, and CCF-0811289 as well as SRC grant 2009-HJ-1983.
