1 Introduction

Large-scale multi- and many-core processors have to compromise between the scalability of the memory architecture, its space and power consumption, and the usability for application developers. Efficient memory interconnects are usually inherently non-uniform and their latency varies with the distance between core and memory while the peak throughput diminishes with growing distance. Therefore, tasks and their data should be placed close together in order to reduce latency and increase throughput but, at the same time, should be distributed in order to increase parallelism and balance the load over multiple bottlenecks [1].

Coherent caching layers further complicate the situation. Directory-based coherency protocols [2] as well as distributed shared caches [3, 4] employ global directory components that route requests to recent copies and coordinate global invalidation and updating. In order to resolve throughput bottlenecks at these components, several such components are distributed across the network, and the request load should be spread uniformly among them.

Non-Uniform Memory Architectures group memory channels, directories, and compute cores such that an almost uniform low latency and high throughput is provided within each group—also known as NUMA domain or node. In order to utilise the system’s peak throughput, it is the application’s responsibility to balance data and compute tasks across these domains. This requires essentially the same strategies as in distributed systems, for example a domain decomposition with bin-packing for load balancing. As a positive side effect, this also results in more localised coordination, which enables synchronisation with low latency and low congestion. However, while successful on medium-sized NUMA systems, the effort and the load-balancing challenge grow with the number of domains relative to the size of the shared memory.

A more convenient alternative is pseudo-Uniform Memory Architectures, which use hardware-based address interleaving, for example with cache line granularity, in order to uniformly distribute the load over many memory channels and coherence directories. Provided that the network can cope with the aggregated peak throughput, applications need to worry neither about throughput bottlenecks nor about the co-located placement of data and tasks.

Unfortunately, this is true only for throughput-bound computations on large-enough datasets: any synchronisation between cores is still dominated by the cache coherence latency, which depends on the distance between the involved cores and coherence directory. While the hardware’s interleaving has no mitigating effect on the usually small synchronisation variables, their seemingly random spatial placement leads to difficult-to-predict overheads and performance variations. For synchronisation, the convenient pseudo-uniformity becomes a layer of obfuscation [5]. A few badly positioned synchronisation variables can slow down the whole application. Analysing such performance bottlenecks is further impaired by placement-dependent variation between repeated runs of the same application outside of the developer’s control.

This paper studies the pseudo-uniform architecture of the Intel Xeon Phi Knights Corner (KNC) many-core processor [6] and derives strategies for the optimised placement of synchronisation variables and similar latency-bound objects. The KNC provides 59–61 cores with four hardware threads each, four memory controllers, and 64 cache coherence directories—all connected via a shared point-to-point ring network. Compared to previous Xeon processors, the path between a core, the responsible directory, and the destination cache or memory controller can be very long, which results in considerable placement-dependent latency variation.

To this end, we reconstruct a mapping from cache line addresses to neighbouring cores based on latency measurements and use this mapping to initialise a pUMA-aware cache line allocator. For a basic cache line ping-pong pattern, this pUMA-aware placement enabled a 3x speedup between adjacent cores.

The next section reviews related work with respect to memory architectures and locality awareness. Section 3 devises generic experiments to study the effects of interleaving across directories and memory channels. Section 4 then discusses the experiment results obtained on the Intel Xeon Phi KNC. Finally, Sect. 5 discusses the broader implications for placement and coordination on the KNC and similar pUMA architectures.

2 Related Work

This section surveys performance studies related to the Intel Xeon Phi KNC processor and uniform memory architectures. The last part reviews coordination strategies from NUMA systems with relevance to uniform memory architectures.

Studies Related to the Intel Xeon Phi KNC. The Larrabee architecture for visual computing [7] is the ancestor of the Knights Corner processors. The article proposes a many-core architecture based on simple x86 cores with SIMD short vector units, private L2 caches, and an on-chip ring network for cache coherency. In order to keep the ring latency small compared to the latency of the DRAM memory channels, multiple “short linked” rings are proposed without discussing the implications for the cache coherency. The authors point out that synchronisation between threads within a core is fast because of the shared L1 cache, whereas cross-core synchronisation is inherently much slower. Hence, computations that access the same data should be placed onto the same core.

Based on the available technical documentation and micro-benchmarks on the KNC 5110P, Ramos and Hoefler [8] provide a detailed overview of the KNC’s directory-based cache coherence and present a quantitative performance model for cross-core communication. Likewise, Fang et al. [9, 10] published extensive studies of the KNC. Both groups consider the average latency over a large number of cache lines and report similar results: reading from any other cache takes 243 cycles on average and reading from memory takes 318–346 cycles on average. The latency of reading a single cache line from another core’s cache is examined in [10]. There, a latency variation from 160–340 cycles depending on the partner core is visible. The authors note that the latency does not correlate with the distance between the two cores because of the distributed coherence directories.

Gerofi et al. [5] studied the “hidden non-uniformity” of the KNC processor with respect to reading from main memory. They show a 60% variation in latency when reading different cache lines from the main memory and propose a respective memory allocator that reduces this cache miss latency. The authors argue that such placement could speed up algorithms that exhibit difficult-to-predict access patterns, for example, because of recursive data structures like linked lists, trees, and graphs. Their evaluation demonstrates a 17–28% throughput improvement for an A* shortest path algorithm with optimised allocation of the graph nodes. In contrast to [5], the present paper focuses on cross-core communication, that is, the latency of accessing another core’s cache.

Other UMA and pUMA Systems. The IBM Cyclops processor [11] has 16 embedded memory banks and the contiguous address space can be interleaved over caches and memory banks in order to balance the congestion. A crossbar switch is used to provide uniform latency between all cores, caches, and memory banks. Similarly, the Oracle Sparc T5 processors [12] use a crossbar for uniform latency.

Multi-socket Intel Xeon processors are usually operated as NUMA systems with one (pseudo-)uniform domain per socket. However, the address interleaving is configurable and can span multiple sockets [13] by combining bits of the physical address into a 3-bit target index. The “low-order” interleave uses bits 6–8 as target, which distributes consecutive lines over adjacent targets, and the “low/mid-hash” interleave uses bits 6–8 exclusive-or bits 16–18. In addition, the “hemisphere” variant replaces the first target bit with an exclusive-or of the bits 6, 10, 13, and 19 in order to better distribute accesses with a fixed stride.
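For illustration, the following C sketch reproduces the three documented interleave variants. The bit positions follow the description above; the helper names and the interpretation of the “first target bit” as the least significant index bit are our own assumptions.

```c
#include <stdint.h>

/* Sketch of the 3-bit interleave targets documented in [13]; bit positions
 * follow the text, mapping the "first target bit" to index bit 0 is assumed. */
static inline unsigned bit(uint64_t a, unsigned i) { return (a >> i) & 1u; }

/* "low-order": consecutive cache lines map to adjacent targets. */
static unsigned target_low_order(uint64_t phys)
{
    return (phys >> 6) & 0x7;                       /* bits 6..8 */
}

/* "low/mid-hash": bits 6..8 XOR bits 16..18. */
static unsigned target_low_mid_hash(uint64_t phys)
{
    return ((phys >> 6) ^ (phys >> 16)) & 0x7;
}

/* "hemisphere": the first target bit becomes the XOR of bits 6, 10, 13, 19. */
static unsigned target_hemisphere(uint64_t phys)
{
    unsigned high = target_low_order(phys) & 0x6;   /* keep index bits 1..2 */
    unsigned h = bit(phys, 6) ^ bit(phys, 10) ^ bit(phys, 13) ^ bit(phys, 19);
    return high | h;
}
```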

To a limited degree, interleaving can be implemented by software. The processor’s virtual address spaces can be used for interleaving on page granularity [1, 14] and applications can distribute the placement of their data structures [15].

The Tilera Tile processors use a distributed shared L2 cache with a local cache at each core [4]. Requests are routed to the line’s home cache, which is configured on page-size granularity. While interleaving over multiple L2 components is possible, synchronisation variables can simply be allocated in dedicated pages with known placement.

Coordination in NUMA Systems. Alongside the ratio of parallel to sequential computations, the scalability depends considerably on the overhead associated with distributing tasks across threads and synchronising the actions of concurrently active tasks. This overhead depends on the communication latency and the congestion on memory channels and network [1] and, thus, also on the contention, that is, the number of threads competing for a shared resource [16].

Some NUMA strategies reduce the latency by moving shared variables closer to their threads. One example is frequent polling on locally cached flags and rare signalling to remote flags as done by queue locks [17, 18] and work stealing [19]. Tightly related are strategies that reduce contention by distributing the load over multiple peers. Examples are the replication of services [20] and hierarchically distributed locks [21]. Software Combining generalises both aspects by combining multiple local accesses into fewer remote messages [16, 17, 22].

Finally, some strategies reduce the data migration between NUMA domains, for example, by keeping related tasks in the same domain as in hierarchical work stealing [19], preferring threads of the same domain as in cohort locking [23], or moving tasks to specific domains as in delegation locks [24].

3 Measuring Latency: Reading from Caches vs. Memory

Latency-bound phases can be accelerated by reducing the stall time when reading from main memory with unpredictable access patterns (like [5]) and by reducing the latency when synchronising nearby threads via shared variables. Neither aspect can be mitigated by hardware or software prefetching. The objective therefore is to reduce the latency by placing the data into cache lines that are locally managed and stored. Unfortunately, the pUMA address interleaving, while balancing the congestion for improved throughput, obfuscates the actual placement. In the absence of documentation about the interleaving, latency measurements can uncover sufficient information for a pUMA-aware allocator, for example by assigning lines to the cores with the lowest latency. This section devises latency measurements that provide such information.

Assuming a processor with cache coherence based on a shared distributed directory and private caches per core, the latency depends on the distances between the client core (C), the responsible directory component (D), and the remote cache that currently owns the line (O) or, respectively, the responsible memory channel (M). The responsible directory and memory channel are selected by the hardware’s interleaving scheme. The directory tracks the sharing state of previously accessed lines and routes read requests accordingly to the current owner cache or to a memory channel. Similar to Ramos and Hoefler [8], two cases can be distinguished, as illustrated in Fig. 1: a Cache Read is routed to the current owner cache (O) and a Memory Read is routed to the off-chip memory (M) because the line is invalid (not present) in all caches.

Fig. 1. Communication path for reading a line from another cache or the memory.

The read latency for both cases is \(d_{C,D} + d_{D,O/M} + d_{O/M,C} + o\), where \(d_{x,y}\) is the latency introduced by the network between x and y, and o is the processing overhead for cache, directory, and memory lookups. The network latency grows with the distance and the link congestion, whereas the processing overhead grows with the contention. The unwanted influence of congestion and contention can be circumvented by recording the minimum latency over multiple measurements and by putting all unneeded cores to sleep.

Cache Read Benchmark. Given the address of a cache line, the directory D is fixed while C and O can be chosen. Intuitively, any two cores C and O that minimise the latency for a fixed line must be neighbours in the network. For such pairs, the latency is approximately \(2 d_{C,D} + o\) and can be used to study the placement of the directories relative to the cores. Basically, each line can be assigned to the core that has the lowest read latency with one of its neighbours.

Similar to [8], the measurement proceeds as follows for each line and client core: An arbitrary neighbour core writes to the line in order to become the owner (O) and invalidate the line in all other caches. Then, it sets a helper flag in an unrelated line to notify the client core (C). This core (C) then measures the time needed for reading from the cache line. The n smallest latency values and the corresponding core IDs are recorded inside each line.
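A minimal sketch of one such measurement on generic x86 is shown below; all identifiers are illustrative, the timing uses the rdtsc counter, and any KNC-specific serialisation or back-off would have to be added for the real benchmark.

```c
#include <stdint.h>

/* Illustrative sketch of one Cache Read measurement (not the actual
 * benchmark code). `line` is the measured cache line, `flag` lives in
 * an unrelated cache line. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Executed on the neighbour core O. */
void owner_prepare(volatile uint64_t *line, volatile int *flag)
{
    *line = 1;               /* write: become owner, invalidate other caches */
    __sync_synchronize();
    *flag = 1;               /* notify the client core C */
}

/* Executed on the client core C; returns the read latency in cycles. */
uint64_t client_measure(volatile uint64_t *line, volatile int *flag)
{
    while (*flag == 0)
        ;                    /* wait until O owns the line */
    uint64_t t0 = rdtsc();
    (void)*line;             /* timed read from the remote cache */
    uint64_t t1 = rdtsc();
    return t1 - t0;
}
```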

The basic benchmark can be accelerated by considering two lines and two adjacent cores: Each line contains one flag for notification and measurement, and each line is initially owned by one of the cores. One core (C) measures the time needed for accessing the other core’s flag using an atomic fetch-and-add, while the other core (O) polls the same flag using an atomic fetch-and-add with zero increment. Thus, the line’s ownership is transferred just once for the measurement and immediately back to the other core (O) due to the polling. The other core (O) is notified about the finished measurement by seeing the incremented value. Then, both cores swap their roles (C↔O) and operate on the other cache line.
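A sketch of this accelerated scheme could look as follows; it assumes that the GCC-style atomic builtins compile to locked fetch-and-add instructions, reuses the illustrative rdtsc helper from above, and all names are our own.

```c
#include <stdint.h>

extern uint64_t rdtsc(void);   /* as in the previous sketch */

/* Each flag occupies its own cache line and is initially owned by one
 * of the two adjacent cores. */
struct line { volatile uint64_t flag; char pad[56]; } __attribute__((aligned(64)));

/* Core C: one timed fetch-and-add on the line owned by the partner;
 * this pulls the ownership over exactly once. */
uint64_t measure_remote(struct line *remote)
{
    uint64_t t0 = rdtsc();
    __sync_fetch_and_add(&remote->flag, 1);
    uint64_t t1 = rdtsc();
    return t1 - t0;
}

/* Core O: polls its own flag with a zero increment, so the ownership
 * returns immediately and the line never enters a shared state; the
 * incremented value signals that the partner's measurement finished. */
void poll_own(struct line *own, uint64_t old_value)
{
    while (__sync_fetch_and_add(&own->flag, 0) == old_value)
        ;
}
```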

Memory Read Benchmark. Given the address of a cache line, both D and M are fixed while the core C can be chosen freely. When selecting a core with minimal distance to the directory, the latency is approximately \(2 d_{C,M} + o\) and can be used to study the placement of the memory channels relative to directories. By taking the smallest memory read latency over all cores, the best core for each line can be found without needing to know the responsible directory.

Similar to [5], the measurement proceeds as follows for each line and core: The core (C) writes to the line in order to invalidate it in all other caches and then uses wbinvd or a similar instruction to write the line back to main memory. Then, the time needed for reading the line is measured. The n smallest latency values and the corresponding core IDs are recorded inside each line.
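A minimal sketch of one such measurement is shown below; it runs in kernel mode because wbinvd is privileged and flushes the whole cache, whereas the actual benchmark may rely on a finer-grained eviction. The rdtsc helper is the same illustrative one as before.

```c
#include <stdint.h>

extern uint64_t rdtsc(void);   /* as in the previous sketches */

/* Sketch of one Memory Read measurement on core C (kernel mode). */
uint64_t memory_read_latency(volatile uint64_t *line)
{
    *line = 1;                                     /* invalidate all other caches    */
    __asm__ __volatile__("wbinvd" ::: "memory");   /* write line back to main memory */

    uint64_t t0 = rdtsc();
    (void)*line;                                   /* timed read served from DRAM    */
    uint64_t t1 = rdtsc();
    return t1 - t0;
}
```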

4 Two Layers of Interleaving on the Xeon Phi KNC

This section discusses the results of the Cache Read and Memory Read benchmarks obtained on the KNC processor. Subsequently, a ping-pong micro-benchmark like in [8] is examined as a prototype of many synchronisation protocols.

The processor (B1PRQ-5110P) used in this study has 60 in-order cores with fair time multiplexing among 4 hardware threads per core and a frequency of 1.05 GHz. In order to reduce fluctuations caused by the other threads, they are put to sleep with the delay instruction. The measurements use the core’s time stamp counter via the rdtsc instruction, which exhibits quite low fluctuations because of the simple cores and sleeping threads.

The cores, directories, and memory channels are spread across a ring network and, thus, each core has two adjacent neighbours. In fact, multiple rings in both directions are used and these rings do not necessarily take the same path across the chip area. An exact assignment of cache lines to directories will be difficult because each directory has multiple nearby cores that should observe similarly low latency in the benchmarks described above. Messages on the ring can “bounce” [6] at their destination due to contention, which causes the message to traverse the whole ring until it reaches the destination again. Hence, unrelated memory traffic should be avoided in order to reduce contention at the directories.

Each core has a hardware prefetcher that discovers access patterns [9] and reads the next lines speculatively. In order to protect the Memory Read benchmark from the prefetcher, we considered power-of-two-sized address ranges and selected the next line by reversing the bit order of the cache line’s index.
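The sketch below shows the intended visiting order under the assumption of 64-byte cache lines; the helper names are illustrative.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of x. */
static uint64_t bit_reverse(uint64_t x, unsigned bits)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < bits; i++)
        r |= ((x >> i) & 1u) << (bits - 1 - i);
    return r;
}

/* The i-th visited line within a range of 2^bits cache lines starting
 * at `base`: consecutive i map to far-apart addresses, which defeats
 * the sequential hardware prefetcher. */
static volatile uint64_t *next_line(uint8_t *base, uint64_t i, unsigned bits)
{
    return (volatile uint64_t *)(base + 64u * bit_reverse(i, bits));
}
```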

In order to reduce interference as much as possible, we implemented the benchmarks as a kernel extension of the MyThOS operating system prototype. During the boot sequence, the studied address range is reserved in order to keep other data structures away. The timer interrupts were disabled on all hardware threads.

Fig. 2. Cache Read latency (in cycles) from one core pair versus the best pair.

Fig. 3. Distribution of the directories across the ring and distance to core 0.

Cache Read Results. Figure 2 shows the Cache Read latency measured from core 0 to 1 as well as the best latency over all pairs. For a fixed pair of cores, the latency ranges from 136 to 396 cycles with an average of 262 cycles (248.9 ns). This is comparable to the 243 cycles [10] and 235.8 ns [8] reported in the literature. Considering the best pair, the latency ranges from 135 to just 152 cycles, with 95% of the lines below 140 cycles.

In conclusion, the responsible directory of each line is near to at least one core and its neighbours. Thus, synchronisation between nearby cores has a good potential for acceleration by placing the synchronisation variables in lines managed by nearby directories. The average latency for a single access can be reduced from 260 to 140 cycles and, more importantly, the worst case latency of 400 cycles can be avoided systematically.

Figure 3(a) shows the distribution of lines over cores based on the minimal latency. Most cores have the lowest latency for around \(1.7\%\) of the lines, as can be expected for 60 cores. However, distributing 64 directories over 60 cores cannot be completely fair. While some cores get fewer or no lines, their neighbours seem to be nearer to these directories. Fortunately, the excess amount can be balanced over neighbours without increasing the latency much. For example, we assigned lines greedily to whichever of the three best cores had the fewest assigned lines so far.
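A sketch of this greedy balancing is shown below; it assumes that the Cache Read benchmark left the IDs of the three lowest-latency cores per line, and all names are illustrative.

```c
#define NCORES 60

/* Greedy balancing sketch: for one cache line, pick whichever of its
 * three lowest-latency cores currently has the fewest assigned lines. */
int assign_line(const int best3[3], unsigned assigned[NCORES])
{
    int pick = best3[0];
    for (int k = 1; k < 3; k++)
        if (assigned[best3[k]] < assigned[pick])
            pick = best3[k];
    assigned[pick]++;
    return pick;
}
```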

After careful examination, we were able to partially recover the mapping from cache line address to directory for the 256 KiB range starting at 4 GiB in the physical address space. Figure 3(b) shows the latency from core 0 to 1 for 60 different lines per directory. The bidirectional ring topology is clearly visible: the latency rises until the directory is located on the opposite side of the ring and then falls again.

For our KNC the mapping worked as follows: Let \(c_{17\dots 0}\) be the bits of the line’s physical address excluding the 6 lowest bits of the offset inside the line. The directory index \(d_{5\dots 0}\) then is

$$\begin{aligned} d_{5\dots 0} = ( c_2 \oplus c_5 \oplus c_{11} ;\, c_1 \oplus c_4 \oplus c_{10} ;\, c_0 \oplus c_3 \oplus c_{9} ;\, c_2 \oplus c_8 ;\, c_1 \oplus c_7 ;\, c_0 \oplus c_6 ), \end{aligned}$$

where \(\oplus\) denotes the exclusive-or and the semicolon separates the individual bits. This scheme is reasonably close to the interleaving documented for multi-socket Intel Xeon processors [13] as described in Sect. 2. Please note that bits from outside the examined 256 KiB address range are missing above and that the mapping may vary between variants of the KNC processor. In addition, the distance between the cores and these directories can vary depending on disabled cores.
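For illustration, the recovered mapping can be written as the following C sketch; we assume that the left-most term of the tuple above is the most significant index bit \(d_5\), and the helper names are our own.

```c
#include <stdint.h>

static inline unsigned bit(uint64_t x, unsigned i) { return (x >> i) & 1u; }

/* Recovered directory index for our KNC: c is the physical address
 * without the 6-bit offset inside the cache line. */
static unsigned directory_index(uint64_t phys)
{
    uint64_t c = phys >> 6;
    unsigned d5 = bit(c, 2) ^ bit(c, 5) ^ bit(c, 11);
    unsigned d4 = bit(c, 1) ^ bit(c, 4) ^ bit(c, 10);
    unsigned d3 = bit(c, 0) ^ bit(c, 3) ^ bit(c, 9);
    unsigned d2 = bit(c, 2) ^ bit(c, 8);
    unsigned d1 = bit(c, 1) ^ bit(c, 7);
    unsigned d0 = bit(c, 0) ^ bit(c, 6);
    return (d5 << 5) | (d4 << 4) | (d3 << 3) | (d2 << 2) | (d1 << 1) | d0;
}
```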

Fig. 4. Memory Read latency (in cycles) from one core versus the best core.

Memory Read Results. Figure 4 shows the Memory Read latency measured from core 0 as well as the best latency across all cores. For a fixed core, the latency varies from 211 to 441 cycles with an average of 350 cycles (332.5 ns). When accessing the lines from the respective best core, the latency still varies from 195 to 400 cycles with an average of 314 cycles. The memory read latencies found in the literature are 302 cycles [9] for reads with a stride of 64 bytes when the dataset is larger than 512 KiB; Ramos and Hoefler [8] report an even lower mean memory read latency of 278.8 ns. The repeating pattern in Fig. 4 suggests that there are address ranges where such low average latency can be observed.

In conclusion, reading from memory can be accelerated only by selecting a subset of lines with sufficiently low latency. Following Sect. 3, the best core’s latency corresponds to the distance between directory and memory channel. If the lines were interleaved across the memory channels near the responsible directory, the worst latency would be much lower than the worst ring distance of 200 cycles. Therefore, we can assume that the lines are interleaved across the memory channels independently of the interleaving across directories.

Fig. 5. Ping-pong round-trip time depending on the distance between the cores.

Ping-Pong Benchmark Results. In practice, the latency of reading from a shared variable is just half the story because actual synchronisation protocols have to write to the variable. The time needed to acquire exclusive write access from the directory and the time until other cores observe the new value have to be taken into account. Furthermore, protocols may consist of multiple write/read steps, which can amplify the impact of the placement-dependent overhead.

As a first micro-benchmark for synchronisation scenarios, we implemented a single-line ping-pong similar to [8]. The average ping-pong latency over 1000 runs was measured for multiple cache lines with optimal and worst placement as well as for different distances between the participating cores. A read and an atomic variant have been examined. The read variant implements the polling by repeatedly reading from the flag until the value changes, which temporarily brings the line into a shared state between both cores. The atomic fetch-and-add variant polls by adding zero; here, the line is never shared and just the ownership travels between the cores [25].
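The two polling styles can be sketched as follows (assumed structure, not the benchmark’s actual code); each core alternately waits for the expected value and then bumps the flag for its partner.

```c
#include <stdint.h>

/* Single flag in its own cache line (sketch). */
static volatile uint64_t flag __attribute__((aligned(64)));

/* Read variant: polling by plain reads brings the line into a shared
 * state, so the subsequent write must first invalidate the partner's copy. */
void pingpong_read(uint64_t expected)
{
    while (flag != expected)
        ;
    flag = expected + 1;
}

/* Atomic variant: fetch-and-add with zero increment keeps the line
 * exclusive; only the ownership migrates between the two cores. */
void pingpong_atomic(uint64_t expected)
{
    while (__sync_fetch_and_add(&flag, 0) != expected)
        ;
    __sync_fetch_and_add(&flag, 1);
}
```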

Figure 5 shows the distribution of the round-trip time as boxplots for cache lines placed near to one of the two cores (“best”) and farthest away from both cores (“worst”). For the read variant and adjacent cores, the average round-trip time is 577 ns for the best placement and 1434 ns for the worst placement. With growing distance, the placement’s impact diminishes, reaching an average of 990 ns. For the atomic variant and adjacent cores, the average round-trip time is 222 ns for the best placement and 685 ns for the worst placement. Again, the placement’s impact diminishes with growing distance, reaching an average of 433 ns. In comparison, Ramos and Hoefler reported 497 ns for this situation [8]. The atomic variant also shows much smaller fluctuations.

The results show that the placement information obtained by the Cache Read benchmark can be used to improve the average latency of actual communication schemes provided that the communicating cores are near to each other. Without pUMA-aware placement, the round-trip time would fluctuate by up to 3x over the best-case time depending on the distance between cores and directory. For communication patterns that involve shared cache lines, the invalidation broadcasts caused by the request for ownership add considerable overhead and fluctuation. Polling by non-mutating writes or write-hint prefetches [25] can reduce the round-trip time by up to 2.5x compared to read-based polling.

5 Implications for pUMA-Aware Coordination

Pseudo-uniform memory may improve the usability of NUMA architectures because data and tasks do not need to be partitioned over a large number of domains. One example is nested parallel computation as in OpenMP and Cilk. However, the scalability of coordination-intensive computations still depends on minimising communication overheads, while the pseudo-uniform address interleaving spreads the local communication involuntarily across the whole system.

This paper analysed the address interleaving across memory channels and cache coherence directories on the Intel Xeon Phi KNC processor. The micro-benchmarks show that both layers of interleaving are independent and, hence, different placement strategies are needed for optimised reading from memory versus optimised communication between cores. Only a subset of the available cache lines is useful for large linked data structures like linked lists, trees, and graphs as studied in [5]. In contrast, a significant latency reduction is achievable for access to shared variables provided that the communicating cores are in proximity to the responsible coherence directory.

As [8, 10] pointed out, the impact of contention at the coherence directories is considerable, with 60 ns of extra latency per concurrent thread in the ping-pong example. This situation arises naturally when a large number of threads accesses the same synchronisation variables, for example global semaphores and barriers. Hierarchical strategies and software combining strategies as reviewed in Sect. 2 can mitigate this contention bottleneck. These approaches lead naturally to a spatial partitioning of the cores in order to keep the majority of the communication localised. In such settings, the pUMA-aware placement of the local synchronisation variables should provide noticeable additional acceleration while reducing placement-dependent latency and throughput variations.

Ideal candidates for such improvements are scalable services of parallel runtime environments and of the operating system, for example the distributed memory and thread management, cross-core thread synchronisation, basic messaging and notification primitives, and application-level task schedulers.

On the practical side, improved system support is needed: a pure user-land implementation requires the pinning of mapped pages in the virtual memory management, the translation from virtual to physical addresses, and the assignment of cache lines to nearby cores. The KNC’s Linux supports the first two aspects but leaves the assignment to the application. Without control over the used physical address ranges, applications would need on-line measurements or a large database like in [5]. Instead, a pUMA kernel module could provide an mmap service that returns pages pre-initialised with the assignment to nearby cores.