1 Introduction

Large-scale multi- and many-core processors have to compromise between the scalability of the memory architecture, its space and power consumption, and the usability for application developers. Efficient memory interconnects are usually inherently non-uniform and their latency varies with the distance between core and memory while the peak throughput diminishes with growing distance. Therefore, tasks and their data should be placed close together in order to reduce latency and increase throughput but, at the same time, should be distributed in order to increase parallelism and balance the load over multiple bottlenecks [1].

Coherent caching layers further complicate the situation. Directory-based coherency protocols [2] as well as distributed shared caches [3, 4] employ global directory components that route requests to recent copies and coordinate global invalidation and updating. In order to resolve throughput bottlenecks at these components, several such components are distributed across the network, and the request load should be spread uniformly among them.

Non-Uniform Memory Architectures group memory channels, directories, and compute cores such that an almost uniform low latency and high throughput is provided within each group—also known as NUMA domain or node. In order to utilise the system’s peak throughput, it is the application’s responsibility to balance data and compute tasks across these domains. This requires essentially the same strategies as in distributed systems, for example a domain decomposition with bin-packing for load balancing. As a positive side effect, this also results in more localised coordination, which enables synchronisation with low latency and low congestion. However, while successful on medium-sized NUMA systems, the effort and the load-balancing challenge grow with the number of domains relative to the size of the shared memory.

A more convenient alternative is pseudo-Uniform Memory Architectures, which use hardware-based address interleaving, for example with cache line granularity, in order to uniformly distribute the load over many memory channels and coherence directories. Provided that the network can cope with the aggregated peak throughput, applications need to worry neither about throughput bottlenecks nor about the co-located placement of data and tasks.

Unfortunately, this is true only for throughput-bound computations on large-enough datasets: any synchronisation between cores is still dominated by the cache coherence latency, which depends on the distance between the involved cores and coherence directory. While the hardware’s interleaving has no mitigating effect on the usually small synchronisation variables, their seemingly random spatial placement leads to difficult-to-predict overheads and performance variations. For synchronisation, the convenient pseudo-uniformity becomes a layer of obfuscation [5]. A few badly positioned synchronisation variables can slow down the whole application. Analysing such performance bottlenecks is further impaired by placement-dependent variation between repeated runs of the same application outside of the developer’s control.

This paper studies the pseudo-uniform architecture of the Intel Xeon Phi Knights Corner (KNC) many-core processor [6] and derives strategies for the optimised placement of synchronisation variables and similar latency-bound objects. The KNC provides 59–61 cores with four hardware threads each, four memory controllers, and 64 cache coherence directories—all connected via a shared point-to-point ring network. Compared to previous Xeon processors, the path between a core, the responsible directory, and the destination cache or memory controller can be very long, which results in considerable placement-dependent latency variation.

To this end, we reconstruct a mapping from cache line addresses to neighbouring cores based on latency measurements and use this mapping to initialise a pUMA-aware cache line allocator. For a basic cache line ping-pong pattern, this pUMA-aware placement enabled a 3x speedup between adjacent cores.

The next section reviews related work with respect to memory architectures and locality awareness. Section 3 devises generic experiments to study the effects of interleaving across directories and memory channels. Section 4 then discusses the experiment results obtained on the Intel Xeon Phi KNC. Finally, Sect. 5 discusses the broader implications for placement and coordination on the KNC and similar pUMA architectures.

2 Related Work

This section surveys performance studies related to the Intel Xeon Phi KNC processor and uniform memory architectures. The last part reviews coordination strategies from NUMA systems with relevance to uniform memory architectures.

Studies Related to the Intel Xeon Phi KNC. The Larrabee architecture for visual computing [7] is the ancestor of the Knights Corner processors. The article proposes a many-core architecture based on simple x86 cores with SIMD short vector units, private L2 caches, and an on-chip ring network for cache coherency. In order to keep the ring latency small compared to the latency of the DRAM memory channels, multiple “short linked” rings are proposed without discussing the implications for the cache coherency. The authors point out that synchronisation between threads within a core is fast because of the shared L1 cache, whereas cross-core synchronisation is inherently much slower. Hence, computations that access the same data should be placed onto the same core.

Based on the available technical documentation and micro-benchmarks on the KNC 5110P, Ramos and Hoefler [8] provide a detailed overview of the KNC’s directory-based cache coherence and present a quantitative performance model for cross-core communication. Likewise, Fang et al. [9, 10] published extensive studies of the KNC. Both groups consider the average latency over a large number of cache lines and report similar results: reading from any other cache takes 243 cycles on average and reading from memory takes 318–346 cycles on average. The latency of reading a single cache line from another core’s cache is examined in [10]. There, a latency variation from 160–340 cycles depending on the partner core is visible. The authors note that the latency does not correlate with the distance between the two cores because of the distributed coherence directories.

Gerofi et al. [5] studied the “hidden non-uniformity” of the KNC processor with respect to reading from main memory. They show a 60% variation in latency when reading different cache lines from the main memory and propose a respective memory allocator that reduces this cache miss latency. The authors argue that such placement could speed up algorithms that exhibit difficult-to-predict access patterns, for example, because of recursive data structures like linked lists, trees, and graphs. Their evaluation demonstrates a 17–28% throughput improvement for an A* shortest path algorithm with optimised allocation of the graph nodes. In contrast to [5], the present paper focuses on cross-core communication, that is, the latency of accessing another core’s cache.

Other UMA and pUMA Systems. The IBM Cyclops processor [11] has 16 embedded memory banks and the contiguous address space can be interleaved over caches and memory banks in order to balance the congestion. A crossbar switch is used to provide uniform latency between all cores, caches, and memory banks. Similarly, the Oracle Sparc T5 processors [12] use a crossbar for uniform latency.

Multi-socket Intel Xeon processors are usually operated as NUMA systems with one (pseudo-)uniform domain per socket. However, the address interleaving is configurable and can span multiple sockets [13] by combining bits of the physical address into a 3-bit target index. The “low-order” interleave uses bits 6–8 as target, which distributes consecutive lines over adjacent targets, and the “low/mid-hash” interleave uses bits 6–8 exclusive-or bits 16–18. In addition, the “hemisphere” variant replaces the first target bit with an exclusive-or of the bits 6, 10, 13, and 19 in order to better distribute accesses with a fixed stride.
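For illustration, the following C sketch reproduces the three documented interleave variants. The bit positions follow the description above; the helper names and the interpretation of the “first target bit” as the least significant index bit are our own assumptions.

```c
#include <stdint.h>

/* Sketch of the 3-bit interleave targets documented in [13]; bit positions
 * follow the text, mapping the "first target bit" to index bit 0 is assumed. */
static inline unsigned bit(uint64_t a, unsigned i) { return (a >> i) & 1u; }

/* "low-order": consecutive cache lines map to adjacent targets. */
static unsigned target_low_order(uint64_t phys)
{
    return (phys >> 6) & 0x7;                       /* bits 6..8 */
}

/* "low/mid-hash": bits 6..8 XOR bits 16..18. */
static unsigned target_low_mid_hash(uint64_t phys)
{
    return ((phys >> 6) ^ (phys >> 16)) & 0x7;
}

/* "hemisphere": the first target bit becomes the XOR of bits 6, 10, 13, 19. */
static unsigned target_hemisphere(uint64_t phys)
{
    unsigned high = target_low_order(phys) & 0x6;   /* keep index bits 1..2 */
    unsigned h = bit(phys, 6) ^ bit(phys, 10) ^ bit(phys, 13) ^ bit(phys, 19);
    return high | h;
}
```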

To a limited degree, interleaving can be implemented by software. The processor’s virtual address spaces can be used for interleaving on page granularity [1, 14] and applications can distribute the placement of their data structures [15].

The Tilera Tile processors use a distributed shared L2 cache with a local cache at each core [4]. Requests are routed to the line’s home cache, which is configured on page-size granularity. While interleaving over multiple L2 components is possible, synchronisation variables can simply be allocated in dedicated pages with known placement.

Coordination in NUMA Systems. Alongside the ratio of parallel to sequential computations, the scalability depends considerably on the overhead associated with distributing tasks across threads and synchronising the actions of concurrently active tasks. This overhead depends on the communication latency and the congestion on memory channels and network [1] and, thus, also on the contention, that is, the number of threads competing for a shared resource [16].

Some NUMA strategies reduce the latency by moving shared variables closer to their threads. One example is frequent polling on locally cached flags and rare signalling to remote flags as done by queue locks [17, 18] and work stealing [19]. Tightly related are strategies that reduce contention by distributing the load over multiple peers. Examples are the replication of services [20] and hierarchically distributed locks [21]. Software Combining generalises both aspects by combining multiple local accesses into fewer remote messages [16, 17, 22].

Finally, some strategies reduce the data migration between NUMA domains, for example, by keeping related tasks in the same domain as in hierarchical work stealing [19], preferring threads of the same domain as in cohort locking [23], or moving tasks to specific domains as in delegation locks [24].

3 Measuring Latency: Reading from Caches vs. Memory

Latency-bound phases can be accelerated by reducing the stall time when reading from main memory with unpredictable access patterns (like [5]) and by reducing the latency when synchronising nearby threads via shared variables. Neither aspect can be mitigated by hardware or software prefetching. The objective therefore is to reduce the latency by placing the data into cache lines that are locally managed and stored. Unfortunately, the pUMA address interleaving, while balancing the congestion for improved throughput, obfuscates the actual placement. In the absence of documentation about the interleaving, latency measurements can uncover sufficient information for a pUMA-aware allocator, for example by assigning lines to the cores with the lowest latency. This section devises latency measurements that provide such information.

Assuming a processor with cache coherence based on a shared distributed directory and private caches per core, the latency depends on the distances between the client core (C), the responsible directory component (D), and the remote cache that currently owns the line (O) or, respectively, the responsible memory channel (M). The responsible directory and memory channel are selected by the hardware’s interleaving scheme. The directory tracks the sharing state of previously accessed lines and routes read requests accordingly to the current owner cache or to a memory channel. Similar to Ramos and Hoefler [8], two cases can be distinguished, as illustrated in Fig. 1: a Cache Read is routed to the current owner cache (O) and a Memory Read is routed to the off-chip memory (M) because the line is invalid (not present) in all caches.

Fig. 1. Communication path for reading a line from another cache or the memory.

The read latency for both cases is \(d_{C,D} + d_{D,O/M} + d_{O/M,C} + o\), where \(d_{x,y}\) is the latency introduced by the network between x and y, and o is the processing overhead for cache, directory, and memory lookups. The network latency grows with the distance and the link congestion, whereas the processing overhead grows with the contention. The unwanted influence of congestion and contention can be circumvented by recording the minimum latency over multiple measurements and by putting all unneeded cores to sleep.

Cache Read Benchmark. Given the address of a cache line, the directory D is fixed while C and O can be chosen. Intuitively, any two cores C and O that minimise the latency for a fixed line must be neighbours in the network. For such pairs, the latency is approximately \(2 d_{C,D} + o\) and can be used to study the placement of the directories relative to the cores. Basically, each line can be assigned to the core that has the lowest read latency with one of its neighbours.

Similar to [8], the measurement proceeds as follows for each line and client core: An arbitrary neighbour core writes to the line in order to become the owner (O) and invalidate the line in all other caches. Then, it sets a helper flag in an unrelated line to notify the client core (C). This core (C) then measures the time needed for reading from the cache line. The n smallest latency values and the corresponding core IDs are recorded inside each line.
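A minimal sketch of one such measurement on generic x86 is shown below; all identifiers are illustrative, the timing uses the rdtsc counter, and any KNC-specific serialisation or back-off would have to be added for the real benchmark.

```c
#include <stdint.h>

/* Illustrative sketch of one Cache Read measurement (not the actual
 * benchmark code). `line` is the measured cache line, `flag` lives in
 * an unrelated cache line. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Executed on the neighbour core O. */
void owner_prepare(volatile uint64_t *line, volatile int *flag)
{
    *line = 1;               /* write: become owner, invalidate other caches */
    __sync_synchronize();
    *flag = 1;               /* notify the client core C */
}

/* Executed on the client core C; returns the read latency in cycles. */
uint64_t client_measure(volatile uint64_t *line, volatile int *flag)
{
    while (*flag == 0)
        ;                    /* wait until O owns the line */
    uint64_t t0 = rdtsc();
    (void)*line;             /* timed read from the remote cache */
    uint64_t t1 = rdtsc();
    return t1 - t0;
}
```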

The basic benchmark can be accelerated by considering two lines and two adjacent cores: Each line contains one flag for notification and measurement, and each line is initially owned by one of the cores. One core (C) measures the time needed for accessing the other core’s flag using an atomic fetch-and-add, while the other core (O) polls the same flag using an atomic fetch-and-add with zero increment. Thus, the line’s ownership is transferred just once for the measurement and immediately back to the other core (O) due to the polling. The other core (O) is notified about the finished measurement by seeing the incremented value. Then, both cores swap their roles (C↔O) and operate on the other cache line.
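A sketch of this accelerated scheme could look as follows; it assumes that the GCC-style atomic builtins compile to locked fetch-and-add instructions, reuses the illustrative rdtsc helper from above, and all names are our own.

```c
#include <stdint.h>

extern uint64_t rdtsc(void);   /* as in the previous sketch */

/* Each flag occupies its own cache line and is initially owned by one
 * of the two adjacent cores. */
struct line { volatile uint64_t flag; char pad[56]; } __attribute__((aligned(64)));

/* Core C: one timed fetch-and-add on the line owned by the partner;
 * this pulls the ownership over exactly once. */
uint64_t measure_remote(struct line *remote)
{
    uint64_t t0 = rdtsc();
    __sync_fetch_and_add(&remote->flag, 1);
    uint64_t t1 = rdtsc();
    return t1 - t0;
}

/* Core O: polls its own flag with a zero increment, so the ownership
 * returns immediately and the line never enters a shared state; the
 * incremented value signals that the partner's measurement finished. */
void poll_own(struct line *own, uint64_t old_value)
{
    while (__sync_fetch_and_add(&own->flag, 0) == old_value)
        ;
}
```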

Memory Read Benchmark. Given the address of a cache line, both D and M are fixed while the core C can be chosen freely. When selecting a core with minimal distance to the directory, the latency is approximately \(2 d_{C,M} + o\) and can be used to study the placement of the memory channels relative to directories. By taking the smallest memory read latency over all cores, the best core for each line can be found without needing to know the responsible directory.

Similar to [5], the measurement proceeds as follows for each line and core: The core (C) writes to the line in order to invalidate it in all other caches and then uses wbinvd or a similar instruction to write the line back to main memory. Then, the time needed for reading the line is measured. The n smallest latency values and the corresponding core IDs are recorded inside each line.
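A minimal sketch of one such measurement is shown below; it runs in kernel mode because wbinvd is privileged and flushes the whole cache, whereas the actual benchmark may rely on a finer-grained eviction. The rdtsc helper is the same illustrative one as before.

```c
#include <stdint.h>

extern uint64_t rdtsc(void);   /* as in the previous sketches */

/* Sketch of one Memory Read measurement on core C (kernel mode). */
uint64_t memory_read_latency(volatile uint64_t *line)
{
    *line = 1;                                     /* invalidate all other caches    */
    __asm__ __volatile__("wbinvd" ::: "memory");   /* write line back to main memory */

    uint64_t t0 = rdtsc();
    (void)*line;                                   /* timed read served from DRAM    */
    uint64_t t1 = rdtsc();
    return t1 - t0;
}
```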

4 Two Layers of Interleaving on the Xeon Phi KNC

This section discusses the results of the Cache Read and Memory Read benchmarks obtained on the KNC processor. Subsequently, a ping-pong micro-benchmark like in [8] is examined as a prototype of many synchronisation protocols.

The processor (B1PRQ-5110P) used in this study has 60 in-order cores with fair time multiplexing among 4 hardware threads per core and a frequency of 1.05 GHz. In order to reduce fluctuations caused by the other threads, they are put to sleep with the delay instruction. The measurements use the core’s time stamp counter via the rdtsc instruction, which exhibits quite low fluctuations because of the simple cores and sleeping threads.

The cores, directories, and memory channels are spread across a ring network and, thus, each core has two adjacent neighbours. In fact, multiple rings in both directions are used and these rings do not necessarily take the same path across the chip area. An exact assignment of cache lines to directories will be difficult because each directory has multiple nearby cores that should observe similarly low latency in the benchmarks described above. Messages on the ring can “bounce” [6] at their destination due to contention, which causes the message to traverse the whole ring until it reaches the destination again. Hence, unrelated memory traffic should be avoided in order to reduce contention at the directories.

Each core has a hardware prefetcher that discovers access patterns [9] and reads the next lines speculatively. In order to protect the Memory Read benchmark from the prefetcher, we considered power-of-two-sized address ranges and selected the next line by reversing the bit order of the cache line’s index.
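The sketch below shows the intended visiting order under the assumption of 64-byte cache lines; the helper names are illustrative.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of x. */
static uint64_t bit_reverse(uint64_t x, unsigned bits)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < bits; i++)
        r |= ((x >> i) & 1u) << (bits - 1 - i);
    return r;
}

/* The i-th visited line within a range of 2^bits cache lines starting
 * at `base`: consecutive i map to far-apart addresses, which defeats
 * the sequential hardware prefetcher. */
static volatile uint64_t *next_line(uint8_t *base, uint64_t i, unsigned bits)
{
    return (volatile uint64_t *)(base + 64u * bit_reverse(i, bits));
}
```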

In order to reduce interference as much as possible, we implemented the benchmarks as a kernel extension of the MyThOS operating system prototype. During the boot sequence, the studied address range is reserved in order to keep other data structures away. The timer interrupts were disabled on all hardware threads.

Fig. 2. Cache Read latency (in cycles) from one core pair versus the best pair.

Fig. 3. Distribution of the directories across the ring and distance to core 0.

Cache Read Results. Figure 2 shows the Cache Read latency measured from core 0 to 1 as well as the best latency over all pairs. For a fixed pair of cores, the latency ranges from 136 to 396 cycles with an average of 262 cycles (248.9 ns). This is comparable to the 243 cycles [10] and 235.8 ns [8] reported in the literature. Considering the best pair, the latency ranges from 135 to just 152 cycles, with 95% of the lines below 140 cycles.

In conclusion, the responsible directory of each line is near to at least one core and its neighbours. Thus, synchronisation between nearby cores has a good potential for acceleration by placing the synchronisation variables in lines managed by nearby directories. The average latency for a single access can be reduced from 260 to 140 cycles and, more importantly, the worst case latency of 400 cycles can be avoided systematically.

Figure 3(a) shows the distribution of lines over cores based on the minimal latency. Most cores have the lowest latency for around \(1.7\%\) of the lines, as can be expected for 60 cores. However, distributing 64 directories over 60 cores cannot be completely fair. While some cores get fewer or no lines, their neighbours seem to be nearer to these directories. Fortunately, the excess amount can be balanced over neighbours without increasing the latency much. For example, we assigned lines greedily to whichever of the three best cores had the fewest assigned lines so far.
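A sketch of this greedy balancing is shown below; it assumes that the Cache Read benchmark left the IDs of the three lowest-latency cores per line, and all names are illustrative.

```c
#define NCORES 60

/* Greedy balancing sketch: for one cache line, pick whichever of its
 * three lowest-latency cores currently has the fewest assigned lines. */
int assign_line(const int best3[3], unsigned assigned[NCORES])
{
    int pick = best3[0];
    for (int k = 1; k < 3; k++)
        if (assigned[best3[k]] < assigned[pick])
            pick = best3[k];
    assigned[pick]++;
    return pick;
}
```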

After careful examination, we were able to partially recover the mapping from cache line address to directory for the 256 KiB range starting at 4 GiB in the physical address space. Figure 3(b) shows the latency from core 0 to 1 for 60 different lines per directory. The bidirectional ring topology is clearly visible: the latency rises until the directory is located on the opposite side of the ring and then falls again.

For our KNC the mapping worked as follows: Let \(c_{17\dots 0}\) be the bits of the line’s physical address excluding the 6 lowest bits of the offset inside the line. The directory index \(d_{5\dots 0}\) then is

$$\begin{aligned} d_{5\dots 0} = ( c_2 \oplus c_5 \oplus c_{11} ;\, c_1 \oplus c_4 \oplus c_{10} ;\, c_0 \oplus c_3 \oplus c_{9} ;\, c_2 \oplus c_8 ;\, c_1 \oplus c_7 ;\, c_0 \oplus c_6 ), \end{aligned}$$

where \(\oplus\) denotes the exclusive-or and the semicolon separates the individual bits. This scheme is reasonably close to the interleaving documented for multi-socket Intel Xeon processors [13] as described in Sect. 2. Please note that bits from outside the examined 256 KiB address range are missing above and that the mapping may vary between variants of the KNC processor. In addition, the distance between the cores and these directories can vary depending on disabled cores.
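For illustration, the recovered mapping can be written as the following C sketch; we assume that the left-most term of the tuple above is the most significant index bit \(d_5\), and the helper names are our own.

```c
#include <stdint.h>

static inline unsigned bit(uint64_t x, unsigned i) { return (x >> i) & 1u; }

/* Recovered directory index for our KNC: c is the physical address
 * without the 6-bit offset inside the cache line. */
static unsigned directory_index(uint64_t phys)
{
    uint64_t c = phys >> 6;
    unsigned d5 = bit(c, 2) ^ bit(c, 5) ^ bit(c, 11);
    unsigned d4 = bit(c, 1) ^ bit(c, 4) ^ bit(c, 10);
    unsigned d3 = bit(c, 0) ^ bit(c, 3) ^ bit(c, 9);
    unsigned d2 = bit(c, 2) ^ bit(c, 8);
    unsigned d1 = bit(c, 1) ^ bit(c, 7);
    unsigned d0 = bit(c, 0) ^ bit(c, 6);
    return (d5 << 5) | (d4 << 4) | (d3 << 3) | (d2 << 2) | (d1 << 1) | d0;
}
```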

Fig. 4. Memory Read latency (in cycles) from one core versus the best core.

Memory Read Results. Figure 4 shows the Memory Read latency measured from core 0 as well as the best latency across all cores. For a fixed core, the latency varies from 211 to 441 cycles with an average of 350 cycles (332.5 ns). When accessing the lines from the respective best core, the latency still varies from 195 to 400 cycles with an average of 314 cycles. The memory read latencies found in the literature are 302 cycles [9] for reads with a stride of 64 bytes when the dataset is larger than 512 KiB; Ramos and Hoefler [8] report an even lower mean memory read latency of 278.8 ns. The repeating pattern in Fig. 4 suggests that there are address ranges where such low average latency can be observed.

In conclusion, reading from memory can be accelerated only by selecting a subset of lines with sufficiently low latency. Following Sect. 3, the best core’s latency corresponds to the distance between directory and memory channel. If the lines were interleaved across the memory channels near the responsible directory, the worst latency would be much lower than the worst ring distance of 200 cycles. Therefore, we can assume that the lines are interleaved across the memory channels independently of the interleaving across directories.

Fig. 5. Ping-pong round-trip time depending on the distance between the cores.

Ping-Pong Benchmark Results. In practice, the latency of reading from a shared variable is just half the story because actual synchronisation protocols have to write to the variable. The time needed to acquire exclusive write access from the directory and the time until other cores observe the new value have to be taken into account. Furthermore, protocols may consist of multiple write/read steps, which can amplify the impact of the placement-dependent overhead.

As a first micro-benchmark for synchronisation scenarios, we implemented a single-line ping-pong similar to [8]. The average ping-pong latency over 1000 runs was measured for multiple cache lines with optimal and worst placement as well as for different distances between the participating cores. A read and an atomic variant have been examined. The read variant implements the polling by repeatedly reading from the flag until the value changes, which temporarily brings the line into a shared state between both cores. The atomic fetch-and-add variant polls by adding zero; here, the line is never shared and just the ownership travels between the cores [25].
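The two polling styles can be sketched as follows (assumed structure, not the benchmark’s actual code); each core alternately waits for the expected value and then bumps the flag for its partner.

```c
#include <stdint.h>

/* Single flag in its own cache line (sketch). */
static volatile uint64_t flag __attribute__((aligned(64)));

/* Read variant: polling by plain reads brings the line into a shared
 * state, so the subsequent write must first invalidate the partner's copy. */
void pingpong_read(uint64_t expected)
{
    while (flag != expected)
        ;
    flag = expected + 1;
}

/* Atomic variant: fetch-and-add with zero increment keeps the line
 * exclusive; only the ownership migrates between the two cores. */
void pingpong_atomic(uint64_t expected)
{
    while (__sync_fetch_and_add(&flag, 0) != expected)
        ;
    __sync_fetch_and_add(&flag, 1);
}
```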

Figure 5 shows the distribution of the round-trip time as boxplots for cache lines placed near to one of the two cores (“best”) and farthest away from both cores (“worst”). For the read variant and adjacent cores, the average round-trip time is 577 ns for the best placement and 1434 ns for the worst placement. With growing distance, the placement’s impact diminishes, reaching an average of 990 ns. For the atomic variant and adjacent cores, the average round-trip time is 222 ns for the best placement and 685 ns for the worst placement. Again, the placement’s impact diminishes with growing distance, reaching an average of 433 ns. In comparison, Ramos and Hoefler reported 497 ns for this situation [8]. The atomic variant also shows much smaller fluctuations.

The results show that the placement information obtained by the Cache Read benchmark can be used to improve the average latency of actual communication schemes provided that the communicating cores are near to each other. Without pUMA-aware placement, the round-trip time would fluctuate by up to 3x over the best-case time depending on the distance between cores and directory. For communication patterns that involve shared cache lines, the invalidation broadcasts caused by the request for ownership add considerable overhead and fluctuation. Polling by non-mutating writes or write-hint prefetches [25] can reduce the round-trip time by up to 2.5x compared to read-based polling.

5 Implications for pUMA-Aware Coordination

Pseudo-uniform memory may improve the usability of NUMA architectures because data and tasks do not need to be partitioned over a large number of domains. One example is nested parallel computation as in OpenMP and Cilk. However, the scalability of coordination-intensive computations still depends on minimising communication overheads, while the pseudo-uniform address interleaving spreads the local communication involuntarily across the whole system.

This paper analysed the address interleaving across memory channels and cache coherence directories on the Intel Xeon Phi KNC processor. The micro-benchmarks show that both layers of interleaving are independent and, hence, different placement strategies are needed for optimised reading from memory versus optimised communication between cores. Only a subset of the available cache lines is useful for large linked data structures like linked lists, trees, and graphs as studied in [5]. In contrast, a significant latency reduction is achievable for access to shared variables provided that the communicating cores are in proximity to the responsible coherence directory.

As [8, 10] pointed out, the impact of contention at the coherence directories is considerable, with 60 ns of extra latency per concurrent thread in the ping-pong example. This situation arises naturally when a large number of threads accesses the same synchronisation variables, for example global semaphores and barriers. Hierarchical strategies and software combining strategies as reviewed in Sect. 2 can mitigate this contention bottleneck. These approaches lead naturally to a spatial partitioning of the cores in order to keep the majority of the communication localised. In such settings, the pUMA-aware placement of the local synchronisation variables should provide noticeable additional acceleration while reducing placement-dependent latency and throughput variations.

Ideal candidates for such improvements are scalable services of parallel runtime environments and of the operating system, for example the distributed memory and thread management, cross-core thread synchronisation, basic messaging and notification primitives, and application-level task schedulers.

On the practical side, improved system support is needed: a pure user-land implementation requires the pinning of mapped pages in the virtual memory management, the translation from virtual to physical addresses, and the assignment of cache lines to nearby cores. The KNC’s Linux supports the first two aspects but leaves the assignment to the application. Without control over the used physical address ranges, applications would need on-line measurements or a large database like in [5]. Instead, a pUMA kernel module could provide an mmap service that returns pages pre-initialised with the assignment to nearby cores.