
1 Introduction

A multicast sends a message to a selected group of receivers. One of its most important uses in operating systems is the software-controlled invalidation of caches, most notably the invalidation of Translation Lookaside Buffer (TLB) entries after changes to shared address spaces [3, 4, 5, 18, 19, 22]. Although a number of mechanisms have been proposed, often a variant of the TLB shootdown algorithm [4] is used. The most costly operation in such algorithms is the handling of inter-processor interrupts (IPIs) [3, 5, 18], which are necessary to enforce the multicast’s completion.

Multi- and many-core architectures provide a large number of processor cores. For example, the Intel XeonPhi processors contain more than 60 cores with four hardware threads per core. This poses a scalability challenge [8]: The propagation and acknowledgment overhead per multicast and the number of concurrent multicasts grow with the number of threads. In addition, dynamic membership updates in multicast groups can become more frequent.

One worst-case scenario is bulk synchronous parallel processing: All threads may reconfigure their part of the shared address space after a synchronizing barrier. Restricting the updates to a few manager threads would forgo the parallelization benefits and complicate the applications. Likewise, introducing partitioned address spaces [8] would require careful use by the applications.

Many applications do not need all of the available hardware threads to fully utilize the numeric execution resources [7]. With simultaneous multi-threading (SMT) [21] the threads in a core share the local caches and, most importantly, the TLB. Hence, with minimal hardware support, an idle thread can invalidate its neighbor thread’s TLB entries without interrupting the currently running application.

This paper proposes an interrupt-free TLB invalidation algorithm that exploits dedicated hardware threads for cross-thread invalidation on shared TLBs in order to avoid disturbing application threads. Interrupt-free multicasts raise the question of whether tree- or ring-based multicast topologies can outperform conventional flat approaches and provide better trade-offs between latency and throughput. Hence, the potential performance gains on the Intel XeonPhi Knights Corner many-core processor are evaluated. However, complex multicast topologies increase the cost of dynamic membership updates [1] and do not necessarily reflect the hardware’s optimal topology. Consequently, this paper compiles a number of strategies that exploit shared memory to skip non-members dynamically on top of an optimized static topology.

The paper is structured as follows: Sect. 2 surveys related work on multicasts and TLB invalidation mechanisms. Section 3 outlines an interrupt-free invalidation algorithm that exploits the fact that the TLB is shared between multiple hardware threads. Section 4 explores the design space for dynamic multicast algorithms that operate on top of static topologies. Section 5 evaluates the potential performance gains and compares dynamic multicasts over static topologies against adapted topologies.

2 Preliminaries and Related Work

The first subsection surveys related work on multicast algorithms and topologies. Then, their application to TLB invalidation on multi-core processors is reviewed.

2.1 Multicast Topologies

Multicast algorithms distribute messages to multiple receivers over unicast networks. In the propagation phase, the message is delivered to each receiver. This can be carried out in parallel by letting intermediate receiver nodes or support nodes forward the message, which forms the logical multicast topology. A preemptive notification, e.g. via interrupt signals, ensures timely forwarding and processing on all receivers. Once the message has been processed at every receiver, an acknowledgment of the global completion is returned to the multicast’s sender, for example, to ensure ordering. This can be achieved by aggregating the individual acknowledgments along the multicast topology.

The choice of the topology provides different performance trade-offs. The throughput, i.e. the number of concurrent multicasts per time unit, is limited by the node with the highest per-message processing overhead and by the most congested network link. The overhead increases roughly with the number of direct successors, which favors the simple ring topology. The latency, i.e. the time between issuing the multicast and receiving the acknowledgment, depends on the time needed to propagate the messages and acknowledgments along the longest path. This favors flat tree topologies. Finally, the reconfiguration overhead describes the cost of inserting and removing receivers in the multicast group. Adapting the logical topology to the network’s physical topology tends to come with a high construction overhead, which favors simple topologies like rings [1].

A wide range of literature exists on the construction of optimal topologies. A recent review for many-core architectures can be found in [11]. Low-latency strategies have been found for many network topologies [10] and performance models such as the POSTAL model [2, 6] and the LogP model [12]. Fractional trees [15] provide a trade-off between latency and throughput in sparsely connected networks. Similarly, diamond rings [13] balance both by unifying acknowledgment and propagation in a ring-like topology.

A model for optimizing the throughput is the k-item broadcast problem, in which the number of rounds needed to multicast k messages is to be minimized. Santos [17] provides a near-optimal solution in the LogP model, and circulant graphs [20] do so in the simultaneous send–receive model. However, the 2Tree algorithms [16] are easier to implement while achieving near-optimal throughput.

This paper does not aim to identify the best topology. Instead, the focus lies on re-evaluating how much the choice of topology matters on many-core processors with dynamic multicast groups. Additional optimizations are available on cache-coherent shared memory systems. Instead of the message-based aggregation of the acknowledgments, tree combining [23] by a hierarchy of counters in shared memory can be more efficient.

2.2 Multi-core TLB Invalidation

The translation lookaside buffer (TLB) is a small cache that speeds up the mapping from logical to physical addresses and access permissions. Each core contains one or more local TLBs. For various reasons, the TLBs in many-core architectures are not invalidated by the hardware’s cache-coherence. Instead, the operating system has to send invalidation requests to all cores (or hardware threads) that currently use the affected address space. Especially when removing mappings, the sender has to wait for the global completion in order to ensure that all threads can no longer access the removed pages.

Thus, TLB invalidation is a major application of acknowledged multicasts. Dynamic groups are maintained in order not to bother unrelated threads and to reduce the system noise. Several algorithms have been proposed in the literature [3, 4, 5, 18, 19, 22] and, often, a variant of [4] is used, which sequentially sends interrupts to all cores to be invalidated. Barrelfish is a notable exception: it builds efficient multicast topologies from a hardware description using a constraint solver [3]. However, in scenarios where membership can change rapidly, rebuilding the entire topology would not be efficient. The following summarizes the algorithm used by Linux 4.11 on x86 architectures.

Each hardware thread owns a linked list as a task queue and an array of pre-allocated per-thread task structures. Each task consists of a list handle for the task queue, a function pointer, a generic argument pointer for the function, and an acknowledgment flag. The interrupt handler processes each task from its queue and sets the flag. The multicast groups are maintained as a bit mask in each address space. By iterating over the mask, the task for each receiver is initialized and enqueued. Then, an interrupt is sent to each receiver, either as individual interrupts or using hardware multicast support if available. Finally, the mask is iterated again to wait on each acknowledgment flag.
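
For illustration, the following sketch outlines this bookkeeping in C. The names (flush_task, enqueue_task, send_ipi, do_flush) are hypothetical, and the bit mask is simplified to a boolean array; it is not the actual Linux code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 128   /* illustrative upper bound */

/* Hypothetical layout mirroring the description above. */
struct flush_task {
    struct flush_task *next;   /* list handle for the receiver's task queue */
    void (*func)(void *);      /* invalidation function run by the handler */
    void *arg;                 /* generic argument, e.g. the address range */
    atomic_bool done;          /* acknowledgment flag set by the receiver */
};

static struct flush_task tasks[MAX_THREADS];   /* pre-allocated, one per receiver */

void enqueue_task(int cpu, struct flush_task *t);   /* assumed queue primitive */
void send_ipi(int cpu);                             /* assumed interrupt primitive */
void do_flush(void *arg);                           /* the actual TLB invalidation */

void multicast_invalidate(const bool member[MAX_THREADS], void *range)
{
    for (int cpu = 0; cpu < MAX_THREADS; cpu++) {   /* 1st pass: initialize and enqueue */
        if (!member[cpu]) continue;
        tasks[cpu].func = do_flush;
        tasks[cpu].arg  = range;
        atomic_store(&tasks[cpu].done, false);
        enqueue_task(cpu, &tasks[cpu]);
    }
    for (int cpu = 0; cpu < MAX_THREADS; cpu++)     /* 2nd pass: notify the receivers */
        if (member[cpu])
            send_ipi(cpu);
    for (int cpu = 0; cpu < MAX_THREADS; cpu++)     /* 3rd pass: wait for acknowledgments */
        if (member[cpu])
            while (!atomic_load(&tasks[cpu].done))
                ;   /* spin until the receiver's handler sets the flag */
}
```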

In conclusion, concurrent multicasts are propagated in parallel with minimal overhead for the receiver. This results in good throughput and simplicity but comes with high overhead on the sender side and, hence, high latency.

As preliminary work, we investigated the relevant parameters of a 60-core Intel XeonPhi 5110P (KNC) processor with a 1.053 GHz clock using microbenchmarks. Hence, one processor cycle equates to roughly one nanosecond. The message transmission overhead is around 1200 cycles. Sending an interrupt between cores takes around 400 cycles, and the next interrupt can be sent once the interrupt controller is ready again after approximately 1000 cycles. The interrupt latency from issuing the signal to reaching the interrupt handler was around 1000 cycles.

Typical HPC applications on this processor use 60–120 application threads. Hence, sending the 60–120 messages sequentially would take 72k–144k cycles. Sending the interrupts sequentially would cost another 60k–120k cycles. By interleaving the interrupt and message transmissions, the interrupt cost could be reduced to 24k–48k cycles. In summary, a TLB invalidation across 120 threads would take at least 264k cycles (250 µs), not counting the acknowledgment, when using a flat topology. This is roughly 10x higher than the latency of other multicast topologies on the same processor, see for example [11, 13].
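
Restating this back-of-the-envelope estimate for 120 threads with the measured costs from above, and without the interleaving optimization:

\[
\underbrace{120 \cdot 1200}_{\text{messages}\;\approx\;144\text{k}} \;+\; \underbrace{120 \cdot 1000}_{\text{interrupts}\;\approx\;120\text{k}} \;\approx\; 264\text{k cycles} \;\approx\; 250\,\mu\text{s}.
\]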

3 Interrupt-Free TLB Invalidation

This section outlines an interrupt-free invalidation algorithm that avoids operating system noise on application threads. It exploits the fact that multiple hardware threads share a TLB. Many applications do not need all of the available hardware threads to fully utilize the numeric execution resources [7]. Thus, one thread per core can be dedicated to the propagation and processing of TLB invalidation multicast messages and other operating system tasks.

The first challenge is to avoid interrupting the applications running on a core. The dedicated thread must therefore invalidate the TLB entries of the core’s neighbor threads without sending interrupts. This can be achieved by exploiting the fact that threads using exactly the same address space share their TLB entries; thus, invalidation requests issued through the INVLPG instruction become effective for the neighbor threads. On x86 processors that support process context identifiers (PCIDs), the PCID of the target address space can be used for invalidation through the INVPCID instruction. Finally, knowledge of the TLB structure obtained by reverse-engineering can be exploited [9].
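
As a minimal sketch, assuming ring-0 execution and INVPCID support, the dedicated thread could invalidate a single page of a neighbor thread's address space as follows; the function name is illustrative and the descriptor layout follows the x86 manuals.

```c
#include <stdint.h>

/* Sketch: invalidate one page of the address space identified by 'pcid'.
 * Assumes ring-0 execution and INVPCID support; illustrative, not taken
 * from an existing kernel. */
struct invpcid_desc {
    uint64_t pcid : 12;    /* PCID of the target address space */
    uint64_t rsvd : 52;    /* reserved, must be zero */
    uint64_t addr;         /* linear address to invalidate */
};

static inline void invalidate_page_for_pcid(uint16_t pcid, uint64_t linear_addr)
{
    struct invpcid_desc desc = { .pcid = pcid, .rsvd = 0, .addr = linear_addr };
    unsigned long type = 0;    /* 0 = individual-address invalidation */
    asm volatile("invpcid %[desc], %[type]"
                 : : [desc] "m" (desc), [type] "r" (type)
                 : "memory");
}
```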

The x86 PCIDs are currently not used by Linux because the overhead of multicasts to unused address spaces would quickly offset the performance gain. However, the Alpha architecture has a similar feature called address space numbers (ASNs). There, Linux maintains a small per-core mapping from used address spaces to their local ASN. Invalidation multicasts are received only for the actively used address space(s). The others are invalidated upon reloading if a generation counter inside the address space indicates a skipped invalidation. The same approach can be used for interrupt-free TLB invalidation on x86 by tracking the PCIDs on the core level instead of individual hardware threads.
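
A possible realization of this per-core PCID tracking is sketched below; the structure and function names are hypothetical, and the table sizes are placeholders.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_CORES      60   /* placeholder sizes */
#define PCIDS_PER_CORE  8

struct address_space {
    _Atomic uint64_t inval_gen;   /* incremented by every invalidation multicast */
    /* ... page tables etc. ... */
};

/* Per-core table: which address space is cached under which local PCID and
 * which invalidation generation has already been applied to the TLB. */
struct pcid_slot {
    struct address_space *as;
    uint64_t applied_gen;
};
static struct pcid_slot pcid_table[NUM_CORES][PCIDS_PER_CORE];

void flush_pcid(unsigned pcid);   /* assumed helper, e.g. INVPCID single-context */

/* Called when a hardware thread (re)activates an address space: flush the
 * stale PCID context only if an invalidation was skipped while inactive. */
void activate_address_space(int core, unsigned pcid, struct address_space *as)
{
    uint64_t gen = atomic_load(&as->inval_gen);
    struct pcid_slot *slot = &pcid_table[core][pcid];
    if (slot->as != as || slot->applied_gen != gen) {
        flush_pcid(pcid);
        slot->as = as;
        slot->applied_gen = gen;
    }
}
```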

The invalidation requests need to be multicasted only to the non-sleeping cores. Each core’s dedicated system thread checks if one of its application threads is affected. Cores waking up from deep sleep invalidate their TLBs anyway.

The second challenge is to avoid sending an interrupt to the dedicated thread. On processors with MONITOR/MWAIT support, the behavior of MWAIT ensures that the dedicated thread immediately resumes its execution whenever a message arrives in its queue. Without such support, the operating system can implement a similar behavior by polling. The dedicated thread goes to sleep when all application threads are idle and is woken up by the first resuming application thread. Multicasts should skip cores that are in deep sleep, which is achieved by the mechanisms presented in the next section.
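
The following sketch shows such a wait loop with MONITOR/MWAIT, assuming ring-0 execution; the single-slot message queue and the function names are simplifications for illustration.

```c
#include <stdatomic.h>

struct message;                                    /* defined elsewhere */
void process_invalidation(struct message *msg);    /* assumed handler, e.g. INVLPG/INVPCID */

void dedicated_thread_loop(_Atomic(struct message *) *queue_head)
{
    for (;;) {
        struct message *msg = atomic_exchange(queue_head, (struct message *)0);
        if (msg) {
            process_invalidation(msg);
            continue;
        }
        /* Arm the monitor on the queue's cache line, then re-check before
         * sleeping so that a concurrent enqueue cannot be missed. */
        asm volatile("monitor" : : "a" (queue_head), "c" (0UL), "d" (0UL));
        if (atomic_load(queue_head) == 0)
            asm volatile("mwait" : : "a" (0UL), "c" (0UL));
    }
}
```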

4 Dynamic Membership in a Static Broadcast Topology

Hierarchical multicast topologies may outperform the conventional simple flat algorithm used by Linux. However, complex topologies increase the cost of dynamic membership updates. On the other hand, using a static topology, similar to a broadcast, bothers non-member threads and leads to high latencies for small groups. Therefore, mechanisms are needed that emulate dynamic multicast groups on top of a static hardware-optimized broadcast topology.

The basic idea is to decouple the logical from the physical topology: The role of intermediate non-member nodes that should not receive a multicast can be taken over by other nodes via shared memory. The logical topology can contain additional support nodes that do not represent actual processor cores or hardware threads.

This enables three mechanisms: Shared Memory Acknowledgment and Helping avoid bothering non-member nodes by taking over their roles. Skipping can speed up the helping by jumping over larger subgroups of non-member nodes.

The first subsection defines the necessary node types for such topologies. Then, the three mechanisms are discussed in more detail.

4.1 Node Types for Hierarchical Topologies

In Fig. 1, the three node types are illustrated by an example topology: Scatter nodes have a single predecessor and multiple successors. Their role is to parallelize the multicast’s propagation. Gather nodes have multiple predecessors and a single successor. Their role is to distribute the aggregation of the acknowledgments. Center nodes lie in between, having exactly one predecessor and one successor. The root of the topology is a node without a predecessor. At the opposite end, the tail node has no successor, and its role is to notify the multicast’s source about the global completion.

Fig. 1. An example topology labeled with node types.

Places represent the possible multicast receivers. For example, these can be processor cores or hardware threads. A multicast group is the set of places that shall process each multicast exactly once. The membership information needs to be readable from all places and can be implemented, for example, as an array of membership flags in shared memory. Each node of the topology is assigned to a place. Member nodes belong to places that are part of the multicast group. Some topologies require additional support nodes, which are never members of the multicast group, in order to prevent repeated processing of the same message.

In tree topologies, for example, the tree leaves become center nodes. Each intermediate tree node consists of a scatter node and a supporting gather node. On member scatter nodes, the message can first be forwarded, then processed, and, after that, acknowledged to the associated gather node. In contrast, ring-like topologies have no support nodes. Their gather nodes can be members and are responsible for message processing on their assigned place. Here, all member nodes have to forward the message only after processing it, because the propagation of the message implicitly acknowledges that it has been processed.
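
The two processing orders can be summarized by the following sketch; the node descriptor and the helper functions (forward, process_locally, acknowledge) are hypothetical.

```c
struct message;

/* Hypothetical node descriptor: each node is assigned to a place (core or
 * hardware thread) and knows its successors and the gather node that counts
 * its acknowledgment. */
struct node {
    int place;             /* assigned place, or -1 for pure support nodes */
    struct node **succ;    /* direct successors in the topology */
    int nsucc;
    struct node *gather;   /* where this node's acknowledgment is aggregated */
};

void forward(struct node *to, struct message *msg);        /* assumed primitives */
void process_locally(int place, struct message *msg);      /* e.g. TLB invalidation */
void acknowledge(struct node *gather, struct message *msg);

/* Member scatter node in a tree: forward first, then process, then
 * acknowledge to the associated gather node. */
void handle_tree_scatter(struct node *n, struct message *msg)
{
    for (int i = 0; i < n->nsucc; i++)
        forward(n->succ[i], msg);
    process_locally(n->place, msg);
    acknowledge(n->gather, msg);
}

/* Member node in a ring: process first, because forwarding the message
 * implicitly acknowledges local completion. */
void handle_ring_node(struct node *n, struct message *msg)
{
    process_locally(n->place, msg);
    forward(n->succ[0], msg);
}
```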

4.2 Shared Memory Acknowledgment (SmAck)

Gather nodes aggregate the acknowledgments from their predecessors. In a pure message-passing implementation, each predecessor would send an acknowledgment message, which is counted by the gather node. Such message transmissions over shared memory would cause more cache traffic than simply decrementing a shared counter. With SmAck, each predecessor decrements the gather node’s atomic counter via shared memory. Only when the counter reaches zero is a single message sent to the gather node.
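
A minimal sketch of this counting, using C11 atomics; the counter structure and send_message are illustrative.

```c
#include <stdatomic.h>

struct node;
struct message;
void send_message(struct node *to, struct message *msg);   /* assumed primitive */

/* Shared-memory acknowledgment: every predecessor decrements the counter;
 * only the last one sends an actual message to the gather node. */
struct gather_counter {
    atomic_int pending;          /* initialized to the number of predecessors */
    struct node *gather_node;
};

void acknowledge_smack(struct gather_counter *g, struct message *msg)
{
    if (atomic_fetch_sub(&g->pending, 1) == 1)    /* this was the last acknowledgment */
        send_message(g->gather_node, msg);        /* single wake-up message */
}
```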

4.3 Helping Non-member Nodes (Help)

Multicasts should not disrupt places that are currently not members of the multicast group. With a static topology, however, the respective non-member nodes are still needed for propagation and acknowledgment aggregation. Each node can check another node’s membership via shared memory. The Help mechanism forwards messages only to member nodes. For non-member successors, the sender node performs the successor’s propagation or aggregation role itself.

In other words, each node traverses the topology recursively along its non-member nodes and propagates the multicast message only to members. In combination with SmAck, this strategy reduces the acknowledgment aggregation on support gather nodes to the classic tree combining [23].
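
In code, the helping propagation can be sketched as a recursive traversal; the accessors and the membership array are illustrative, mirroring the node descriptor from Sect. 4.1.

```c
#include <stdbool.h>

struct message;
struct node;                       /* topology node as in Sect. 4.1 */
extern bool is_member[];           /* membership flags in shared memory, indexed by place */
int  node_place(const struct node *n);               /* -1 for support nodes */
int  node_nsucc(const struct node *n);
struct node *node_succ(const struct node *n, int i);
void forward(struct node *to, struct message *msg);

/* Help: the sender recursively takes over the propagation role of its
 * non-member successors, so only members ever receive a message. The
 * aggregation role of skipped gather nodes is covered by their SmAck
 * counters (Sect. 4.2). */
void propagate_help(struct node *n, struct message *msg)
{
    for (int i = 0; i < node_nsucc(n); i++) {
        struct node *s = node_succ(n, i);
        int place = node_place(s);
        if (place >= 0 && is_member[place])
            forward(s, msg);            /* member: deliver normally */
        else
            propagate_help(s, msg);     /* non-member or support node: do its work here */
    }
}
```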

4.4 Skipping Non-member Subgroups (Skip)

The Help mechanism has a drawback: With many non-member nodes, a few nodes will have to scan most of the membership flags and carry out most or all message transfers. As highlighted in Fig. 2, pairs of gather and scatter nodes recursively form brackets around a smaller group of nodes. The Skip mechanism uses this information to jump over entire hierarchical subgroups if such a subgroup contains no member. Checking a large set of membership flags at each node would be inefficient. Instead, tree combining [23] can be applied to track the membership state of each subgroup hierarchically.

Fig. 2. An example skipping hierarchy. The dotted arrows indicate skip targets.
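
The hierarchical membership counters behind Skip can be sketched as follows; the subgroup structure and function names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct node;   /* topology node as in Sect. 4.1 */

/* Each gather/scatter bracket keeps a member count that is maintained
 * hierarchically on join/leave (tree combining), so a sender can decide in
 * O(1) whether the enclosed subgroup can be jumped over. */
struct subgroup {
    atomic_int members;           /* members inside this bracket */
    struct subgroup *parent;      /* enclosing bracket, none at the root */
    struct node *skip_target;     /* node directly behind the closing gather node */
};

void join_group(struct subgroup *g)     /* a place becomes a member */
{
    for (; g; g = g->parent)
        atomic_fetch_add(&g->members, 1);
}

void leave_group(struct subgroup *g)    /* a place leaves the group */
{
    for (; g; g = g->parent)
        atomic_fetch_sub(&g->members, 1);
}

static inline bool can_skip(struct subgroup *g)
{
    return atomic_load(&g->members) == 0;   /* empty bracket: jump to skip_target */
}
```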

5 Evaluation

This section evaluates the performance of dynamic multicast groups on top of static topologies with a focus on latency. Our basic assumption is that cross-thread TLB invalidation has negligible overhead compared to the multicast itself. This allows us to implement a more portable benchmark based on multicasts without actual TLB invalidation. As described in Sect. 2.2, interrupt-based multicasts behave the same, just with additional latency along the longest path.

The first subsection summarizes the benchmark setup and the second subsection presents the latency results. This evaluation has obvious room for improvement: Comparing the overhead of group membership updates, integrating latency-optimized trees [11], and investigating the throughput with 2Tree algorithms [16] are left for future work.

5.1 Setup

The benchmark environment is based on user-space threads on top of Linux. Each thread is pinned to an individual hardware thread using an affinity mask. The multicast is propagated through active messages via shared memory FIFO queues based on [14]. All experiments were performed on a 60-core Intel XeonPhi 5110P (KNC) processor with a 1.053 GHz clock. For this platform, one thread per core is used and each polls actively for messages with a delay of 200 cycles when its queue is empty. If available, as in the more recent Intel XeonPhi Knights Landing architecture, MONITOR/MWAIT can be used to minimize the polling overhead.
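
The pinning can be realized, for example, with the Linux affinity API; the following sketch pins the calling benchmark thread to a given logical CPU.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the logical CPU id of its designated hardware
 * thread. Returns 0 on success, an error number otherwise. */
static int pin_to_hardware_thread(int hwthread)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(hwthread, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```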

The impact of the multicast group size is compared for \(n = 2, 4, 8, 16, 32, 60\) members. For each size, 32 configurations are generated by selecting random members and the measurement is repeated 16 times for each configuration. The median over all measurements is used to reduce the impact of the operating system noise. The Static variant uses a single topology across the 60 cores for all group sizes. In contrast, the Dynamic variant uses smaller topologies that span just the members.

Different topologies are used to investigate the impact on the multicast mechanisms and the general benefit of deeper topologies compared to the classic flat TLB invalidation. The Flat topology, see Fig. 3(a), mimics the conventional strategy as described in Sect. 2.2. The acknowledgments are counted in a single support node. The Tree topology, see Fig. 3(b), uses a 2-ary balanced tree. Finally, the Diamond topology, see Fig. 3(c), is based on diamond rings, in which the gather nodes are responsible for their own cores.

Fig. 3. Topologies for 8 threads. Helper nodes are grey, dotted arrows indicate possible skipping paths.

As baseline mechanism, SmAck uses just shared memory acknowledgment to decrease the amount of messages sent to gather nodes. The Help mechanism combines helping and shared memory acknowledgment. Finally, the Skip mechanism combines skipping, helping, and shared memory acknowledgment. Because there are no non-member nodes in the topologies of the Dynamic variant, Skip does not provide any benefits there and Help just implements tree combining for the gather phase.

5.2 Results

Figure 4 shows the results of the latency benchmark. The columns represent the three topologies (Flat, Tree, Diamond) and the rows represent the three mechanisms (SmAck, Help, Skip). The circles represent the Dynamic topology variants that consist only of the group members, and the triangles denote the Static variant that includes all 60 cores. The x-axis is the size of the multicast group and the y-axis shows the median latency.

The latency of the SmAck mechanism on the Dynamic topologies increases linearly with the group size for the flat topology and logarithmically for the hierarchical topologies. With the Static variant, the latency is almost constant, with a median around 77k cycles for Flat, 39k for Tree, and 38k for Diamond. This is to be expected because it involves all 60 cores independent of the group size. Thus, based on the Flat topology, the overhead for a single message is around 1280 cycles. The longest path in the Tree and Diamond topologies is 6 scatter nodes with 2 messages per node plus 6 gather nodes. This predicts a latency of 23k cycles. The remaining 16k cycles might be caused by additional overhead from navigating through the more complex topology.
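
Spelling out these estimates: the per-message overhead follows from the Flat latency over 60 cores, and the longest Tree/Diamond path consists of 18 message transmissions,

\[
\frac{77\text{k}}{60} \approx 1280 \text{ cycles per message}, \qquad (6 \cdot 2 + 6) \cdot 1280 \approx 23\text{k cycles}.
\]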

Fig. 4. Median latency for the different topologies and propagation mechanisms.

The latency of the Help mechanism on the Dynamic topologies is roughly equal to that of the pure SmAck mechanism, except on the Tree topology. The Static variants have a much higher latency than the Dynamic variants. On the Static Tree topology, the latency decreases from 38k to 31k cycles with growing group size. For all 60 members on the Tree topology, Help is 8k cycles faster than pure SmAck. This difference can be attributed to the tree combining during the gather phase. Based on the large difference on the Flat topology, it seems that the membership test of our implementation has a high overhead. This equally impacts Static Help on the other topologies. With growing group size, this overhead is hidden by the parallel propagation.

The Static Skip mechanism performs similarly to Dynamic Help on each topology but has slightly higher overhead. Compared to Static Help on the Tree and Diamond topologies, it has a much smaller latency for small groups. On the Flat topology, skipping never happens as long as there is at least one member. Skip effectively eliminates the overhead of the Static Help mechanism for small groups. However, this advantage vanishes for larger group sizes, where the additional overhead of checking for possible skips becomes visible.

In summary, Help on the Tree topology performed best for the Dynamic variant and Skip on the Tree topology performed best for the Static variant. The difference is negligible for large groups. For groups with just two members, the Dynamic variant (5.6k cycles) is 3x faster than the Static variant (15.5k cycles). However, the Dynamic variant likely has a larger overhead for topology updates when the group changes, which was not evaluated in this paper.

Comparing the Flat and Tree topologies, the latency can be halved from 77k to 37k cycles. Of course, this difference increases with the number of cores or hardware threads. Interrupts during the propagation would increase the latency to 78k cycles on the Flat and up to 43k cycles on the Tree topology. Hence, hierarchical topologies benefit more from interrupt-free multicasts than the conventional flat approach.

6 Conclusions

The first part of the paper examined TLB shootdowns as a practical example of invalidation multicasts on many-core processors. We proposed an interrupt-free TLB invalidation algorithm that exploits simultaneous multi-threading by dedicating superfluous hardware threads to the multicast processing. Similar algorithms are applicable to other kinds of locally shared caches, for example non-coherent instruction caches. Such interrupt-free multicasts reduce the operating system noise by not interrupting applications.

The second part evaluated the potential performance gains on the Intel XeonPhi Knights Corner many-core processor, with a focus on hierarchical multicast topologies and on strategies that exploit shared memory to skip non-members dynamically on top of an optimized static topology. The results show that the latency can be reduced significantly for large groups (2x for 60 cores) and that hierarchical topologies benefit more from interrupt-free propagation than the conventional flat approach. Therefore, TLB shootdowns should be redesigned for many-core processors. The impact on the peak throughput needs further investigation.