1 Introduction

Graphs are ubiquitous in nature and can be used to represent a wide variety of phenomena such as road networks, dependencies in databases, communications in distributed algorithms, interactions in social networks, and so forth. Nevertheless, phenomena where interactions between entities are not necessarily pairwise are more adequately modeled by hypergraphs, which can capture higher-order interactions [1]. With the massive proliferation of data, processing large-scale (hyper)graphs on distributed systems and databases becomes a necessity for a wide range of applications. When processing a (hyper)graph in parallel, k processors operate on distinct portions of the (hyper)graph while communicating with one another through message passing. To make the parallel processing efficient, an important preprocessing step consists of partitioning the vertices of the (hyper)graph into k roughly balanced blocks such that few (hyper)edges run between blocks. (Hyper)graph partitioning is NP-hard [2] and there can be no approximation algorithm with a constant ratio for general (hyper)graphs [3]. Thus, heuristics are used in practice. A current trend for partitioning huge (hyper)graphs quickly while using few computational resources is streaming algorithms [4,5,6,7,8,9,10,11,12].

The most popular streaming approach in the literature is the one-pass model [13], where vertices arrive one at a time together with their (hyper)edges and must be permanently assigned to blocks. In the domain of graphs, most algorithms are either very fast but oblivious to solution quality (such as Hashing [14]), or much slower but capable of computing significantly better solutions than random assignment (such as Fennel [4]). Recently, the gap between these groups of algorithms has been closed by a streaming multi-section algorithm [8] which is up to two orders of magnitude faster than Fennel while cutting only \(5\%\) more edges than it on average. In the domain of hypergraphs, a similar gap has not yet been closed: there is the same trivial Hashing-based algorithm on one side, and more sophisticated and expensive algorithms [11, 12] on the other.

In this work, we propose FREIGHT: a Fast stREamInG Hypergraph parTitioning algorithm that can optimize for the cut-net as well as the connectivity metric. By using an efficient data structure, we make the overall running time of FREIGHT linearly dependent on the pin-count of the hypergraph and the memory consumption linearly dependent on the numbers of nets and blocks. Our proposed algorithm demonstrates remarkable efficiency, with a running time comparable to the Hashing algorithm and a maximum discrepancy of only four in three quarters of the instances used in our main experimental evaluation. Importantly, our study establishes the superiority of FREIGHT over all current (buffered) streaming algorithms and even the in-memory algorithm HYPE, in both cut-net and connectivity measures. This shows the potential of our algorithm as a valuable tool for partitioning hypergraphs in the context of large and constantly changing data processing environments.

2 Preliminaries

2.1 Basic Concepts

Hypergraphs and Graphs. Let \(H=(V=\{0,\ldots , n-1\},E)\) be an undirected hypergraph with no multiple or self hyperedges, with \(n = |V|\) vertices and \(m = |E|\) hyperedges (or nets). A net is defined as a subset of V. The vertices that compose a net are called pins. A vertex \(v\in V\) is incident to a net \(e\in E\) if \(v \in e\). Let \(c: V \rightarrow \mathbb {R}_{\ge 0}\) be a vertex-weight function, and let \(\omega : E \rightarrow \mathbb {R}_{>0}\) be a net-weight function. We generalize the c and \(\omega \) functions to sets, such that \(c(V') = \sum _{v\in V'}c(v)\) and \(\omega (E') = \sum _{e\in E'}\omega (e)\). Let I(v) be the set of incident nets of v, let \(d(v) :=|I(v)|\) be the degree of v, let \(d_{\omega }(v) :=\omega (I(v))\) be the weighted degree of v, and let \(\Delta \) be the maximum degree of H. We generalize the notations d(.) and \(d_{\omega }(.)\) to sets, such that \(d(V') = \sum _{v\in V'}d(v)\) and \(d_{\omega }(V') = \sum _{v\in V'}d_{\omega }(v)\). Two vertices are adjacent if both are incident to the same net. Let the number of pins |e| in a net e be the size of e, and let \(\xi = \max _{e \in E}\{|e|\}\) be the maximum size of a net in H. The pin-count of a hypergraph is the total number of pins over all of its nets, i.e., the summed cardinality of its nets.
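To make these definitions concrete, the following is a small illustrative sketch in Python; the toy hypergraph and all function names are ours, not part of the paper:

```python
# Toy hypergraph: nets stored as frozensets of vertex ids (our representation).
nets = [frozenset({0, 1, 2}), frozenset({1, 3}), frozenset({2, 3})]

def incident_nets(v, nets):
    """I(v): indices of nets containing pin v."""
    return [i for i, e in enumerate(nets) if v in e]

def degree(v, nets):
    """d(v) = |I(v)|."""
    return len(incident_nets(v, nets))

pin_count = sum(len(e) for e in nets)   # summed cardinality of all nets
xi = max(len(e) for e in nets)          # maximum net size

assert degree(1, nets) == 2   # vertex 1 is a pin of the first two nets
assert pin_count == 7         # 3 + 2 + 2
assert xi == 3
```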

Let \(G=(V=\{0,\ldots , n-1\},E)\) be an undirected graph with no multiple or self edges, such that \(n = |V|\), \(m = |E|\). Let \(c: V \rightarrow \mathbb {R}_{\ge 0}\) be a vertex-weight function, and let \(\omega : E \rightarrow \mathbb {R}_{>0}\) be an edge-weight function. We generalize c and \(\omega \) functions to sets, such that \(c(V') = \sum _{v\in V'}c(v)\) and \(\omega (E') = \sum _{e\in E'}\omega (e)\). Let \(N(v) = \left\{ u\,:\,\left\{ v,u\right\} \in E\right\} \) denote the neighbors of v. A graph \(S=(V', E')\) is said to be a subgraph of \(G=(V, E)\) if \(V' \subseteq V\) and \(E' \subseteq E \cap (V' \times V')\). When \(E' = E \cap (V' \times V')\), S is an induced subgraph. Let d(v) be the degree of vertex v and \(\Delta \) be the maximum degree of G.

Partitioning. The (hyper)graph partitioning problem consists of assigning each vertex of a (hyper)graph to exactly one of k distinct blocks respecting a balancing constraint in order to minimize the weight of the (hyper)edges running between the blocks, i.e., the edge-cut (resp. cut-net). More precisely, it partitions V into k blocks \(V_1\),...,\(V_k\) (i.e., \(V_1\cup \cdots \cup V_k=V\) and \(V_i\cap V_j=\emptyset \) for \(i\ne j\)), which is called a k-partition of the (hyper)graph. The edge-cut (resp. cut-net) of a k-partition consists of the total weight of the cut edges (resp. cut nets), i.e., edges (resp. nets) crossing blocks. More formally, let the edge-cut (resp. cut-net) be \(\omega (E')\), in which \(E' :=\) \(\big \{e\in E: \exists \left\{ u,v\right\} \subseteq e: u\in V_i,v\in V_j, i \ne j\big \}\) is the cut-set (i.e., the set of all cut nets). The balancing constraint demands that the sum of vertex weights in each block does not exceed a threshold \(L_{\max }\) that is defined as \((1 + \epsilon )\) times the average block weight, where \(\epsilon \) is the permitted imbalance. More specifically, \(\forall i~\in ~\{1,\ldots ,k\} :\) \(c(V_i)\le L_{\max }:=\big \lceil (1+\epsilon ) \frac{c(V)}{k} \big \rceil \). For each net e of a hypergraph, \(\Lambda (e):= \{V_i~|~V_i \cap e \ne \emptyset \}\) denotes the connectivity set of e. The connectivity \(\lambda (e)\) of a net e is the cardinality of its connectivity set, i.e., \(\lambda (e):= |\Lambda (e)|\). The so-called connectivity metric is computed as \(\sum _{e\in E'} (\lambda (e) -1)~\omega (e)\), where \(E'\) is the cut-set. Due to its mathematical definition, the connectivity metric is also called the \(\lambda \)-1 metric.
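For illustration, both metrics can be computed directly from their definitions. The following sketch (our own naming, not from the paper) does so for a toy hypergraph and partition:

```python
def cut_metrics(nets, weights, block_of):
    """Compute the cut-net and connectivity (lambda-1) metrics of a
    k-partition. block_of maps vertex -> block id; weights[i] is the
    weight of net i."""
    cut_net = 0
    connectivity = 0
    for i, e in enumerate(nets):
        lam = len({block_of[v] for v in e})   # lambda(e) = |Lambda(e)|
        if lam > 1:                           # e is a cut net
            cut_net += weights[i]
            connectivity += (lam - 1) * weights[i]
    return cut_net, connectivity

nets = [{0, 1, 2}, {2, 3}, {0, 3}]
weights = [1, 2, 1]
block_of = {0: 0, 1: 0, 2: 1, 3: 1}
# Nets {0,1,2} and {0,3} span both blocks; {2,3} lies within block 1.
assert cut_metrics(nets, weights, block_of) == (2, 2)
```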

Streaming. Streaming algorithms usually follow an iterative load-compute-store logic. Our focus, and the most widely used streaming model, is the one-pass model. In this model, vertices of a (hyper)graph are loaded one at a time along with their incident (hyper)edges, then some logic is applied to permanently assign them to blocks, as illustrated in Fig. 1. Typically, incident edges are loaded with both their endpoints, whereas incident hyperedges are loaded without their pins. A similar sequence of operations is used to partition a stream of edges of a graph on the fly: edges are loaded one at a time along with their endpoints, then some logic is applied to permanently assign them to blocks. This logic can be as simple as a Hashing function or as complex as scoring all blocks based on some objective and then assigning the vertex to the block with the highest score. There are other, more sophisticated, streaming models such as the sliding window [15] and buffered streaming [6, 7], but these are beyond the scope of this work. With respect to graphs, it may be reasonably assumed that the streaming model allows for a memory capacity of \(O(n+k)\). However, in the context of hypergraphs, there are no strict memory constraints, as evidenced by the fact that some algorithms in the literature [11, 12] require O(mk) memory.

Row-Net Model. The so-called row-net model is a procedure proposed by Çatalyürek and Aykanat [16] for converting graphs into hypergraphs. This model converts a graph G with n vertices into a hypergraph H with n vertices and n nets. Each net v of H has as pins v and all the neighbors of vertex v in G. The original edges of G are not directly retained. This transformation has effective applications in sparse matrix–vector multiplication [16].
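A minimal sketch of the row-net conversion, assuming the graph is given as an adjacency dictionary (our representation, not prescribed by [16]):

```python
def row_net_model(adj):
    """Row-net model: produce one net per vertex v, containing v and
    all neighbors of v in G. adj maps vertex -> list of neighbors."""
    return {v: {v} | set(neigh) for v, neigh in adj.items()}

# Triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
nets = row_net_model(adj)
assert nets[2] == {0, 1, 2, 3}   # net 2 = vertex 2 plus its neighbors
assert len(nets) == len(adj)     # n vertices produce n nets
```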

Fig. 1. Typical layout of a streaming algorithm for hypergraph partitioning

2.2 Related Work

There is a huge body of research on (hyper)graph partitioning. The most prominent tools to partition (hyper)graphs in memory include PaToH [17], Metis [18], hMetis [19], Scotch [20], HYPE [21], KaHIP [22], KaMinPar [23], KaHyPar [24], Mt-KaHyPar [25], and mt-KaHIP [26]. The readers are referred to [27,28,29] for extensive material and references. Here, we focus on the results specifically related to the scope of this paper. In particular, we provide a detailed review for the following problems based on the one-pass streaming model: hypergraph partitioning and graph vertex partitioning.

Streaming Hypergraph Partitioning. Alistarh et al. [11] propose Min-Max, a one-pass streaming algorithm to assign the vertices of a hypergraph to blocks. For each block, this algorithm keeps track of nets which contain pins in it. This implies a memory consumption of O(mk). When a vertex is loaded, Min-Max allocates it to the block containing the largest intersection with its nets while respecting a hard constraint for load balance. The authors theoretically prove that their algorithm is able to recover a hidden co-clustering with high probability, where a co-clustering is defined as a simultaneous clustering of vertices and hyperedges. In the experimental evaluation, Min-Max outperforms five intuitive streaming approaches with respect to load imbalance, while producing solutions up to five times more imbalanced than internal-memory algorithms such as hMetis. The objective function of FREIGHT differs from that of Min-Max. In particular, FREIGHT optimizes simultaneously for a partitioning quality metric (connectivity or cut-net) and load balancing, which has proven to be an effective approach in the context of streaming graph partitioning [4, 14]. Furthermore, FREIGHT exhibits a lower runtime and complexity than Min-Max.

Taşyaran et al. [12] propose improved versions of the algorithm Min-Max [11]. The authors present Min-Max-N2P, a modified version of Min-Max that stores blocks containing each net’s pins instead of storing nets per block, as done in Min-Max. In their experiments, Min-Max-N2P is three orders of magnitude faster than Min-Max while keeping the same cut-net. The authors also introduce three algorithms with reduced memory usage compared to Min-Max: Min-Max-L \(\ell \), a modification of Min-Max-N2P that employs an upper-bound \(\ell \) to limit memory consumption per net, Min-Max-BF which utilizes Bloom filters for membership queries, and Min-Max-MH that uses hashing functions to replace the connectivity information between blocks and nets. In their experiments, their three algorithms reduce the running time in comparison to Min-Max, especially Min-Max-L \(\ell \) and Min-Max-MH, which are up to four orders of magnitude faster. On the other hand, the three algorithms generate solutions with worse cut-net than Min-Max, especially Min-Max-MH, which increases the cut-net by up to an order of magnitude. Moreover, the authors propose a technique to improve the partitioning decision in the streaming setting by including a buffer to store some vertices and their net sets. This approach operates similarly to Min-Max-N2P, but with the added ability to revisit buffered vertices and adjust their partition assignment based on the connectivity metric. The authors propose three algorithms using this buffered approach: REF that buffers every incoming vertex but only reassigns those that may improve connectivity, REF_RLX that buffers all vertices and reassigns all vertices in the buffer, and REF_RLX_SV that only buffers vertices with small net sets and reassigns all vertices in the buffer. 
Their experimental results show that the use of buffered approaches leads to a 5–\(20\%\) improvement in partitioning quality compared to non-buffered approaches, but with a trade-off of increased runtime. The objective function of FREIGHT differs from those of the algorithms above. In particular, FREIGHT jointly optimizes for partitioning quality and load balancing, whereas the other algorithms optimize for connectivity under self-imposed performance constraints. Furthermore, the runtime and memory complexity of FREIGHT is considerably lower than that of those algorithms that exhibit superior solution quality.

Streaming Graph Vertex Partitioning. Stanton and Kliot [14] introduced graph partitioning in the streaming model and proposed some heuristics to solve it. Their most prominent heuristics are the one-pass methods Hashing and linear deterministic greedy (LDG). In their experiments, LDG had the best overall edge-cut. In this algorithm, vertex assignments prioritize blocks containing more neighbors and use a penalty multiplier to control imbalance. Particularly, a vertex v is assigned to the block \(V_i\) that maximizes \(|V_i \cap N(v)|*\lambda (i)\) with \(\lambda (i)\) being a multiplicative penalty defined as \((1-\frac{|V_i|}{L_\text {max}})\). The intuition is that the penalty avoids overloading blocks that are already very heavy. In case of ties on the objective function, LDG assigns the vertex to the block with fewer vertices. Overall, LDG partitions a graph in \(O(m+nk)\) time. On the other hand, Hashing has running time O(n) but produces a poor edge-cut. Both LDG and FREIGHT optimize for cut quality and load balancing at the same time, but FREIGHT uses an additive combination, unlike the multiplicative approach used by LDG.
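The LDG assignment rule described above can be sketched as follows; this is a simplified illustration, and the function and variable names are ours:

```python
def ldg_assign(neighbors, blocks, l_max):
    """Assign one incoming vertex by the LDG rule: maximize
    |V_i & N(v)| * (1 - |V_i| / L_max) over blocks below capacity,
    breaking ties in favor of the smaller block."""
    best, best_score = None, None
    for i, members in enumerate(blocks):
        if len(members) >= l_max:      # hard balance constraint
            continue
        score = len(members & neighbors) * (1.0 - len(members) / l_max)
        if best is None or score > best_score or \
           (score == best_score and len(members) < len(blocks[best])):
            best, best_score = i, score
    return best

blocks = [{0, 1}, {2}]
# The vertex's two neighbors pull it into block 0 despite the penalty.
assert ldg_assign({0, 1}, blocks, l_max=4) == 0
# With block 0 at capacity, the vertex falls back to block 1.
assert ldg_assign({0, 1}, blocks, l_max=2) == 1
```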

Tsourakakis et al. [4] proposed Fennel, a one-pass partitioning heuristic based on the widely known clustering objective modularity [30]. Fennel assigns a vertex v to a block \(V_i\), respecting a balancing threshold, in order to maximize an expression of type \(|V_i\cap N(v)|-f(|V_i|)\), i.e., with an additive penalty. This expression is an interpolation of two properties: attraction to blocks with many neighbors and repulsion from blocks with many non-neighbors. When \(f(|V_i|)\) is a constant, the expression coincides with the first property. If \(f(|V_i|) = |V_i|\), the expression coincides with the second property. In particular, the authors defined the Fennel objective with \(f(|V_i|) = \alpha * \gamma * |V_i|^{\gamma -1}\), in which \(\gamma \) is a free parameter and \(\alpha = m \frac{k^{\gamma -1}}{n^{\gamma }}\). After a parameter tuning made by the authors, Fennel uses \(\gamma =\frac{3}{2}\), which provides \(\alpha =\sqrt{k}\frac{m}{n^{3/2}}\). Like LDG, Fennel partitions a graph in \(O(m+nk)\) time. While Fennel and FREIGHT are mathematically equivalent for graphs, FREIGHT has a considerably more efficient implementation. In the domain of hypergraphs, the mathematical definition of FREIGHT can be regarded as an extension of that of Fennel.
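For illustration, the Fennel objective with the authors' parameter choice can be sketched as follows; the balancing threshold is omitted for brevity, and all names are ours:

```python
def fennel_assign(neighbors, blocks, n, m, k, gamma=1.5):
    """Assign one incoming vertex by the Fennel rule: maximize
    |V_i & N(v)| - alpha * gamma * |V_i|**(gamma-1),
    with alpha = m * k**(gamma-1) / n**gamma."""
    alpha = m * k ** (gamma - 1) / n ** gamma
    def score(members):
        return len(members & neighbors) - alpha * gamma * len(members) ** (gamma - 1)
    return max(range(k), key=lambda i: score(blocks[i]))

blocks = [{0, 1}, {2}]
# Attraction to neighbors outweighs the additive penalty of block 0.
assert fennel_assign({0, 1}, blocks, n=4, m=4, k=2) == 0
# With no neighbors placed yet, repulsion favors the smaller block.
assert fennel_assign(set(), blocks, n=4, m=4, k=2) == 1
```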

Faraj and Schulz [8] propose a shared-memory streaming algorithm for vertex partitioning which performs recursive multi-sections on the fly. As a preliminary phase, their algorithm decomposes a k-way partitioning problem into a hierarchy containing \(\lceil \log _b k\rceil \) layers of b-way partitioning subproblems. This hierarchy can either reflect the topology of a high-performance system to solve a process mapping [31, 32] or be computed for an arbitrary k to solve a regular vertex partitioning. Then, an adapted version of Fennel is used to solve each of the subproblems in such a way that the whole k-partition is computed on the fly during a single pass over the graph. While producing an edge-cut around \(5\%\) worse than Fennel, their algorithm has theoretical complexity \(O((m+nb)\log _b k)\) and experimentally ran up to two orders of magnitude faster than Fennel. As Fennel and FREIGHT are mathematically identical in the domain of graphs, FREIGHT also cuts around \(5\%\) fewer edges than recursive multi-section on the fly. Furthermore, FREIGHT has a lower runtime complexity than that algorithm.

Besides the one-pass model, other streaming models have also been used to solve vertex partitioning. Nishimura and Ugander [33] introduce a restreaming approach to partition the vertices of a graph. Their approach is motivated by scenarios where the same graph is streamed multiple times. In their model, a one-pass partitioning algorithm can pass multiple times through the entire input while the edge-cut is iteratively reduced. The authors propose ReLDG and ReFennel, which are respective restreaming adaptations of linear deterministic greedy [14] (LDG) and Fennel [4]. On the one hand, ReLDG modifies the objective of LDG to account only for vertex assignments performed during the current pass when computing block weights. On the other hand, ReFennel uses the same objective as Fennel during restreaming, but its additive balancing degrading factor is increased after each pass in order to enforce balance. Additionally, the authors prove that ReFennel converges after a finite number of restreams even without increasing the degrading factor. Their experiments confirm that their restreaming methods can iteratively reduce edge-cut. As the restreaming model is iterative, it falls outside the scope of FREIGHT. However, as Fennel and FREIGHT are mathematically identical in the domain of graphs, a restreaming version of FREIGHT would be mathematically identical to ReFennel, with the advantage of a considerably lower runtime complexity.

Awadelkarim and Ugander [5] investigate how the order in which vertices are streamed influences one-pass graph partitioning. The authors introduce the notion of prioritized streaming, where (re)streamed vertices are statically or dynamically reordered based on some predefined priority. Their approach, which is a prioritized version of ReLDG, uses multiplicative weights of restreaming algorithms and adapts the ordering of the streaming process inspired by balanced label propagation. In their experiments, the authors consider a wide range of stream orderings. The minimum overall edge-cut is obtained using a dynamic vertex ordering based on their own metric ambivalence. This approach is closely followed by a static ordering based on vertex degree. Unlike methods relying on vertex ordering, FREIGHT aims to partition (hyper)graphs in a single pass without static or dynamic vertex reordering.

An extended streaming model is the buffered streaming model. During execution, a buffer or batch of input vertices and their neighborhoods is repeatedly loaded, or a sliding window is maintained in memory. Patwary et al. [15] propose WStream, a simple streaming graph partitioning algorithm that keeps a sliding window in memory. The authors allow a few hundred vertices in the sliding window in order to obtain more information about a vertex before it is permanently assigned to a block based on a greedy function. As soon as a vertex is allocated to a block, one more vertex is loaded from the input stream into the sliding window, which keeps the window size constant. In their experiments, WStream cuts fewer edges than LDG and more edges than offline multilevel partitioning for most tested graphs. The streaming model and the objective function utilized by WStream diverge from those employed by FREIGHT.

Jafari et al. [6] perform graph partitioning using a buffered streaming computational model. The authors propose a shared-memory algorithm which repeatedly loads a batch of vertices from the stream input, partitions it using a multilevel scheme, and then permanently assigns the vertices to blocks. Their multilevel scheme is based on a simplified structure where the one-pass algorithm LDG is used for coarsening, computing an initial partition, and refining it. They parallelize LDG in a vertex-centric way by simply splitting vertices among processors, which yields a parallelization of the three steps of their multilevel scheme. In their experiments, their algorithm cuts fewer edges than LDG while scaling better than offline partitioning algorithms. FREIGHT operates on a different computational model, with a distinct objective function and lower runtime complexity than the algorithm from Jafari et al. [6].

Faraj and Schulz [7] propose HeiStream, an algorithm which also partitions vertices in a buffered streaming model. Their algorithm loads a batch of vertices, builds a graph model, and then partitions this model with a multilevel algorithm. In their graph model, the vertices from previous batches assigned to each block are represented as a single big vertex fixed to the respective block. Analogously, an edge between a vertex v from the current batch b and a vertex \({\bar{v}}\) from a previous batch \({\bar{b}}\) is represented by an edge \(({\bar{b}},b)\). In addition, when a vertex from a current batch has a neighbor from a future batch (i.e., not yet streamed), their model compactly represents this neighbor in a contracted form. Their multilevel algorithm has a traditional structure and components, except that the initial partitioning is a one-pass execution of Fennel and the label propagation refinement also optimizes the objective function used by Fennel to assign vertices to blocks. In particular, the Fennel objective is extended to weighted graphs. In experiments, HeiStream cuts fewer edges than LDG, Fennel, and the buffered streaming algorithm proposed by Jafari et al. [6], while being faster than Fennel for large numbers of blocks. Like HeiStream, FREIGHT extends the mathematical definition of Fennel. However, FREIGHT has linear runtime complexity in general, while HeiStream only has linear complexity in the case where the number k of blocks is much smaller than the buffer size. Otherwise, HeiStream has a much larger runtime complexity.

3 FREIGHT: Fast Streaming Hypergraph Partitioning

In this section, we provide a detailed explanation of our algorithmic contribution. First, we define our algorithm named FREIGHT. Next, we present the advantages and disadvantages of using two different formats for streaming hypergraphs and partitioning them using FREIGHT. Additionally, we explain how we have removed the dependency on k from the complexity of FREIGHT by implementing an efficient data structure for block sorting.

3.1 Mathematical Definition

In this section, we provide a mathematical definition for FREIGHT by expanding the idea of Fennel to the domain of hypergraphs. Recall that, assuming the vertices of a graph being streamed one-by-one, the Fennel algorithm assigns an incoming vertex v to a block \(V_d\) where d is computed as follows:

$$\begin{aligned} d = \mathop {\textrm{argmax}}\limits _{i,~|V_i| < L_{\max }}\big \{|V_i\cap N(v)|-\alpha * \gamma * |V_i|^{\gamma -1}\big \} \end{aligned}$$
(1)

The term \(-\alpha * \gamma * |V_i|^{\gamma -1}\), which penalizes block imbalance in Fennel, is directly used in FREIGHT without modification and with the same meaning. The term \(|V_i\cap N(v)|\), which minimizes edge-cut in Fennel, needs to be adapted in FREIGHT to minimize the intended metric, i.e., either cut-net or connectivity. Before explaining how this is adapted, recall that, in contrast to graph partitioning, in hypergraph partitioning the incident nets I(v) of an incoming vertex v might contain nets that are already cut, i.e., with pins assigned to multiple blocks. The version of FREIGHT designed to optimize for connectivity accounts for already cut nets by keeping track of the block \(d_e\) to which the most recently streamed pin of each net e has been assigned. More formally, the connectivity version of FREIGHT assigns an incoming vertex v of a hypergraph to a block \(V_d\) with d given by Equation (2), where \(I^i_{obj}(v) = I^i_{con}(v) =\) \(\{ e \in I(v): d_e = i\}\). On the other hand, the version of FREIGHT designed to optimize for cut-net ignores already cut nets, since their contribution to the overall cut-net of the hypergraph k-partition is fixed and cannot be changed anymore. More formally, the cut-net version of FREIGHT assigns an incoming vertex v of a hypergraph to a block \(V_d\) with d given by Equation (2), where \(I^i_{obj}(v) = I^i_{cut}(v) =\) \(I^i_{con}(v) \setminus E'\) and \(E'\) is the set of already cut nets.

$$\begin{aligned} d = \mathop {\textrm{argmax}}\limits _{i,~|V_i| < L_{\max }}\big \{|I^i_{obj}(v)|-\alpha * \gamma * |V_i|^{\gamma -1}\big \} \end{aligned}$$
(2)

Both configurations of FREIGHT interpolate two objectives: favoring blocks with many incident (uncut) nets and penalizing blocks with large cardinality. We briefly highlight that FREIGHT can be adapted for weighted hypergraphs. In particular, when dealing with weighted nets, the term \(|I^i_{obj}(v)|\) is substituted by \(\omega (I^i_{obj}(v))\). Likewise when dealing with weighted vertices, the term \(-\alpha * \gamma * |V_i|^{\gamma -1}\) is substituted by \(-c(v) * \alpha * \gamma * c(V_i)^{\gamma -1}\), where the weight c(v) of v is used as a multiplicative factor in the penalty term.
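The assignment rule of Equation (2), covering both the connectivity and the cut-net variant, can be sketched as follows. This is a naive illustration with cost O(k) per vertex and our own naming (Sect. 3.3 describes the efficient implementation); after each assignment, the caller would update the block sizes and, for every incident net, its \(d_e\) entry and cut status:

```python
def freight_assign(incident, d_e, cut, sizes, l_max, alpha, gamma=1.5,
                   objective="connectivity"):
    """Solve Equation (2) naively for one incoming vertex v.
    incident: list of net ids of I(v); d_e[e]: block of the most
    recently streamed pin of net e (None if unassigned); cut[e]: whether
    net e is already cut; sizes[i]: current cardinality |V_i|."""
    gains = {}                       # |I^i_obj(v)| per block i
    for e in incident:
        if d_e[e] is None:
            continue                 # net has no assigned pin yet
        if objective == "cutnet" and cut[e]:
            continue                 # cut-net variant ignores cut nets
        gains[d_e[e]] = gains.get(d_e[e], 0) + 1
    feasible = [i for i in range(len(sizes)) if sizes[i] < l_max]
    return max(feasible,
               key=lambda i: gains.get(i, 0)
                             - alpha * gamma * sizes[i] ** (gamma - 1))

d_e = {0: 0, 1: 1, 2: None}
cut = {0: True, 1: False, 2: False}
# Connectivity counts the cut net 0 toward block 0; cut-net ignores it.
assert freight_assign([0, 1, 2], d_e, cut, [2, 2], 4, 0.5) == 0
assert freight_assign([0, 1, 2], d_e, cut, [2, 2], 4, 0.5,
                      objective="cutnet") == 1
```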

3.2 Streaming Hypergraphs

In this section, we present and discuss the streaming model used by FREIGHT. Recall that, in the streaming model for graphs, vertices are loaded one at a time along with their adjacency lists. Thus, just streaming the graph (without doing additional computations) implies a time cost of \(O(m+n)\). In our model, the vertices of a hypergraph are loaded one at a time along with their incident nets, as illustrated in Fig. 1. Our streaming model implies a time cost of \(O(\sum _{e \in E}{|e|} + n)\) just to stream the hypergraph, where \(O(\sum _{e \in E}{|e|})\) is the cost to stream each net e exactly |e| times. FREIGHT uses \(O(m+k)\) memory, with O(m) being used to keep track, for each net e, of its cut/uncut status as well as the block \(d_e\) to which its most recently streamed pin was assigned. This net-tracking information, which substitutes for keeping track of individual vertex assignments, is necessary for executing FREIGHT. Although FREIGHT consumes more memory than graph-based streaming algorithms, which often use \(O({n}+k)\) memory, it is still far better than the O(mk) worst-case memory required by the state-of-the-art algorithms for streaming hypergraph partitioning [11, 12], all of which are also based on a computational model that implies a time cost of \(O(\sum _{e \in E}{|e|} + n)\) just to stream the hypergraph.

3.3 Efficient Implementation

In this section, we describe an efficient implementation for FREIGHT. Recall that, for every vertex v that is loaded, FREIGHT uses Equation (2) to find the block with the highest score among up to k options. A simple method to accomplish this task consists of explicitly evaluating the score for each block and identifying the one with the highest score. This results in a total of O(nk) evaluations, leading to an overall complexity of \(O(\sum _{e \in E}{|e|}+nk)\). We propose an implementation that is significantly more efficient than this approach.

For each loaded vertex v, our implementation separates the blocks \(V_i\) for which \(|V_i|<L_{\max }\) into two disjoint sets, \(S_1\) and \(S_2\). In particular, the set \(S_1\) comprises blocks \(V_i\) where \(|I^i_{obj}(v)|>0\), while the set \(S_2\) comprises the remaining blocks, i.e., blocks \(V_i\) for which \(|I^i_{obj}(v)|=0\). Using the sets provided, we break down Equation (2) into Equations (3) and (4), which are solved separately. The resulting solutions are compared based on their FREIGHT scores to ultimately find the solution for Equation (2). The overall process is illustrated in Fig. 2.

$$\begin{aligned} d = \mathop {\textrm{argmax}}\limits _{i \in S_1}\big \{|I^i_{obj}(v)|-\alpha * \gamma * |V_i|^{\gamma -1}\big \} \end{aligned}$$
(3)
$$\begin{aligned} d = \mathop {\textrm{argmax}}\limits _{i \in S_2}\big \{|I^i_{obj}(v)|-\alpha * \gamma * |V_i|^{\gamma -1}\big \} = \mathop {\textrm{argmin}}\limits _{i \in S_2}|V_i| \end{aligned}$$
(4)
Fig. 2. Illustration of the process to solve Equation (2) for an incoming vertex u with \(k=512\) blocks. (a) The k blocks are decomposed into \(S_1\) and \(S_2\), with \(|S_1| = O(|I(u)|)\). (b) Equation (3) is explicitly solved at cost O(|I(u)|). (c) Equation (4) is implicitly solved at cost O(1). (d) Both solutions are then evaluated using their FREIGHT scores to determine the final solution for Equation (2)

Now we explain how we solve Equations (3) and (4). The solution to Equation (3) is obtained through explicit enumeration of its terms, resulting in the runtime cost outlined in Theorem 1. In contrast, Equation (4) is implicitly solved by identifying the block with minimal cardinality. We use an efficient data structure to keep all blocks sorted by cardinality throughout the entire execution, which enables us to solve Equation (4) in constant time.
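The combination of Equations (3) and (4) can be sketched as follows, assuming the block of minimal cardinality is available in O(1) from the sorted data structure described below; the balance constraint is omitted for brevity and all names are ours:

```python
def pick_block(gains, sizes, min_card_block, alpha, gamma=1.5):
    """Combine Equations (3) and (4): explicitly score only the blocks
    in S_1 (those with gains > 0), then compare against the block of
    minimal cardinality, which solves Equation (4) implicitly.
    gains: dict block -> |I^i_obj(v)|; sizes[i]: |V_i|."""
    def score(i):
        return gains.get(i, 0) - alpha * gamma * sizes[i] ** (gamma - 1)
    best = min_card_block            # candidate from S_2, obtained in O(1)
    for i in gains:                  # S_1: only O(|I(v)|) candidates
        if score(i) > best and False or score(i) > score(best):
            best = i
    return best

# Block 0 has 2 incident nets and wins over the smallest block 1.
assert pick_block({0: 2}, [3, 1, 2], min_card_block=1, alpha=0.5) == 0
# With no gains (S_1 empty), the minimal-cardinality block wins.
assert pick_block({}, [3, 1, 2], min_card_block=1, alpha=0.5) == 1
```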

Theorem 1

Equation (3) can be solved in time O(|I(v)|).

Proof

The terms \(|I^i_{obj}(v)|\) in Equation (3) can be computed by iterating through the nets of v at a cost of O(|I(v)|) and determining their status as cut, unassigned, or assigned to a block. The calculation of the factors \(-\alpha * \gamma * |V_i|^{\gamma -1}\) in Equation (3) can be done in time \(O(|S_1|) = O(|I(v)|)\), thus completing the proof. \(\square \)

Now we explain our data structure to keep the blocks sorted by cardinality during the whole algorithm execution. The data structure is implemented with two arrays A and B, both with k elements, and a list L. The array A stores all k blocks, always in ascending order of cardinality. The array B maps the index i of a block \(V_i\) to its position in A. Each element in the list L represents a bucket. Each bucket is associated with a unique block cardinality and contains the leftmost and the rightmost positions \(\ell \) and r of the range of blocks in A which currently have this cardinality. Reciprocally, each block in A has a pointer to the unique bucket in L corresponding to its cardinality. To begin the algorithm, L is set up with a single bucket for cardinality 0 which covers the k positions of A, i.e., its parameters \(\ell \) and r are 1 and k, respectively. The blocks in A may initially appear in any order; since each block starts with cardinality 0, they are trivially ordered by their cardinalities.

Fig. 3. Illustration of our data structure used to keep the blocks sorted by cardinality throughout the execution of FREIGHT. The array A is represented as a vertical rectangle. Each region of A is covered by a unique bucket, which is represented by a unique color filling the corresponding region in A, and the cardinality associated with each bucket is written in the middle of the region of A covered by it. The figure shows the behavior of the data structure when assigning vertices to the block surrounded by a dotted rectangle five times consecutively

When a vertex is assigned to a block \(V_d\), we update our data structure as detailed in Algorithm 1 and exemplified in Fig. 3. We describe Algorithm 1 in detail now. In line 1, we find the position p of \(V_d\) in A and find the bucket C associated with it. In line 2, we exchange the content of two positions in A: the position where \(V_d\) is located and the position identified by the variable r in C, which marks the rightmost block in A covered by C. This variable r is afterwards decremented in line 3, since \(V_d\) is no longer covered by the bucket C. In lines 4 and 5, we check if the new (increased) cardinality of \(V_d\) matches the cardinality of the block located right after it in A. If so, we associate \(V_d\) with that block's bucket and decrement this bucket's leftmost position \(\ell \) in line 6; otherwise, we push a new bucket to L and match it to \(V_d\) adequately in lines 8 and 9. Finally, in line 10, we delete C in case its range \([\ell ,r]\) is empty. Figure 3 shows our data structure through five consecutive executions of Algorithm 1. Theorem 2 proves the correctness of our data structure. Theorem 3 shows that, using our proposed data structure, we need time O(1) to either solve Equation (4) or prove that the solution for Equation (3) solves Equation (2). Note that our data structure can only handle unweighted vertices. In the case of weighted vertices, a bucket queue can be used instead, resulting in the same overall complexity but requiring \(O(k+L_{\max })\) memory, while our data structure only requires O(k) memory. The overall complexity of FREIGHT, which follows directly from Theorem 1 and Theorem 3, is expressed in Corollary 4.

Algorithm 1

Increment cardinality of block \(V_d\) in the proposed data structure

Theorem 2

Our proposed data structure keeps the blocks within array A consistently sorted in ascending order of cardinality.

Proof

We inductively prove two claims at the same time: (a) the variables \(\ell \) and r contained in each bucket from L respectively store the leftmost and the rightmost positions of the unique range of blocks in A which currently have this cardinality; (b) the array A contains the blocks sorted in ascending order of cardinality. Both claims are trivially true at the beginning, since all blocks have cardinality 0 and L is initialized with a single bucket with \(\ell =1\) and \(r=k\). Now assuming that (a) and (b) are true at some point, we show that they keep being true after Algorithm 1 is executed. Note that line 2 performs the only position exchange in A throughout the whole algorithm. As (a) is assumed, it is the case that \(V_d\) swaps positions with the rightmost block in A containing the same cardinality of \(V_d\). Since the cardinality of \(V_d\) will be incremented by one and all blocks have integer cardinalities, this concludes the proof of (b). To prove that (a) remains true, note that the only buckets in L that are modified are C (line 3), \(C^\prime \) (line 6), and \(C^{\prime \prime }\) (line 9). Claim (a) remains true for C because \(V_d\), whose cardinality will be incremented, is the only block removed from its range. Claim (a) remains true for \(C^\prime \) because line 6 is only executed if the new cardinality of \(V_d\) equals the cardinality of \(C^\prime \), whose current range starts right after the new position of \(V_d\) in A. Bucket \(C^{\prime \prime }\) is only created if the new cardinality of \(V_d\) is respectively larger and smaller than the cardinalities of C and \(C^\prime \). Since (b) is true, then this condition only happens if there is no block in A with the same cardinality as the new cardinality of \(V_d\). Hence, claim (a) remains true for \(C^{\prime \prime }\), which is created covering only the position of \(V_d\) in A. \(\square \)

Theorem 3

By utilizing our proposed data structure, solving Equation (4) or demonstrating that any solution for Equation (3) is also a solution for Equation (2) can be accomplished in O(1) time.

Proof

Algorithm 1 contains no loops and each command in it has a complexity of O(1), thus the total cost of the algorithm is O(1). Our data structure executes Algorithm 1 once for each assigned vertex, hence it costs O(1) per vertex. Say we are evaluating an incoming vertex v. According to Theorem 2, the block \(V_d\) with minimum cardinality is stored in the first position of the array A, hence it can be accessed in time O(1). In case \(V_d \in S_2\), then d is a solution for Equation (4). On the other hand, if \(V_d\) is in \(S_1\), the FREIGHT score of \(V_d\) will be larger than the FREIGHT score of the solution for Equation (4) by at least \(|I_d(v)| > 0\). In this case, it follows that any solution for Equation (3) solves Equation (2). \(\square \)

Corollary 4

The overall complexity of FREIGHT is \(O\big (\sum _{e \in E}{|e|} + n\big )\).

4 Experimental Evaluation

Setup. Our implementations are written in C++ and compiled with gcc 11.2 with full optimization turned on (-O3 flag). Unless mentioned otherwise, all experiments are performed on a single core of a machine with a sixteen-core Intel Xeon Silver 4216 processor running at 2.1 GHz, 100 GB of main memory, 16 MB of L2-Cache, and 22 MB of L3-Cache, running Ubuntu 20.04.1. The machine can handle 32 threads with hyperthreading. Unless otherwise mentioned, we stream (hyper)graphs directly from internal memory to obtain clean running time comparisons. However, note that FREIGHT, as well as most of the other algorithms used, can also stream the hypergraphs from hard disk.

Baselines. We compare FREIGHT against various state-of-the-art algorithms. In this section we will list these algorithms and explain our criteria for algorithm selection. We have implemented Hashing in C++, since it is a simple algorithm. It basically consists of hashing the IDs of incoming vertices into \(\{1,\ldots ,k\}\). The remaining algorithms were obtained either from official repositories or privately from the authors, with the exception of Min-Max, for which there is no official implementation available. Here, we use the Min-Max implementations by Taşyaran et al. [12]. All algorithms were compiled with gcc 11.2.
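A Hashing baseline of this kind can be sketched in a few lines. The concrete hash function below (a splitmix64-style integer mixer) is only an illustrative choice, not necessarily the one used in our implementation; any reasonable integer hash works.

```cpp
#include <cstdint>

// Mix the vertex ID so that consecutive IDs spread over all blocks.
// This finalizer is splitmix64-style; the constants are standard.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hashing baseline: map an incoming vertex ID to a block in {0, ..., k-1}.
inline int hash_block(std::uint64_t vertex_id, int k) {
    return static_cast<int>(mix64(vertex_id) % static_cast<std::uint64_t>(k));
}
```

Each vertex is assigned in O(1) time, independently of its hyperedges, which is why Hashing serves as a lower bound for partitioning time.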

We run Hashing, Min-Max [11], and all its improved versions proposed by Taşyaran et al. [12], namely Min-Max-BF, Min-Max-N2P, Min-Max-L \(\ell \), Min-Max-MH, REF, REF_RLX, and REF_RLX_SV (see Sect. 2.2 for details on the different Min-Max versions), as well as HYPE [21] and PaToH v3.3 [17]. Hashing is relevant because it is the simplest and fastest streaming algorithm, which gives us a lower bound for partitioning time. Min-Max is a current state-of-the-art for streaming hypergraph partitioning in terms of cut-net and connectivity. The improved and buffered versions of Min-Max proposed in [12] are relevant because some of them are orders of magnitude faster than Min-Max while others produce improved partitions in comparison to it. HYPE and PaToH are in-memory algorithms for hypergraph partitioning, hence they are not suitable for the streaming setting. However, we compare against them because HYPE is among the fastest in-memory algorithms while PaToH is very fast and also computes partitions with very good cut-net and connectivity. Note that KaHyPar [24] is the leading tool with respect to solution quality, however it is also much slower than PaToH.

Instances. We selected hypergraphs from various sources to test our algorithm. The considered hypergraphs were used as benchmarks in previous works on hypergraph partitioning. Prior to each experiment, we converted all hypergraphs to the appropriate streaming formats required by each algorithm. We removed parallel and empty hyperedges and self-loops, and assigned unit weight to all vertices and hyperedges. In all experiments with streaming algorithms, we stream the hypergraphs in the natural given order of the vertices. We use a number of blocks \(k \in \{512,1024,1536,2048,2560\}\) unless mentioned otherwise. We allow a fixed imbalance of \(3\%\) for all experiments (and all algorithms) since this is a frequently used value in the partitioning literature. All algorithms always generated balanced partitions, except for HYPE, which generated highly unbalanced partitions in around \(5\%\) of its experiments.

We use the same benchmark as in [24]. This consists of 310 hypergraphs from three benchmark sets: 18 hypergraphs from the ISPD98 Circuit Benchmark Suite [34], 192 hypergraphs based on the University of Florida Sparse Matrix Collection [35], and 100 instances from the international SAT Competition 2014 [36]. The SAT instances were converted into hypergraphs by mapping each boolean variable and its complement to a vertex and each clause to a net. From the Sparse Matrix Collection, one matrix was selected for each application area that had between 10 000 and 10 000 000 columns. The matrices were converted into hypergraphs using the row-net model, in which each row is treated as a net and each column as a vertex.
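The row-net conversion can be sketched as follows. This is an illustrative helper, not code from our benchmark pipeline: the coordinate-list input format and the names (Hypergraph, row_net_model) are assumptions for the example.

```cpp
#include <utility>
#include <vector>

// A hypergraph with n vertices; nets[r] lists the vertices of net r.
struct Hypergraph {
    int n;
    std::vector<std::vector<int>> nets;
};

// Row-net model: each column of the matrix becomes a vertex and each row
// becomes a net containing the columns with a nonzero entry in that row.
Hypergraph row_net_model(int num_cols, int num_rows,
                         const std::vector<std::pair<int, int>>& nonzeros) {
    Hypergraph h{num_cols, std::vector<std::vector<int>>(num_rows)};
    for (const auto& rc : nonzeros)
        h.nets[rc.first].push_back(rc.second);  // row -> net, column -> vertex
    return h;
}
```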

Methodology. Depending on the focus of the experiment, we measure running time, cut-net, and/or connectivity. We perform 5 repetitions per algorithm and instance using random seeds for non-deterministic algorithms, and calculate the arithmetic average of the computed objective function and running time per instance. When further averaging over multiple instances, we use the geometric mean in order to give every instance the same influence on the final score.

Given a result of an algorithm A, we express its value \(\sigma _A\) (which can be objective or running time) as improvement over an algorithm B, computed as \(\big (\frac{\sigma _B}{\sigma _A}-1\big )*100\%\). We also use performance profiles to represent results. They relate the running time (quality) of a group of algorithms to the fastest (best) one on a per-instance basis (rather than grouped by k). The x-axis shows a factor \(\tau \) while the y-axis shows the percentage of instances for which A has up to \(\tau \) times the running time (quality) of the fastest (best) algorithm. Bar charts and box plots are also employed to represent our findings. We use bar charts to visualize the average value of an objective function in relation to k, where each algorithm is represented by vertical bars of a given color with origin on the x-axis. The bars for every value of k have a common origin and are arranged in terms of their height, allowing all heights to be visible. We use box plots to give a clear picture of the dataset distribution by displaying the minimum, maximum, median, first and third quartiles, while disregarding outliers.
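The two aggregates used throughout this section can be written down directly; the function names here are ours and the geometric mean assumes strictly positive per-instance values.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Improvement of algorithm A over baseline B: (sigma_B / sigma_A - 1) * 100%.
double improvement_pct(double sigma_A, double sigma_B) {
    return (sigma_B / sigma_A - 1.0) * 100.0;
}

// Geometric mean over per-instance scores, so that every instance has the
// same influence on the final value regardless of its absolute magnitude.
double geo_mean(const std::vector<double>& xs) {
    double log_sum = 0.0;
    for (double x : xs) log_sum += std::log(x);  // assumes x > 0
    return std::exp(log_sum / static_cast<double>(xs.size()));
}
```

For example, halving a baseline's cut from 4 to 2 is a \(100\%\) improvement, and the geometric mean of 2 and 8 is 4 (not 5, as the arithmetic mean would give).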

Fig. 4

Comparison against the state-of-the-art streaming algorithms for hypergraph partitioning. We show performance profiles, improvement plots over Hashing, and box plots. Note that PaToH-con, PaToH-cut, and Hashing align almost perfectly with the y-axis in Figs. 4b, 4d, and 4f, respectively. Also the curves and bars of MM-N2P and MM-L5 roughly overlap with one another in Fig. 4d and Fig. 4c

4.1 Results

In this section, we show experiments in which we compare FREIGHT against the current state-of-the-art of streaming hypergraph partitioning. As already mentioned, we also use two internal-memory algorithms [17, 21] as more general baselines for comparison. We focus our experimental evaluation on the comparison of solution quality and running time. Observe that PaToH and FREIGHT have distinct versions designed to optimize for each quality metric (i.e., connectivity and cut-net). For a meaningful comparison, we only take into account the relevant version when dealing with each quality metric; however, both versions are still considered for running time comparisons. To differentiate between the versions, the suffixes -con and -cut are added to represent the connectivity-optimized and cut-net-optimized versions, respectively. For clarity, we refrain from discussing streaming algorithms that are dominated by another algorithm. We define a dominated algorithm as one that has worse running time compared to another without offering a superior solution quality in return, or vice versa. In particular, we leave out Min-Max and Min-Max-BF since they are dominated by Min-Max-N2P, which is referred to as MM-N2P hereafter. Similarly, we omit Min-Max-MH because it is dominated by Hashing. We use a buffer size of \(15\%\) for testing the buffered algorithms REF, REF_RLX, and REF_RLX_SV, following the best results outlined in [12]. We omit the first two of them since they are dominated by the latter one, which is referred to as RRS(0.15) from now on. Since Min-Max-L \(\ell \) is not dominated by any other algorithm, we report its results with \(\ell =5\), following the best results in [12], and refer to it as MM-L5 from this point.

Connectivity. We start by looking at the connectivity metric. In Fig. 4a, we plot the average connectivity improvement over Hashing for each value of k. PaToH-con produces the best connectivity on average, yielding an average improvement of \(443\%\) when compared to Hashing. This is in line with previous works in the area of (hyper)graph partitioning, i.e. streaming algorithms typically compute worse solutions than internal memory algorithms, which have access to the whole graph. FREIGHT-con is found to be the second best algorithm in terms of connectivity, outperforming both the internal memory algorithm HYPE and the buffered streaming algorithm RRS(0.15). On average, these three algorithms improve \(194\%\), \(171\%\), and \(136\%\) over Hashing, respectively. Finally, MM-N2P and MM-L5 compute solutions which improve \(111\%\) and \(96\%\) over Hashing on average, respectively. In direct comparison, FREIGHT-con shows average connectivity improvements of \(8\%\), \(24\%\), \(39\%\), and \(50\%\) over HYPE, RRS(0.15), MM-N2P, and MM-L5, respectively. Note that each algorithm retains its relative ranking in terms of average connectivity over all values of k.

In Fig. 4b, we plot connectivity performance profiles across all experiments. PaToH-con produces the best overall connectivity for \(96.4\%\) of the instances, while FREIGHT-con produces the best connectivity for \(3.1\%\) of the instances and no other algorithm computes the best connectivity for more than \(0.35\%\) of the instances. The connectivity produced by FREIGHT-con, HYPE, RRS(0.15), MM-N2P, MM-L5, and Hashing are within a factor 2 of the best found connectivity for \(67\%\), \(61\%\), \(47\%\), \(41\%\), \(34\%\), and \(9\%\) of the instances, respectively. In summary, FREIGHT-con produces the best connectivity among (buffered) streaming competitors, outperforming even in-memory algorithm HYPE.

Cut-Net. Next, we examine the cut-net metric. In Fig. 4c, we plot the cut-net improvement over Hashing. PaToH-cut produces the best overall cut-net, with an average improvement of \(100\%\) compared to Hashing. FREIGHT-cut is found to be the second best algorithm with respect to cut-net, superior to the internal-memory algorithm HYPE and the buffered streaming algorithm RRS(0.15). These three algorithms improve cut-net over Hashing by \(37\%\), \(30\%\), and \(17\%\), respectively. Finally, both MM-N2P and MM-L5 improve cut-net by \(13\%\) on average over Hashing. In direct comparison, FREIGHT-cut shows average cut-net improvements of \(6\%\), \(18\%\), \(22\%\), and \(22\%\) over HYPE, RRS(0.15), MM-N2P, and MM-L5, respectively. Each algorithm preserves its relative ranking in average cut-net across all values of k.

In Fig. 4d, we plot cut-net performance profiles across all experiments. In the plot, PaToH-cut produces the best overall cut-net for \(98.0\%\) of the instances, while FREIGHT-cut and HYPE produce the best cut-net for \(6.8\%\) and \(5.2\%\) of the instances and all other streaming algorithms (RRS(0.15), MM-N2P, MM-L5, and Hashing) produce the best cut-net for \(4.8\%\) of the instances. The cut-net results produced by FREIGHT-cut, HYPE, RRS(0.15), MM-N2P, MM-L5, and Hashing are within a factor 2 of the best found cut-net for \(83\%\), \(79\%\), \(69\%\), \(66\%\), \(66\%\), and \(58\%\) of the instances, respectively. This shows that FREIGHT-cut produces the best cut-net among all (buffered) streaming competitors and even beats the in-memory algorithm HYPE.

Running Time. Now we compare the algorithms’ runtime. Boxes and whiskers in Fig. 4e display the distribution of the running time per pin, measured in nanoseconds, for all instances. Hashing, FREIGHT-cut, and FREIGHT-con are the three fastest algorithms, with median runtime per pin of 15ns, 38ns, and 41ns, respectively. MM-L5, MM-N2P, HYPE, and RRS(0.15) follow with median runtime per pin of 130ns, 437ns, 792ns, and 833ns, respectively. Lastly, the algorithms with the highest median runtime per pin are PaToH-cut and PaToH-con, with 2516ns and 3333ns respectively. The measured runtime per pin for both HYPE and PaToH align with values reported in prior research [37].

In Fig. 4f, we show running time performance profiles. Hashing is the fastest algorithm for \(98.3\%\) of the instances, while FREIGHT-cut is the fastest one for \(1.2\%\) of the instances and no other algorithm is the fastest one for more than \(0.4\%\) of the instances. Cache misses partially explain why FREIGHT-cut is faster than Hashing for some instances, as Hashing is a memory-bound algorithm. Another possible explanation is noise, as FREIGHT-cut ignores nets that are already cut, hence it behaves similarly to Hashing if all remaining nets are already cut. The running time of FREIGHT-cut and FREIGHT-con is within a factor 4 of that of Hashing for \(82\%\) and \(72\%\) of instances, respectively. In contrast, this occurs for only \(16\%\) of instances for MM-L5, and for less than \(0.4\%\) of instances for all other algorithms. The closeness of FREIGHT's running time to Hashing's is surprising given FREIGHT’s superior solution quality compared to Hashing, all other streaming algorithms, and even HYPE.

Fig. 5

Memory comparison against the state-of-the-art streaming algorithms for hypergraph partitioning. We show a performance profile and box plots. Note that PaToH-con and PaToH-cut curves are almost coincident in Fig. 5b. The same thing happens for FREIGHT-cut and FREIGHT-con

Memory. We now examine the memory utilization of the algorithms. In the implementation of competitor streaming algorithms, with the exception of Hashing, the prevalent approach involves loading the entire input hypergraph into a specialized in-memory data structure prior to the partitioning process. Due to the non-trivial nature of these algorithms, it is not feasible to re-implement them with an alternative memory strategy. Nevertheless, their memory strategy does not appear to be the bottleneck influencing their overall memory consumption. This observation is supported by the memory footprint data presented by Taşyaran et al. [12], where the algorithms Min-Max-L5, Min-Max-N2P, and REF_RLX_SV consistently exhibit a significantly higher memory usage compared to the reported size of their respective hypergraphs. In contrast, Hashing and FREIGHT are specifically designed to employ memory resources that scale sub-linearly with the hypergraph size and remain independent of the number k of blocks. In our implementation of Hashing, it uses \(O(n+m)\) memory. There exist alternative implementations that require O(1) memory, meaning that decisions are made on the fly and not stored. However, for many applications, it is realistic to assume that at least constant information is kept about all vertices and edges.

Figure 5a presents the distribution of memory consumption, measured in megabytes, for all instances using boxes and whiskers. Consistently, Hashing, FREIGHT-cut, and FREIGHT-con emerge as the three algorithms exhibiting the least memory usage, displaying median memory footprints of 6.0 MB, 6.1 MB, and 6.2 MB, respectively. Conversely, our competitor streaming algorithms, including MM-L5, MM-N2P, and RRS(0.15), exhibit higher median memory consumption, registering values of 28.9 MB, 30.8 MB, and 34.7 MB, respectively. Notably, the in-memory algorithms PaToH-cut, PaToH-con, and HYPE record the highest median memory footprint, utilizing 79.2 MB, 81.1 MB, and 149.3 MB, respectively. The observed memory variance aligns with the same three groups of algorithms exhibiting similar behavior. Specifically, Hashing, FREIGHT-cut, and FREIGHT-con display maximum memory consumption values of 71.3 MB, 137.6 MB, and 138.2 MB, respectively. In contrast, MM-L5, MM-N2P, and RRS(0.15) reach maximum memory consumption levels of 1.9 GB, 2.0 GB, and 2.2 GB, respectively. Finally, PaToH-cut, PaToH-con, and HYPE exhibit maximum memory consumption values of 6.2 GB, 6.2 GB, and 10.8 GB, respectively.

In Fig. 5b, we present memory performance profiles. The memory usage of Hashing is notably minimal, remaining within \(3\%\) of the minimum for approximately \(92\%\) of instances. Similarly, both FREIGHT-cut and FREIGHT-con maintain their memory consumption within \(3\%\) of the minimum for around \(58\%\) of instances. In contrast, this proximity to minimal memory consumption is observed for only \(3\%\) of instances for MM-N2P, and for less than \(1\%\) of instances for all other algorithms. The similarity in memory consumption between FREIGHT and Hashing is noteworthy, especially considering FREIGHT’s superior solution quality compared to Hashing and all other streaming algorithms, including HYPE.

Fig. 6

Runtime comparison against the state-of-the-art streaming algorithm for graph partitioning, Fennel. We show a plot that relates the speedup over Fennel to k and a performance profile plot. Note that the FREIGHT curve is perfectly aligned with the y-axis in Fig. 6b

Table 1 Huge hypergraphs used in experiments
Table 2 Experimental results on huge hypergraphs. Cut(%) represents the percentage of cut nets, while Con(%) indicates the connectivity as a percentage of the number of pins in the hypergraph

4.2 Further Comparisons

Huge Hypergraphs. We now switch to the main use case of streaming algorithms: computing high-quality partitions for huge hypergraphs on small machines. The experiments in this section are run on a relatively small machine with a four-core Intel Xeon E5420 processor running at 2.5 GHz with Ubuntu 20.04.1, 16 GB of main memory, and 24 MB of L2-Cache. The hypergraphs used in these experiments are obtained by transforming very large graphs into hypergraphs using the row-net model [16], which we describe in Sect. 2.1. The used huge graphs were obtained from [38, 39], and [40]. All considered graphs were used as benchmarks in previous works on streaming graph partitioning [7]. Table 1 lists characteristics of the obtained huge hypergraphs. In these experiments, we utilized the same values of k as before, namely \(\{512,1024,1536,2048,2560\}\), but we did not repeat each test multiple times with different seeds as in previous experiments. We tried to run all competitor algorithms. However, with the exception of Hashing, all other competitors failed for all instances since they require more memory than the machine has. We present our detailed results in Table 2. We exclude from Table 2 the IO delay to load the input hypergraph from the disk, since it depends on the disk and is roughly the same independently of k and the used partitioning algorithm, except for Hashing, which does not need to load the hypergraph. For completeness, we report this average delay (in seconds) for the huge hypergraphs listed in Table 1 following their respective order: 37.9, 54.4, 72.9, 80.4, 53.7, 65.6, 98.1.

Consistent with the primary empirical findings outlined in Sect. 4.1, our experiments involving huge hypergraphs underscore the superior performance of FREIGHT-cut and FREIGHT-con when compared to Hashing. Note that Hashing consistently cuts almost all of the hyperedges, as only one pin of a hyperedge needs to be in a different block for it to be cut, and for Hashing the chances of that happening are very high, as also evident from the experiments. However, even for connectivity, for which Hashing computes comparatively better results, the improvements are very large, often even beyond a factor of ten. Regarding runtime, Hashing unsurprisingly stands out as the fastest algorithm across all instances, followed by FREIGHT-cut and subsequently by FREIGHT-con. The runtime measurements for FREIGHT-cut range from 6 to 29 times the corresponding runtime outcomes of Hashing. Additionally, discernible differences in runtime are observed between FREIGHT-cut and FREIGHT-con, particularly evident in the cases of social networks such as twitter7 and soc-friendster, as well as for er-fact1.5s26.

Graph Partitioning. For graph vertex partitioning FREIGHT and Fennel are mathematically equivalent. However, FREIGHT exhibits a lower computational complexity of \(O(m+n)\) compared to the standard implementation of Fennel, which has a complexity of \(O(m+nk)\) due to evaluating all blocks for each vertex. To optimize its performance for this use case, we have implemented an optimized version of FREIGHT with a memory consumption of \(O(n+k)\), matching that of Fennel. In our experiments, we utilized the same graphs as in [8] and tested with \(k \in \{512,1024,1536,2048,2560\}\).
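The O(nk) term of the standard Fennel implementation comes from scoring every block for each incoming vertex. A sketch of that per-vertex loop is shown below; the score follows the Fennel objective of balancing neighbor attraction against a block-size penalty [4], but the function name and parameter values are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Pick a block for one incoming vertex v by scoring all k blocks:
//   score(i) = |N(v) ∩ V_i| - alpha * gamma * |V_i|^(gamma - 1)
// subject to the balance constraint |V_i| < max_size. This O(k) scan per
// vertex is what FREIGHT's sorted-block data structure avoids.
int fennel_pick(const std::vector<int>& neighbors_in_block,  // |N(v) ∩ V_i|
                const std::vector<int>& block_size,          // |V_i|
                double alpha, double gamma, int max_size) {
    int best = -1;
    double best_score = -1e300;
    for (int i = 0; i < static_cast<int>(block_size.size()); ++i) {  // O(k)
        if (block_size[i] >= max_size) continue;  // balance constraint
        double score = neighbors_in_block[i]
                     - alpha * gamma * std::pow(block_size[i], gamma - 1.0);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}
```

FREIGHT computes the same argmax without the loop: only blocks containing neighbors of v plus the single minimum-cardinality block (read in O(1) from the sorted array A) can attain the maximum.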

In Fig. 6a, we show the speedup of FREIGHT over Fennel as a function of k. Note that the speedup of FREIGHT increases for larger values of k, with FREIGHT reaching up to 196x for the largest value of k, which is consistent with the theoretical complexities involved. In Fig. 6b, we show running time performance profiles. The performance of FREIGHT is significantly better than that of Fennel, with a speedup of at least 12x and up to 546x. On average, FREIGHT is 109 times faster than Fennel.

5 Conclusion

In this work, we introduce FREIGHT, a highly efficient and effective streaming algorithm for hypergraph partitioning. Our algorithm leverages an optimized data structure, resulting in linear running time with respect to pin-count and linear memory consumption in relation to the numbers of nets and blocks. The results of our extensive experimentation demonstrate that the running time of FREIGHT is competitive with Hashing, with a geometric mean runtime within a factor of four of it. Importantly, our findings indicate that FREIGHT consistently outperforms all existing (buffered) streaming algorithms and even the in-memory algorithm HYPE with regard to both cut-net and connectivity. This underscores the significance of our proposed algorithm as a highly efficient and effective solution for hypergraph partitioning in the context of large-scale and dynamic data processing.