1 Introduction

Spark [1] is an emerging computing framework that performs computation in memory, which makes it well suited to iterative and interactive data processing and improves the timeliness of data processing in big data environments. Its Spark SQL component combines relational processing with Spark's functional programming API; it can handle SQL statements that cannot be executed in the MapReduce framework and supports real-time data analysis. Data processing, analysis, and query workloads frequently require equal join operations. Spark handles equal joins between small tables, or between a large table and a small table, efficiently, but join operations between two large tables incur considerable time and cost. How to optimize the equal join between large data tables is therefore the focus of this paper.

Literature [2] proposes a two-way join optimization algorithm based on a Bit-map and the Distributed Cache mechanism, which reduces network transmission overhead and filters out some data that does not satisfy the join condition. Literature [3] proposes an improved equal join algorithm: it first filters the datasets to be joined with a Bloom Filter, then samples and analyzes the join attributes using Spark's own reservoir sampling algorithm and the Spark Statistics library; based on the analysis result and a greedy algorithm, one side's table is split, and the joined subsets form the final result. Literature [4] performs a deduplication operation on the Fact table to obtain the Fact UK dataset while recording the position of each tuple, joins Fact UK with the Dim table to create Joined UK, and finally assembles Joined UK with Fact according to the recorded positions to obtain the final result; this method is only suitable for equal joins between a large table and a small table. The scalable hash join algorithm proposed in literature [5] can process coarse-grained distributed data and gives approximately consistent join results even under memory overflow, with very good scalability. The spatio-temporal join algorithm proposed in literature [6] mainly targets large volumes of spatial data; its join operation combines spatial, temporal, and attribute predicates and can solve join problems based on space, time, and attributes.

Aiming at the large network transmission overhead and the data skew problem of equal join operations between big data tables in Spark, this paper proposes an optimized Spark big table equal join strategy. The specific work is as follows:

(1) This paper proposes a Split Compressed Bloom Filter (SCBF) data filtering method to pre-process the data to be joined and reduce the amount of data in the shuffle process.

(2) The Maxdiff histogram is used to compute the data distribution of the join key, in order to find the skewed data.

(3) This paper proposes an RDD splitting mechanism to mitigate the data skew problem in equal join operations.

2 Related Works

2.1 Spark Operation Mode and Architecture

Spark [8] is a full-stack computing platform written in the Scala language for processing large-scale data, with local, standalone, YARN, and other operating modes. This paper conducts its research on the Standalone running mode of Spark. The Spark-Standalone eco-architecture, shown in Fig. 1, consists of four layers: the resource management layer, the data storage layer, the Spark core layer, and the Spark component layer [7, 9]. The Spark component layer and Spark core layer form the basic framework of Spark: Spark Core is the core component and provides the most basic data processing functions, while the Spark component layer provides support for SQL queries, streaming computation, graph computation, and machine learning. The data storage layer mainly consists of Tachyon, HDFS, and HBase; Spark can read data from this layer, which relieves Spark's storage pressure. The resource management layer adopts the standalone mode to dynamically manage and schedule Spark resources and achieve reasonable resource allocation.

Fig. 1. Spark-Standalone architecture.

2.2 Spark Common Join Algorithm Analysis

Spark often uses join operations for data processing and analysis. The join operation in Spark includes joins between two tables and joins among multiple tables; this paper takes the join between two tables as the optimization target. A join essentially connects data with equal key values in two RDD[key, value] datasets, where key is the connection attribute and value holds the other attributes of each tuple. Spark mainly provides three commonly used join algorithms [10, 11]: Broadcast Hash Join, Hash Join, and Sort Merge Join. Broadcast Hash Join is only suitable for joins between small tables or between a large table and a small table. Hash Join and Sort Merge Join are suitable for most join operations, but if the data tables are too large, network communication and I/O costs become very high.
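To make the distinction concrete, the sketch below (with hypothetical toy data; the identifiers JoinExample, big, and small are ours) contrasts a shuffle-based join of two RDD[key, value] pairs with a broadcast-style join in which the small side is collected and shipped to every executor.

```scala
import org.apache.spark.sql.SparkSession

object JoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Two RDD[key, value] pairs; the key is the connection attribute.
    val big   = sc.parallelize(Seq((1, "a1"), (2, "a2"), (2, "a3"), (3, "a4")))
    val small = sc.parallelize(Seq((2, "b1"), (3, "b2")))

    // Shuffle-based join: both sides are repartitioned by key.
    val shuffled = big.join(small) // RDD[(Int, (String, String))]

    // Broadcast-style join: the small side is collected to the driver and
    // broadcast, so the big side joins locally without a shuffle.
    val smallMap = sc.broadcast(small.collectAsMap())
    val broadcasted = big.flatMap { case (k, v) =>
      smallMap.value.get(k).map(w => (k, (v, w)))
    }

    shuffled.collect().foreach(println)
    broadcasted.collect().foreach(println)
    spark.stop()
  }
}
```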

2.3 Bloom Filter

Bloom Filter [12, 22] is a space- and time-efficient data structure that is usually used to test whether an element belongs to a large data set. Its basic idea is as follows: initialize a bit array of length m with every bit set to 0; for a set S = {x1, x2, …, xn}, map each element of the set into the bit array with k independent hash functions and set each mapped position to 1. Assuming that x is an element of the set S, the mapping function of x is shown in formula (1).

$$ h_{i} (x) = y \quad (1 \le i \le k,\; 1 \le y \le m) $$
(1)

Formula (1) indicates that the element x is mapped to position y by the i-th hash function, and bit y is set from 0 to 1. After all elements of the set S have been mapped by the k hash functions, w bits (w < m) of the bit array are set to 1, and the bit array then represents the set S.
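The mapping of formula (1) can be sketched as follows; deriving the k hash functions by double hashing with MurmurHash3 is our assumption, not a choice made by the paper. The class is reused in the SCBF sketch in Sect. 3.1.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Minimal Bloom Filter: k hash functions map each element into an m-bit array.
// Serializable so it can later be shipped inside Spark closures.
class BloomFilter(m: Int, k: Int) extends Serializable {
  private val bits = new mutable.BitSet(m)

  // Double hashing h_i(x) = (h1 + i * h2) mod m derives the k functions of formula (1).
  private def positions(x: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(x)
    val h2 = MurmurHash3.stringHash(x, 0x9747b28c)
    (1 to k).map(i => (((h1 + i * h2) % m) + m) % m)
  }

  // Set the mapped positions to 1.
  def add(x: String): Unit = positions(x).foreach(bits += _)

  // May return false positives, never false negatives.
  def mightContain(x: String): Boolean = positions(x).forall(bits)
}
```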

A Bloom Filter involves three important parameters: the false positive rate f, the number of hash functions k, and the bit array size m. Their calculation formulas are as follows.

$$ f = (1 - (1 - \frac{1}{m})^{kn} )^{k} \approx (1 - e^{ - kn/m} )^{k} $$
(2)
$$ k_{best} = (m/n)\ln 2 $$
(3)
$$ m = - (n\ln p)/(\ln 2)^{2} $$
(4)
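Here n is the number of elements in S, and p in formula (4) is the target false positive rate. A small helper shows how formulas (3) and (4) turn n and p into concrete parameters (a sketch; the rounding choices are ours):

```scala
// Sizing a Bloom Filter: m = -(n ln p) / (ln 2)^2 (formula 4),
// k_best = (m / n) ln 2 (formula 3).
def bloomParameters(n: Long, p: Double): (Int, Int) = {
  val m = math.ceil(-n * math.log(p) / math.pow(math.log(2), 2)).toInt
  val k = math.max(1, math.round((m.toDouble / n) * math.log(2)).toInt)
  (m, k)
}

// Example: bloomParameters(1000000L, 0.01) gives m = 9585059 bits and k = 7.
```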

2.4 Histogram Method

The histogram [13, 14] is a two-dimensional statistical graph that approximately represents the distribution of data, such as its center, spread, and shape, by means of binning. A histogram generally has three basic attributes: the partition constraint, the sorting parameter, and the source parameter, which respectively define how the histogram buckets are partitioned, how the data is sorted, and where the bucket boundaries lie; these attributes are essential to histogram construction. Based on buckets and attribute values, histograms can be divided into the following types: the equal-width histogram, the equal-depth histogram, the V-Optimal histogram [15, 16], and the Maxdiff histogram. Among them, Maxdiff and V-Optimal are the two most accurate histograms for estimating the data distribution, and the Maxdiff histogram is superior to the V-Optimal histogram in both time complexity and space complexity [17], so this paper uses the Maxdiff histogram to estimate the data distribution.

3 Optimization Method of Spark Big Table Equal Join

The Spark large table equal join optimization method proposed in this paper is mainly divided into five stages: (1) connection attribute filtering and statistics, (2) analysis of skew data distribution, (3) RDD segmentation, (4) join operation, and (5) result combination. The symbol names involved in this section are shown in Table 1.

Table 1. Symbol name table.

3.1 Data Filtering Based on Split Compressed Bloom Filter

Split Compressed Bloom Filter.

This paper proposes a Split Compressed Bloom Filter (SCBF) algorithm that combines the splitting idea of the Split Bloom Filter (SBF) [19] with the compression mechanism of the Compressed Bloom Filter (CBF) [18]; it can be applied to data sets of unknown size while keeping the space occupancy low.

The main idea of the algorithm is as follows. Assume the initial bit array size is m, the total number of hash functions is k, and the compressed array size is z. A CBF is used to process the elements of the data set S; whenever the capacity limit of the current CBF bit array is reached, a new CBF bit array of the same size as the initial one is generated, until all elements of S have been represented in bit arrays. When querying whether an element exists in the SCBF, the element is judged to exist as long as its mapped positions are all 1 in one or more of the sub-CBF bit arrays.

The size of the bit array after compression is as shown in Eq. (5).

$$ z = mH(p) $$
(5)
$$ H(p) = - p\log_{2} p - (1 - p)\log_{2} (1 - p) $$
(6)
$$ p \approx e^{ - kn/m} $$
(7)
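Following this description, an SCBF can be sketched as a growable chain of fixed-size sub-filters: when the current sub-filter reaches its capacity, a new one of the same size is opened, and a query probes every sub-filter. The capacity parameter and the reuse of the BloomFilter class from Sect. 2.3 are our assumptions; the entropy compression of formulas (5)-(7) applies when the bit arrays are serialized and is omitted here.

```scala
import scala.collection.mutable.ArrayBuffer

// SCBF sketch: a growable chain of fixed-size sub-filters.
// capacity = maximum number of elements per sub-filter.
class SCBF(m: Int, k: Int, capacity: Int) extends Serializable {
  private val filters = ArrayBuffer(new BloomFilter(m, k))
  private var countInCurrent = 0

  def add(x: String): Unit = {
    if (countInCurrent >= capacity) {   // the current sub-filter is full
      filters += new BloomFilter(m, k)  // open a new sub-filter of the same size
      countInCurrent = 0
    }
    filters.last.add(x)
    countInCurrent += 1
  }

  // The element is (possibly) present if ANY sub-filter reports it.
  def mightContain(x: String): Boolean = filters.exists(_.mightContain(x))
}
```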

SCBF Data Filtering Operation.

In this paper, SCBF is used to compress and filter the two tables to be joined, removing invalid data from both tables and thereby reducing the amount of data shuffled during the table join. The specific process is as follows:

(1) Extract the connection attribute of each of the two tables to be joined and deduplicate it, so that each key value appears exactly once in each set of connection attributes;

(2) Use SCBF to compress the connection attributes of the two tables, obtaining two bit arrays;

(3) Perform a logical AND of the two newly generated bit arrays to obtain the final bit array SCBFfinal;

(4) Use SCBFfinal to filter the two data tables, producing two new data tables (see the sketch after this list).
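A driver-side sketch of the four steps is given below. For simplicity it collects the deduplicated keys to the driver and expresses the bit-level AND of step (3) as membership in both filters, which is what a lookup in SCBFfinal reports when the sub-filter arrays line up; the table schema RDD[(Int, String)] is hypothetical.

```scala
import org.apache.spark.rdd.RDD

// tableA, tableB: the two RDD[key, value] tables to be equi-joined.
def scbfFilter(tableA: RDD[(Int, String)], tableB: RDD[(Int, String)],
               m: Int, k: Int, capacity: Int): (RDD[(Int, String)], RDD[(Int, String)]) = {
  // (1) Extract the connection attribute of each table and deduplicate it.
  val keysA = tableA.keys.distinct().collect()
  val keysB = tableB.keys.distinct().collect()

  // (2) Represent each key set as an SCBF.
  val scbfA = new SCBF(m, k, capacity)
  keysA.foreach(x => scbfA.add(x.toString))
  val scbfB = new SCBF(m, k, capacity)
  keysB.foreach(x => scbfB.add(x.toString))

  // (3) + (4) A tuple survives only if its key may appear in BOTH filters,
  // which is what a lookup in the AND-ed array SCBFfinal would report.
  val newA = tableA.filter { case (key, _) => scbfB.mightContain(key.toString) }
  val newB = tableB.filter { case (key, _) => scbfA.mightContain(key.toString) }
  (newA, newB)
}
```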

The processing flow at this stage is shown in Fig. 2.

Fig. 2. SCBF filter flow.

3.2 Skew Data Distribution Statistics Based on Maxdiff Histogram

Maxdiff Histogram.

The Maxdiff histogram [20, 21] takes the key values extracted from the filtered data tables as the sorting parameter and the frequency difference fd between adjacent key values as the source parameter to represent the data distribution. Adjacent key values with a large frequency difference are placed into different buckets, so that key values within the same bucket have similar frequencies. Through the bucketing of the Maxdiff histogram, the skewed data in the two data tables can be identified.

The frequency difference fd is calculated as in formula (8).

$$ fd_{ij} = f_{i} - f_{j} $$
(8)

\( f_{i} \) and \( f_{j} \) represent the frequencies of \( key_{i} \) and \( key_{j} \), respectively.

A Maxdiff histogram is shown in Fig. 3.

Fig. 3. Maxdiff histogram.

Statistical Method for Skew Data Distribution.

The Maxdiff histogram is used to analyze the skew data distribution, and its specific steps are as follows:

(1) Extract the connection attributes from RDDA_new and RDDB_new respectively, and sample them with Spark's Sample operator.

(2) From the sampling results of the two sets of connection attributes, compute the frequency f of each key value in each table, and construct a Maxdiff histogram for each set of connection attributes.

(3) Identify the most skewed key values of the two tables from the bucket boundaries of the Maxdiff histograms (a sketch of this procedure follows the list).
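The sketch below implements these steps in a simplified form: it samples the join keys, counts frequencies, places Maxdiff bucket boundaries at the largest adjacent frequency differences from formula (8), and flags keys in buckets whose mean frequency exceeds the overall mean as skewed. The sampling fraction, the bucket count, and that flagging rule are illustrative assumptions, not the paper's exact procedure.

```scala
import org.apache.spark.rdd.RDD

// Returns the keys judged skewed, via Maxdiff-style bucketing of sampled frequencies.
def skewKeys(table: RDD[(Int, String)], fraction: Double, buckets: Int): Set[Int] = {
  // (1) Sample the connection attribute with the Sample operator and count frequencies.
  val freq = table.keys.sample(withReplacement = false, fraction)
    .map(key => (key, 1L)).reduceByKey(_ + _)
    .collect().sortBy(_._1)            // (key, f), sorted by key value
  if (freq.length < 2) return freq.map(_._1).toSet

  // (2) Maxdiff: bucket boundaries sit at the (buckets - 1) largest adjacent
  //     frequency differences fd = |f_i - f_j| (formula 8).
  val diffs = freq.sliding(2).zipWithIndex
    .map { case (pair, i) => (math.abs(pair(0)._2 - pair(1)._2), i) }.toArray
  val cuts = diffs.sortBy(-_._1).take(buckets - 1).map(_._2 + 1).sorted
  val bounds = 0 +: cuts :+ freq.length

  // (3) Flag keys in buckets whose mean frequency exceeds the overall mean.
  val overallMean = freq.map(_._2).sum.toDouble / freq.length
  bounds.sliding(2).flatMap { case Array(lo, hi) =>
    val bucket = freq.slice(lo, hi)
    val mean = bucket.map(_._2).sum.toDouble / math.max(1, bucket.length)
    if (mean > overallMean) bucket.map(_._1) else Array.empty[Int]
  }.toSet
}
```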

The process of counting the data distribution based on the Maxdiff histogram is shown in Fig. 4.

Fig. 4. The flow of counting data distribution.

3.3 RDD Split and Result Combination

RDD splitting, joining, and result combination are completed in three steps, as follows:

(1) Split RDDA_new and RDDB_new according to the skew distribution of the two data tables. First, the skew keys of the two RDDs are combined to obtain the full skew key set SkewAB; then the tuples corresponding to SkewAB in each RDD are split out to generate new RDDs, and the remaining tuples, corresponding to ordinary keys, form another new RDD. Suppose there are i skew keys in RDDA_new and j skew keys in RDDB_new. The data corresponding to each skew key value generates a new RDD, yielding RDDA_skew1, RDDA_skew2, …, RDDA_skewm and RDDB_skew1, RDDB_skew2, …, RDDB_skewm, where \( \min \{ i, j \} \le m \le i + j \). The remaining data of RDDA_new generates RDDA_original, and the remaining data of RDDB_new generates RDDB_original.

(2) RDDs with the same key value are joined. The split RDDs fall into two categories: joins between two big tables, and joins between a big table and a small table. The Hash Join method is used to connect two large tables, and the Broadcast Join method is used when one table is large and the other is small.

(3) All the join results are combined with the union operator; the combined result is the join result of the two data tables (a condensed sketch follows this list).
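The condensed sketch below covers the three steps. For brevity it builds one filtered RDD per skew key directly from the full RDDs, picks a broadcast-style join when the skewed side fits under a size threshold and an ordinary hash-partitioned join otherwise, and unions the partial results; the threshold broadcastLimit, the schemas, and the function name splitJoin are our assumptions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def splitJoin(sc: SparkContext,
              a: RDD[(Int, String)], b: RDD[(Int, String)],
              skewAB: Set[Int], broadcastLimit: Long): RDD[(Int, (String, String))] = {
  val skew = sc.broadcast(skewAB)

  // (1) Split: the ordinary-key remainder of each table...
  val aOriginal = a.filter { case (key, _) => !skew.value.contains(key) }
  val bOriginal = b.filter { case (key, _) => !skew.value.contains(key) }

  // ...plus one RDD pair per skew key.
  val skewJoins = skewAB.toSeq.map { key =>
    val aSkew = a.filter(_._1 == key)
    val bSkew = b.filter(_._1 == key)
    // (2) Choose the join method per pair: a broadcast-style join when one
    //     side is small enough, otherwise an ordinary hash-partitioned join.
    if (bSkew.count() <= broadcastLimit) {
      val smallSide = sc.broadcast(bSkew.values.collect())
      aSkew.flatMap { case (k, v) => smallSide.value.map(w => (k, (v, w))) }
    } else {
      aSkew.join(bSkew)
    }
  }

  // (3) Union all partial results into the final join result.
  skewJoins.foldLeft(aOriginal.join(bOriginal))(_ union _)
}
```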

The flow of RDD split, RDD join, and result combination is shown in Fig. 5.

Fig. 5. The flow of RDD split and result combination.

4 Experimental Result and Analysis

The Spark big table equal join method proposed in this paper is verified experimentally in a Spark cluster environment from three aspects: (1) the effectiveness of the SCBF algorithm for data filtering; (2) the effectiveness of the RDD splitting mechanism for processing skewed data; (3) the effectiveness of the Spark big table equal join method in terms of overall task running time. All experiments compare and analyze three data sets of different sizes; the data set sizes are shown in Table 2.

Table 2. Experimental data set.

4.1 Experimental Environment Configuration

The optimization algorithm proposed in this paper is verified on a Spark cluster. The cluster was set up on one desktop computer and two laptop computers: the desktop and one laptop run the Windows 7 operating system, and the other laptop runs Windows 10. The Spark cluster consists of six nodes: one node serves as the master node, and the other five serve as slave nodes.

4.2 Data Filtering Comparison Experiment

The two data tables to be joined are first filtered by the SCBF method, and then a hash join is performed on the original data tables and on the filtered tables, respectively. The shuffle read and shuffle write results of the shuffle phase are shown in Tables 3 and 4.

Table 3. Shuffle read.
Table 4. Shuffle write.

It can be seen from Tables 3 and 4 that, compared with performing a Hash Join on the data sets directly, the proposed method, which first filters the two tables with SCBF and then performs the Hash Join, yields markedly lower shuffle read and shuffle write volumes and has an obvious advantage.

4.3 Data Skew Degree Comparison Experiment

We compare the data skew degree of the two data sets before and after RDD splitting; the experimental results are shown in Fig. 6. The data skew degree is given in Definition 1.

Fig. 6. Data skew comparison.

Definition 1. Data skew degree DS. This paper uses the standard deviation formula to measure the skew degree of a data partitioning; the smaller DS is, the more balanced the data partitions are. The calculation formulas are as follows.

$$ DS = \frac{{\sum\limits_{i = 1}^{m} {DS_{i} } }}{m} $$
(9)
$$ DS_{i} = \sqrt {\frac{{\sum\limits_{j = 1}^{n} {(x_{j} - \bar{x})^{2} } }}{n}} $$
(10)

Where m represents the number of RDDs, DSi represents the data skew of the i-th RDD, n represents the number of partitions of the i-th RDD, xj represents the amount of data in partition j, and \( \bar{x} \) represents the mean partition size.
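Both formulas can be computed directly from partition sizes; counting sizes with mapPartitions is a standard approach, and the function names below are ours.

```scala
import org.apache.spark.rdd.RDD

// DS_i of one RDD: the standard deviation of its partition sizes (formula 10).
def partitionSkew(rdd: RDD[_]): Double = {
  val sizes = rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()
  val mean = sizes.sum.toDouble / sizes.length
  math.sqrt(sizes.map(x => math.pow(x - mean, 2)).sum / sizes.length)
}

// DS: the average of DS_i over all m RDDs (formula 9).
def dataSkewDegree(rdds: Seq[RDD[_]]): Double =
  rdds.map(partitionSkew).sum / rdds.length
```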

It can be seen from Fig. 6 that the split data sets have a lower data skew degree and a more balanced data distribution than the un-split data sets. Therefore, the RDD split method proposed in this paper mitigates the data skew problem encountered in the Spark join process.

4.4 Task Running Time Comparison Experiment

The Spark big table equal join optimization method proposed in this paper and Spark’s own join method are used to connect the same data set. The running time of the two methods is shown in Fig. 7.

Fig. 7. Running time comparison.

It can be seen from Fig. 7 that when the amount of data is small, the method proposed in this paper differs little in running time from Spark's own join algorithm. As the amount of data grows, the proposed big table equal join optimization method takes less time and shows a clear advantage over Spark's own algorithm. Therefore, the proposed join optimization method is superior to Spark's own join algorithm when dealing with big table equal join problems.

5 Conclusions

Spark is one of the mainstream frameworks for big data processing and is of great significance for the processing and analysis of large-scale data. However, Spark has shortcomings: when it handles the join operation between two large tables, its efficiency is low and the cost is high. Aiming at this problem, this paper proposes a big table equal join method based on filtering and splitting. First, the Split Compressed Bloom Filter algorithm filters the data sets to be joined; then the Maxdiff histogram is used to obtain the skew data distribution and the RDDs are split accordingly; finally, the split data are joined and the results merged. The method reduces both the amount of data in the shuffle phase of the join operation and the overall running time, which greatly improves the efficiency of the Spark big table equal join and strengthens Spark's big data processing capability. The research in this paper is not yet complete: how to split RDDs efficiently remains to be studied and improved.