
1 Introduction

A spatial interpolation algorithm is a method in which the attributes at known locations (data points) are used to predict the attributes at unknown locations (interpolated points). Commonly used spatial interpolation algorithms include Inverse Distance Weighting (IDW) [15], Kriging [24], the Moving Least Squares (MLS) method [19], and Radial Basis Function (RBF) interpolation [5,6,7]. These interpolation methods are widely used in various scientific fields, such as Geographic Information Systems (GIS) [9, 10], geometric modeling [2, 11], image processing [8, 18], and numerical analysis [25, 27].

Interpolation algorithms are also widely used in life science applications. Liu et al. [13] proposed a hybrid approach to shape-based interpolation of stereotactic atlases of the human brain. Volkau et al. [26] combined a minimal distance map and cubic splines to reconstruct the subcortical structures of the Talairach-Tournoux atlas. Parrot et al. [23] focused on the interpolation of scalar values in the 3-D grid of input data. Pan et al. [22] compared filter interpolation, ordinary interpolation, and general partial volume interpolation in medical image interpolation.

In large-scale spatial interpolation, computational efficiency is commonly improved by using a local set of data points, rather than the global set, to predict the value at each interpolated point. A local set of data points therefore needs to be found for each interpolated point, typically with an approach such as the k Nearest Neighbor (kNN) search procedure.

For example, Li et al. [12] proposed the Random kNN (a novel generalization of traditional nearest-neighbor modeling) for pattern analysis and modeling with high-dimensional data. Al Aghbari [1] studied techniques for processing multiple kNN queries in constrained spatial networks. Nutanong [20] studied an efficient algorithm for moving k Nearest Neighbor queries. Cavoretto [4] proposed an efficient scheme for the computation of the triangular Shepard method. Mei [17] presented an efficient AIDW interpolation algorithm on the GPU by utilizing a fast kNN search method.

Space decomposition data structures such as the RP-tree [21], VP-tree [14], k-d tree [3], and uniform grid [17] are employed to accelerate the kNN search procedure. Among these structures, the uniform grid is the simplest. A critical issue in creating a uniform grid is the size of the grid cell, which can strongly affect the search efficiency and can be neither too small nor too large. To the best of our knowledge, there is currently no research work specifically focusing on determining the optimal size of the grid cell in the kNN search.

Based on our previous work [7, 16, 17], in this paper we first evaluate the effect of the size of the uniform grid cell on the efficiency of the kNN search, and then attempt to find the relatively optimal size of the grid cell by considering the distribution of scattered points.

This paper is organized as follows. Section 2 briefly describes the kNN search that is commonly used in spatial interpolation. Section 3 introduces our benchmark tests. Section 4 presents and discusses the test results. Finally, Sect. 5 draws several conclusions.

2 Background: kNN Search in Spatial Interpolation

The kNN search algorithm is directly derived from our previous work [7, 16, 17]. The search process consists of the following four steps.

Step 1: Creating an even grid

Creating an even planar grid is straightforward. We first determine the planar rectangular region to be partitioned by finding the minimum and maximum x and y coordinates of all points. Then, the numbers of rows and columns of the grid are determined by dividing the side lengths of the rectangle by the width of the square cell; see a simple illustration in Fig. 1.

Fig. 1. Creation of an even grid according to the minimum and maximum coordinates of all the data points and interpolated points in two dimensions. (This figure is directly derived from our previous work [17].)
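Step 1 can be illustrated with a minimal Python sketch. The function name `create_even_grid` and the choice to add one extra row and column (so that points lying exactly on the maximum border still fall inside a cell) are assumptions made for illustration, not details taken from the original implementation.

```python
def create_even_grid(points, cell_width):
    """Return (min_x, min_y, n_rows, n_cols) of a square-cell grid covering all points.

    `points` is a sequence of (x, y) pairs; `cell_width` is the side of a square cell.
    """
    if cell_width <= 0:
        raise ValueError("cell_width must be positive")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Divide the side lengths of the bounding rectangle by the cell width.
    # The "+ 1" (an assumption of this sketch) keeps points on the maximum
    # border inside the last column or row.
    n_cols = int((max_x - min_x) // cell_width) + 1
    n_rows = int((max_y - min_y) // cell_width) + 1
    return min_x, min_y, n_rows, n_cols
```

For example, three points spanning 10 units in x and 7 units in y with a cell width of 2.5 yield a grid of 3 rows and 5 columns.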

Step 2: Distributing data points into cells

The objective of this step is to find the grid cell in which each data point is located, that is, to determine the row and column indices of that cell. Since the grid cells are indexed sequentially, first by rows and then by columns, the distribution can be carried out easily. First, the differences between the coordinates of a data point and the minimum coordinates of the grid are calculated; then, the column and row indices are obtained by dividing these differences by the cell width and rounding down to an integer.
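The index computation in Step 2 can be sketched as follows. The function name `cell_of`, the clamping of indices to the grid bounds, and the row-major linear index `row * n_cols + col` are illustrative assumptions; the text only states that cells are indexed first by rows and then by columns.

```python
def cell_of(point, min_x, min_y, cell_width, n_rows, n_cols):
    """Return (row, col, linear_index) of the cell containing `point`."""
    # Differences between the point and the minimum coordinates of the grid,
    # divided by the cell width and rounded down.
    col = int((point[0] - min_x) // cell_width)
    row = int((point[1] - min_y) // cell_width)
    # Clamp to the grid (assumption: guards against round-off at the borders).
    col = min(max(col, 0), n_cols - 1)
    row = min(max(row, 0), n_rows - 1)
    # Row-major linear index (assumed layout).
    return row, col, row * n_cols + col
```

With a grid of 3 rows and 5 columns, origin (0, 0), and cell width 2.5, the point (10, 5) falls into row 2, column 4, which corresponds to linear index 14 under the assumed layout.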

Step 3: Determining data points in each cell

The objective of this step is to determine the number and the indices of the data points located in each cell. The number of data points located in the same cell can be determined with a segmented parallel reduction. After all data points have been sorted according to their cell indices, they are stored sequentially in a group of segments; each segment is flagged with a cell index and contains the indices of the data points located in that cell. The number of data points in each cell is then obtained by performing a reduction over the corresponding segment. Moreover, the head index of the first point of each segment can be determined using a segmented parallel scan.
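The following is a serial Python stand-in for Step 3, intended only to show what the sort, the per-segment reduction, and the scan produce. It is not the segmented parallel implementation described above, and the function name `group_points_by_cell` is hypothetical.

```python
from itertools import accumulate


def group_points_by_cell(cell_ids, n_cells):
    """Group point indices by cell.

    Returns (order, counts, heads) such that the points in cell c are
    order[heads[c] : heads[c] + counts[c]].
    """
    # Sort point indices by cell index (sorted() is stable).
    order = sorted(range(len(cell_ids)), key=lambda i: cell_ids[i])
    # Reduction per segment: number of points in each cell.
    counts = [0] * n_cells
    for cid in cell_ids:
        counts[cid] += 1
    # Exclusive prefix sum (scan): head index of each segment.
    heads = [0] + list(accumulate(counts))[:-1]
    return order, counts, heads
```

For cell indices `[2, 0, 2, 1, 0]` and four cells, the sketch returns the order `[1, 4, 3, 0, 2]`, the counts `[2, 1, 2, 0]`, and the head indices `[0, 2, 3, 5]`.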

Step 4: Searching nearest neighbors

The process of the kNN search for each interpolated point can be summarized in the following substeps: (1) locating the interpolated point in the even grid, (2) determining the level of cell expansion (see Fig. 1), and (3) finding the k nearest neighbors within the local region. More details on searching the nearest neighboring data points for each interpolated point were presented in our previous work [17].
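The three substeps can be sketched in plain Python as below. This is a minimal serial illustration under stated assumptions, not the implementation from [17]: the names `build_grid` and `knn_grid` are hypothetical, and the stopping rule (stop once the k-th candidate is no farther than `level * cell_width`, or once the whole grid has been covered) is one simple way to guarantee exact results.

```python
import math
from collections import defaultdict


def build_grid(data, cell_width):
    """Bin 2D data points into square cells keyed by (row, col)."""
    min_x = min(p[0] for p in data)
    min_y = min(p[1] for p in data)
    n_cols = int((max(p[0] for p in data) - min_x) // cell_width) + 1
    n_rows = int((max(p[1] for p in data) - min_y) // cell_width) + 1
    cells = defaultdict(list)
    for i, (x, y) in enumerate(data):
        row = int((y - min_y) // cell_width)
        col = int((x - min_x) // cell_width)
        cells[(row, col)].append(i)
    return min_x, min_y, n_rows, n_cols, cells


def knn_grid(query, data, k, cell_width):
    """Return the indices of the k data points nearest to `query`, closest first."""
    k = min(k, len(data))
    if k <= 0:
        return []
    min_x, min_y, n_rows, n_cols, cells = build_grid(data, cell_width)
    # (1) Locate the query in the grid; queries outside the grid are clamped.
    q_col = min(max(int((query[0] - min_x) // cell_width), 0), n_cols - 1)
    q_row = min(max(int((query[1] - min_y) // cell_width), 0), n_rows - 1)
    candidates = []
    level = 0
    while True:
        # (2) Expand by one level: add the ring of cells at this level.
        for row in range(q_row - level, q_row + level + 1):
            for col in range(q_col - level, q_col + level + 1):
                if max(abs(row - q_row), abs(col - q_col)) == level:
                    candidates.extend(cells.get((row, col), ()))
        covers_grid = (q_row - level <= 0 and q_col - level <= 0
                       and q_row + level >= n_rows - 1
                       and q_col + level >= n_cols - 1)
        if len(candidates) >= k:
            # (3) Sort the local candidates and keep the k closest.
            candidates.sort(key=lambda i: math.dist(query, data[i]))
            kth_distance = math.dist(query, data[candidates[k - 1]])
            # Points outside the searched block are at least level * cell_width
            # away, so the result is final once the k-th distance is within it.
            if covers_grid or kth_distance <= level * cell_width:
                return candidates[:k]
        level += 1
```

The sketch rebuilds the grid on every call to stay short; in practice the grid would be built once and reused for all interpolated points. It also makes the trade-off visible: a large `cell_width` makes the sort in substep (3) expensive, while a small one increases the number of expansion levels in substep (2).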

3 Methods

In large-scale spatial interpolation, a local set of data points is typically used to predict the value at each interpolated point. The process therefore commonly consists of two procedures: (1) the kNN search procedure, and (2) the interpolating procedure. An efficient kNN search procedure helps to improve the computational efficiency of the entire spatial interpolation process.

In the kNN search based on a uniform grid, a critical step is to determine the size of the grid cell and then create the even grid. When searching for the k nearest data points, the level of grid cells around the interpolated point is expanded repeatedly until the required number of data points is found. When the data points are dense, the grid cell may be too large and contain too many points. In this case, the number of data points located in the current level of grid cells far exceeds the required k, and the redundant data points must be removed by sorting, which may incur significant extra computational cost. In contrast, if the grid cell is very small, the search region must be expanded several times to cover enough data points, and this expansion may also incur significant extra computational cost.

In summary, the size of the uniform grid cell can strongly affect the computational efficiency of the kNN search procedure, and it can be neither too large nor too small. Our objective in this paper is to find the relatively optimal size of the grid cell by considering the distribution of scattered points.

Several factors may affect the determination of the grid cell size: the value of k, the density of the data points, and two metrics of the data distribution (i.e., the mean and the Standard Deviation of the points’ coordinates).

The basic idea in this paper is as follows. We first analyze the efficiency of the kNN search while changing the size of the grid cells, and then discuss the influence of each factor on the size of the grid cells. Finally, we fit the relationships between these factors and the relatively optimal size of the grid cell.

In the original formula, the size of the grid cell is constant for a given distribution of data points. In this paper, we multiply the original formula by a coefficient w. The original formula for calculating the cellWidth in two dimensions is given in Eq. (1), and the formula used for changing the cellWidth in two dimensions is given in Eq. (2).

(1)
$$\begin{aligned} cellWidth_{used}^{2D} =w_{2D} \times cellWidth_0^{2D} \end{aligned}$$
(2)

where \(cellWidth_0^{2D} \) is the size of the original grid cell in two dimensions, \(cellWidth_{used}^{2D} \) is the size of the used grid cell in two dimensions, \(w_{2D} \) is the coefficient in two dimensions, \(dnum_{2D} \) is the number of known data points in two dimensions, \(A_{Box} \) is the area of the bounding box, and \(V_{Box} \) is the volume of the bounding box. The relationship between each factor and the coefficient w of the grid cell size is discussed in the following section.
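The effect of the coefficient w in Eq. (2) can be illustrated with a short Python sketch. The original cell width is passed in as a plain parameter `cell_width_0` rather than computed, and the helper names, the "+ 1" in the grid shape, and the uniform-density estimate of points per cell are assumptions made only for illustration.

```python
def scaled_cell_width(cell_width_0, w):
    """Eq. (2): cellWidth_used = w * cellWidth_0."""
    if cell_width_0 <= 0 or w <= 0:
        raise ValueError("cell_width_0 and w must be positive")
    return w * cell_width_0


def grid_shape(extent_x, extent_y, cell_width):
    """Number of (rows, cols) needed to cover a bounding box of the given extent."""
    return int(extent_y // cell_width) + 1, int(extent_x // cell_width) + 1


def mean_points_per_cell(n_points, box_area, cell_width):
    """Average points per cell, assuming uniformly spread points (illustration only)."""
    return n_points * cell_width ** 2 / box_area
```

For a hypothetical original width of 10 on a 1000 by 1000 bounding box, w = 1 gives a grid of 101 by 101 cells, whereas w = 3 gives a cell width of 30 and a grid of 34 by 34 cells. A larger w thus means fewer expansion levels but more candidate points to sort per cell.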

4 Results and Discussion

4.1 Benchmark Environment and Testing Data

We carry out five groups of benchmark tests in two dimensions on a workstation computer. The specifications of the employed workstation are listed in Table 1.

Table 1. Specifications of the employed workstation computer for performing benchmark tests

For each group of the two-dimensional testing data, each set of data points is created by randomly distributing points on a parametric surface, whose equation is given in Eq. (3). More specifically, both the x and y coordinates are randomly generated in the range of 0−1000, and the associated value is then calculated according to Eq. (3). The five sets of interpolated points are generated in the same way: both the x and y coordinates of each interpolated point are randomly generated in the range of 0−1000.

$$\begin{aligned} \begin{array}{c} f\left( {x,y} \right) =750\exp \left[ {\frac{\left( {9x/1000-2} \right) ^2+\left( {9y/1000-2} \right) ^2}{4}} \right] \\ +750\exp \left[ {\frac{\left( {9x/1000+1} \right) ^2}{49}+\frac{\left( {9y/1000+1} \right) }{10}} \right] \\ -200\exp \left[ {\left( {9x/1000-4} \right) ^2+\left( {9y/1000-7} \right) ^2} \right] \\ +500\exp \left[ {\frac{\left( {9x/1000-7} \right) ^2+\left( {9y/1000-3} \right) ^2}{4}} \right] \\ \end{array} \end{aligned}$$
(3)

4.2 Benchmark Results in Two-Dimensions

The test data in two dimensions are listed in Table 2, including the number of irregularly distributed data points and the number of interpolated points. For the irregularly distributed data points, the number of interpolated points is the same.

Table 2. Test data in two-dimensional benchmark tests

Influence of the Value of k on the Relatively Optimal Coefficient w of Grid Cell Size for Irregularly Distributed Scattered Points. This subsection discusses the effect of different k values and different point densities on the relatively optimal coefficient w of the grid cell size for irregularly distributed scattered points. For the irregular spatial distribution used here, the mean value is 500 and the Standard Deviation is 166. In the benchmark tests, the k values were specified as 10, 20, 50, 100, and 200.

Fig. 2. Influence of the value of k on the coefficient w of grid cell size for irregularly distributed scattered points.

The benchmark results illustrated in Fig. 2 indicate that, when the point density is set to Size 1 and w is approximately 3.0, the highest efficiency is achieved for all values of k. Moreover, the trends of the fitted curves are similar for different values of k. For the other four point densities (i.e., the sizes of the data point sets), almost the same conclusions can be drawn. It can be concluded that the value of k has only a weak effect on the relatively optimal coefficient w of the grid cell size for irregularly distributed scattered points; see Fig. 3.

Fig. 3. The relatively optimal coefficient w of grid cell size when setting different values of k for irregularly distributed scattered points.

Influence of Point Density on the Relatively Optimal Coefficient w of Grid Cell Size for Irregularly Distributed Scattered Points. This subsection discusses the relationship between the point density and the coefficient w of the grid cell size for fixed values of k. In the benchmark tests, the point densities were specified as Size 1, Size 2, Size 3, Size 4, and Size 5 for irregularly distributed scattered points.

Fig. 4. Influence of point densities on the coefficient w of grid cell size for irregularly distributed scattered points.

Fig. 5. The relatively optimal coefficient w of grid cell size when setting different point densities for irregularly distributed scattered points.

The benchmark results illustrated in Fig. 4 indicate that, when the value of k is set to 10 and w is approximately 3.0, the highest efficiency is achieved for all point densities. Moreover, the trends of the fitted curves are similar for different point densities. For the other values of k, almost the same conclusions can be drawn for irregularly distributed scattered points. It can be concluded that the point density has only a weak effect on the relatively optimal coefficient w of the grid cell size for irregularly distributed scattered points; see Fig. 5.

Influence of Mean of Points’ Coordinates on the Relatively Optimal Coefficient w of Grid Cell Size for Irregularly Distributed Scattered Points. This subsection discusses the relationship between the mean of the points’ coordinates and the coefficient w of the grid cell size, with the other factors fixed. In the benchmark tests, the mean of the points’ coordinates was specified as (400,400), (600,400), (600,600), and (400,600). The number of data points is 67766, the number of interpolated points is 72301, the Standard Deviation is 200, and the value of k is 50.

Fig. 6. Influence of mean on the relatively optimal coefficient w of grid cell size for irregularly distributed scattered points.

The benchmark results illustrated in Fig. 6 indicate that the trends of the fitted curves are similar for different means of the points’ coordinates, and that the relatively optimal coefficient w of the grid cell size, which corresponds to the highest efficiency, is close to 2.5 in each case. It can be concluded that the mean of the points’ coordinates has only a weak effect on the relatively optimal coefficient w of the grid cell size for irregularly distributed scattered points.

Fig. 7. The fitted curve indicating the relationships between the Standard Deviation and the relatively optimal coefficient w of grid cell size in two-dimensions.

Influence of Standard Deviation of Points’ Coordinates on the Relatively Optimal Coefficient w of Grid Cell Size for Irregularly Distributed Scattered Points. This subsection discusses the relationship between the Standard Deviation of the points’ coordinates and the coefficient w of the grid cell size, with the other factors fixed. In the benchmark tests, the number of data points is 67766, the number of interpolated points is 72301, the mean of x and y is 500, and the value of k is 50. The Standard Deviation was specified as 100, 130, 160, 190, 200, 250, 300, 350, and 400. The benchmark results indicate that, as the Standard Deviation of the points’ coordinates increases, the relatively optimal size of the grid cell decreases and eventually converges; see Table 3. We have also fitted the relationship between the Standard Deviation of the scattered points’ coordinates and the relatively optimal size of the grid cell in two dimensions; see Fig. 7. The fitted relationship is described in Eq. (4).

$$\begin{aligned} w=\frac{4.09003}{1+\left( {\sigma +178.4079} \right) ^{6.20917}}+1.00546 \end{aligned}$$
(4)
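A minimal Python sketch for evaluating Eq. (4) is given below. Evaluated exactly as printed, the expression is numerically indistinguishable from the constant 1.00546 over the tested range of σ (100 to 400), because the denominator is extremely large. The sketch therefore also includes a second function that reads the fit as a four-parameter logistic curve with σ/178.4079 in place of σ+178.4079. This second reading is purely an assumption made for illustration and is not confirmed by the text; both function names are hypothetical.

```python
def w_as_printed(sigma):
    """Eq. (4) evaluated literally, with (sigma + 178.4079) in the denominator."""
    return 4.09003 / (1.0 + (sigma + 178.4079) ** 6.20917) + 1.00546


def w_logistic_reading(sigma):
    """ASSUMPTION: logistic form with (sigma / 178.4079); not confirmed by the source."""
    return 4.09003 / (1.0 + (sigma / 178.4079) ** 6.20917) + 1.00546


if __name__ == "__main__":
    # Standard Deviations used in the benchmark tests.
    for sigma in (100, 130, 160, 190, 200, 250, 300, 350, 400):
        print(sigma, w_as_printed(sigma), w_logistic_reading(sigma))
```

Under the assumed logistic reading, w decreases monotonically as σ increases and approaches 1.00546, which matches the qualitative behavior described above. The value obtained from either function would then be used as the coefficient w in Eq. (2).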

To evaluate the goodness of fit, we use the Coefficient of Determination (COD). The COD of the fitted equation is 0.98192, which indicates that the fit is good.
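For reference, the Coefficient of Determination is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes this standard definition; the function name is hypothetical, and the sample values in the usage note are made up for illustration and are not the data of Table 3.

```python
def coefficient_of_determination(observed, predicted):
    """Return 1 - SS_res / SS_tot for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; COD is undefined")
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives a value of 1.0, while a model that always predicts the mean of the observations gives 0.0; for example, `coefficient_of_determination([1, 2, 3], [2, 2, 2])` evaluates to 0.0.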

Table 3. The optimal coefficient w of grid cell size corresponding to different Standard Deviations in two-dimensions

5 Conclusion

In this paper, we have investigated the effect of the decomposition of the uniform grid on the computational efficiency of the kNN search procedure used in spatial interpolation. More precisely, we have evaluated the influence of the size of the grid cell on the efficiency of the kNN search procedure, with the objective of finding a relatively optimal size of the grid cell. We have performed several series of benchmark tests based on irregularly distributed scattered points, and found that the distribution of the scattered points, as measured by the Standard Deviation of the points’ coordinates, has a strong influence on the relatively optimal size of the grid cell. More specifically, the benchmark results indicate that, in two dimensions, as the Standard Deviation of the points’ coordinates increases, the relatively optimal size of the grid cell decreases and eventually converges. We have also fitted the relationship between the Standard Deviation of the scattered points’ coordinates and the relatively optimal size of the grid cell; the COD of the fitted equation is 0.98192, which indicates that the fit is good. The fitted relationship could be employed to determine the relatively optimal grid cell size in the kNN search and, further, to improve the computational efficiency of spatial interpolation, which is commonly used in geometric modeling for life science applications.

In this paper, we have only evaluated the effect of the size of the grid cell on the efficiency of the kNN search executed on the CPU. The kNN search procedure contains several logic routines, and it is well known that the same logic routines executed on the CPU and on the GPU may exhibit dramatically different efficiencies. Thus, the relationship between the distribution of scattered points and the relatively optimal size of the grid cell obtained on the CPU may differ from that obtained on the GPU. We will address this problem in future work.