1 Introduction

Data volumes are steadily growing to terabytes and petabytes and are expected to lead to Exascale computing [1]. Array based models such as the Conventional Multidimensional Array (CMA) can dominate other structures because they are easy to maintain, but the CMA is not scalable. The Index Array model [2, 3] overcomes this limitation by dynamically allocating memory at run time in the form of subarrays (SA). However, it cannot meet the memory utilization demanded by large data volumes, especially for “Big Data” applications [1], because of address space overflow. In addition, large volumes of data are difficult to visualize. The structure described in [4] enhances the data volume capacity of the Index Array by handling address space overflow. It also makes large volumes of data easier to visualize by representing n dimensions in only 2 dimensions. The most challenging task in large volume applications, however, is extracting useful information, since large volumes entail sparsity [1]. Data encoding is an effective way to preserve only those cells that are meaningful and non-empty [5]. In this paper, we propose an encoding scheme based on the SAI scheme [4]. The scheme is a segment based structure that encodes the 2D indices of the SAI structure. We call the proposed encoding the Segment based Encoding Scheme and denote it by SES. The paper is organized as follows: Sect. 2 describes related work, Sect. 3 reviews the SAI model, Sect. 4 explains the proposed encoding scheme, Sect. 5 analyses its performance and Sect. 6 concludes the paper.

2 Related Work

Although large data volume is needed in many fields of computation, the main challenge is to extract effective information from it because of sparsity. [6] deals with sparsity in collaborative filtering using emotion and semantic based features. [7] handles sparsity in Twitter sentiment analysis. The multidimensional indexed array based encoding scheme using history-offset was introduced in [8] and can also be found in [9, 10]. [11] presents an encoding scheme that incurs indexing overhead and operates efficiently only up to the 4th dimension. [12] offers an encoding scheme whose compression ratio is not suitable for higher dimensional arrays. Most of the index models mentioned above demand nD indexing. In this paper, we present a scalable segment based encoding scheme that uses only 2D indices for an nD array and can therefore be expected to perform better than the other schemes.

3 Segment Based Array Indexing (SAI)

The SAI scheme is a 2D representation of an nD array that allocates small segments. Consider an nD Conventional Multidimensional Array (CMA(n)) of size \( A\left[ {l_{1} ,l_{2} , \ldots ,l_{n} } \right] \), where \( l_{i} \) is the length of dimension \( d_{i} \), and let \( {<}x_{1} ,x_{2} , \ldots ,x_{n} {>} \) be the Real nD Index (RnI). Of the n dimensions of CMA(n), the \( \left\lceil {\frac{n}{2}} \right\rceil \) odd dimensions are placed along the row direction \( d_{1}^{{\prime }} \) and the remaining \( \left\lfloor {\frac{n}{2}} \right\rfloor \) even dimensions are placed along the column direction \( d_{2}^{{\prime }} \). The CMA(n) is thus converted to \( A^{{\prime }} \left[ {l_{1}^{{\prime }} ,l_{2}^{{\prime }} } \right] \), and \( {<} x_{1}^{{\prime }} ,x_{2}^{{\prime }} {>} \) is the Revised 2D Index (R2I), where \( l_{1}^{{\prime }} \) and \( l_{2}^{{\prime }} \) are the lengths of \( d_{1}^{{\prime }} \) and \( d_{2}^{{\prime }} \) respectively. The conversion from \( {<}x_{1} ,x_{2} , \ldots ,x_{n} {>} \) to \( {<} x_{1}^{{\prime }} ,x_{2}^{{\prime }} {>} \) is done as follows:

$$ \begin{aligned} {\text{x}}_{1}^{{\prime }} & = {\text{x}}_{1} l_{3} l_{5} \ldots l_{n - 3} l_{r} + {\text{x}}_{3} l_{5} \ldots l_{n - 3} l_{r} + \ldots + {\text{x}}_{\text{r}} \\ {\text{x}}_{2}^{'} & = {\text{x}}_{2} l_{4} l_{6} \ldots l_{n - 3} l_{c} + {\text{x}}_{4} l_{6} \ldots l_{n - 3} l_{c} + \ldots + {\text{x}}_{\text{c}} \\ {\text{r}} & = \left\{ {\begin{array}{*{20}c} {n - 1, {\text{if}}\,{\text{n}}\,{\text{is}}\,{\text{even}}} \\ {n,\quad \quad {\text{if}}\,{\text{n}}\,{\text{is}}\,{\text{odd}}} \\ \end{array} } \right. ; c = \left\{ {\begin{array}{*{20}c} {n - 1, {\text{if}}\,{\text{n}}\,{\text{is}}\,{\text{odd}}} \\ {n, \quad {\text{if}}\,{\text{n}}\,{\text{is}}\,{\text{even}}} \\ \end{array} } \right. \\ \end{aligned} $$
(1)
$$ \begin{aligned} f\left( {{\text{x}}_{1}^{{\prime }} ,{\text{x}}_{2}^{{\prime }} } \right) & = \left\{ {\begin{array}{*{20}c} {{\text{x}}_{1}^{{\prime }} \times l_{2}^{{\prime }} + {\text{x}}_{2}^{{\prime }} , \,{\text{if}}\,d_{1}^{{\prime }} {\text{holds}}\,{\text{the}}\,{\text{SA}}} \\ {{\text{x}}_{2}^{{\prime }} \times l_{1}^{{\prime }} + {\text{x}}_{1}^{{\prime }} , \,{\text{if}}\,d_{2}^{{\prime }} {\text{holds}}\,{\text{the}}\,{\text{SA}}} \\ \end{array} } \right. \\ l_{1}^{{\prime }} & = l_{1} \times l_{3} \times \ldots \times l_{r} ;\quad l_{2}^{{\prime }} = l_{2} \times l_{4} \times \ldots \times l_{c} \\ \end{aligned} $$
(2)
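As a minimal sketch of Eqs. (1) and (2), the conversion can be written with Horner's rule over the odd and even dimensions. The function names `to_r2i`, `r2i_lengths` and `linear_offset` are hypothetical, and the sketch covers only a fixed-size CMA(n) before any extension (after extensions, the coefficients come from the MCT instead).

```python
from math import prod


def to_r2i(rni, lengths):
    """Map a real nD index (RnI) to a revised 2D index (R2I), as in Eq. (1)."""
    if len(rni) != len(lengths):
        raise ValueError("index and lengths must have the same number of dimensions")

    def fold(xs, ls):
        # Horner form of x_1*l_3*l_5*... + x_3*l_5*... + ... + x_r
        acc = 0
        for x, l in zip(xs, ls):
            acc = acc * l + x
        return acc

    # Odd dimensions (d1, d3, ...) form the row index, even ones the column index.
    x1p = fold(rni[0::2], lengths[0::2])
    x2p = fold(rni[1::2], lengths[1::2])
    return x1p, x2p


def r2i_lengths(lengths):
    """Return (l1', l2'), the products of the odd and even dimension lengths."""
    return prod(lengths[0::2]), prod(lengths[1::2])


def linear_offset(x1p, x2p, l1p, l2p, sa_on_row):
    """Eq. (2): offset of <x1', x2'> depending on which direction holds the SA."""
    return x1p * l2p + x2p if sa_on_row else x2p * l1p + x1p
```

For the CMA(5) of size [2, 2, 2, 2, 2] used in this section, `to_r2i((1, 0, 1, 1, 0), (2, 2, 2, 2, 2))` gives the R2I of the index <1, 0, 1, 1, 0>, and `r2i_lengths` gives the 2D size.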

For an extension along \( d_{i} \), the SA size (saz) is calculated as \( saz = \prod\nolimits_{j = 1, j \ne i}^{n} {l_{j} } \), where \( l_{j} \) is the length of \( d_{j} \). If the direction of the SA is on \( d_{2}^{{\prime }} \), the segment size sgz is \( l_{1}^{{\prime }} \); otherwise it is \( l_{2}^{{\prime }} \). The number of segments is calculated as \( nos = \frac{saz}{sgz} \). Figure 1(a) shows a CMA(5) of size [2, 2, 2, 2, 2] represented by a SAI of size \( \left[ {l_{1}^{{\prime }} ,l_{2}^{{\prime }} } \right] \), i.e. [8, 4]. The CMA index <1, 0, 1, 1, 0> is converted to the SAI index <6, 1>. Figure 1(b) shows the segmentation of Fig. 1(a). Here, saz = 32, \( l_{2}^{{\prime }} = 4 \) and nos is \( \frac{32}{4} = 8 \). To attain scalability, the SAI includes five types of 2D Supplementary Tables (ST): the History Table (HT) stores the construction history; the Index Table (IT) stores the initial index of the corresponding extended dimension; the Extend Dimension Table (EDT) tracks the direction of extension by holding a value from 1 to n; the Multiplicative Coefficient Table (MCT) stores the coefficients of \( {\text{x}}_{1}^{{\prime }} \) or \( {\text{x}}_{2}^{{\prime }} \) (Eq. 1); and the Address Table (AT) stores the first address of the first segment of the SA.
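The segment arithmetic for an extension can be sketched as follows. The function name `segment_layout` is hypothetical, and the sketch assumes that an extension on an even dimension places the SA on \( d_{2}^{{\prime }} \) and an extension on an odd dimension places it on \( d_{1}^{{\prime }} \), in line with the placement of odd and even dimensions described above.

```python
from math import prod


def segment_layout(lengths, i):
    """Return (saz, sgz, nos) for an extension along dimension d_i (1-based).

    lengths: current lengths (l_1, ..., l_n) of the array.
    """
    n = len(lengths)
    if not 1 <= i <= n:
        raise ValueError("i must be between 1 and n")
    # saz: product of all lengths except the extended dimension
    saz = prod(l for j, l in enumerate(lengths, start=1) if j != i)
    l1p = prod(lengths[0::2])  # l1', row direction (odd dimensions)
    l2p = prod(lengths[1::2])  # l2', column direction (even dimensions)
    # Assumption: even i means the SA lies on d2', so a segment spans l1' cells.
    sgz = l1p if i % 2 == 0 else l2p
    nos = saz // sgz
    return saz, sgz, nos
```

For example, extending the [2, 2, 2, 2, 2] array along \( d_{2} \) yields a subarray of 16 cells split into 2 segments of 8 cells each.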

Fig. 1. Dimension transformation of a CMA(5)

Figure 2(a) shows a SAI after extensions on \( d_{2} \), \( d_{1} \) and \( d_{4} \), in that order. The bold dotted SA shows the extension on \( d_{2} \). Here, saz is 16 (i.e. \( 2^{4} \)), sgz is 8 (i.e. \( l_{1}^{{\prime }} \)) and nos is 2 (i.e. \( \frac{16}{8} \)). The first address of the first segment (i.e. 32) is stored in ST2[1].AT, and the new history value is stored in ST2[1].HT. The new value of \( l_{2} \) (i.e. 2) is stored in ST2[1].IT, and the extended dimension \( d_{2} \) is stored in ST2[1].EDT (i.e. 2). To retrieve a data item, let the row indices be \( \left( {x_{1} ,x_{3} , \ldots ,x_{r} } \right) \) and the column indices be \( \left( {x_{2} ,x_{4} , \ldots ,x_{c} } \right) \). Let \( max \)() return the maximum value and \( Cmax \)() return the number of occurrences of that maximum. Find \( max_{r} = x_{\alpha } = max \left( {x_{1} ,x_{3} , \ldots ,x_{r} } \right) \), \( m_{r} = Cmax \left( {x_{1} ,x_{3} , \ldots ,x_{r} } \right) \) and \( max_{c} = x_{\beta } = max \left( {x_{2} ,x_{4} , \ldots ,x_{c} } \right) \), \( m_{c} = Cmax\left( {x_{2} , x_{4} , \ldots ,x_{c} } \right) \), where \( max_{r} \) is the maximum index value in the row direction, \( \alpha \) is the dimension in the row direction whose index \( x_{\alpha } \) holds \( max_{r} \), and \( m_{r} \) is the number of row indices equal to \( max_{r} \); \( max_{c} \), \( \beta \) and \( m_{c} \) are defined likewise for the column direction. To find i (or j) from ST1 (or ST2), there are two cases depending on \( m_{r} \) (or \( m_{c} \)):

Fig. 2. A realization of a SES System

  i. If \( m_{r} = 1 \), find i such that ST1[i].\( {\text{IT}} = max_{r} = x_{\alpha } \) and ST1[i].\( {\text{EDT}} = \alpha \).

  ii. If \( m_{r} > 1 \), let \( m_{r} = a \). Let \( i_{1} ,i_{2} , \ldots ,i_{a} \) be the entries that contain \( max_{r} \), i.e. ST1[\( i_{k} \)].\( {\text{IT}} = max_{r} = x_{\alpha } \) and ST1[\( i_{k} \)].\( {\text{EDT}} = \alpha \) for the corresponding dimension \( \alpha \), where \( 1 \le k \le a \). Among \( i_{1} ,i_{2} , \ldots ,i_{a} \), find \( h_{max} = \hbox{max} \left( {{\text{ST}}_{1} \left[ {i_{1} } \right].HT, {\text{ST}}_{1} \left[ {i_{2} } \right].HT, \ldots ,{\text{ST}}_{1} \left[ {i_{a} } \right].HT} \right) \) and select i such that \( h_{max} = {\text{ST}}_{1} \left[ i \right].HT \).
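The two cases above can be sketched in a single routine that collects every table entry matching a maximal index and keeps the one with the largest history value. The function `select_entry`, the dictionary layout of the tables and the fallback to entry 0 for indices inside the initial array are assumptions made for illustration. The history values 1, 2 and 3 are likewise assumed from the extension order \( d_{2} \), \( d_{1} \), \( d_{4} \) of Fig. 2(a); the source states only the IT and EDT values.

```python
def select_entry(table, indices, dims):
    """Pick the supplementary table entry for one direction (rows or columns).

    table:   list of dicts with keys "IT", "EDT" and "HT"
    indices: index values in this direction, e.g. (x1, x3, x5)
    dims:    the matching dimension numbers, e.g. (1, 3, 5)
    """
    peak = max(indices)
    # Every dimension that holds the maximum may match one table entry.
    candidates = [
        k
        for k, entry in enumerate(table)
        for x, d in zip(indices, dims)
        if x == peak and entry["IT"] == peak and entry["EDT"] == d
    ]
    if not candidates:
        return 0  # assumption: the cell lies in the initial array (entry 0)
    # Case i has one candidate; case ii keeps the largest history value.
    return max(candidates, key=lambda k: table[k]["HT"])


# Hypothetical tables for Fig. 2(a); HT values are assumed, not from the source.
ST1 = [{"IT": 0, "EDT": 0, "HT": 0}, {"IT": 2, "EDT": 1, "HT": 2}]
ST2 = [
    {"IT": 0, "EDT": 0, "HT": 0},
    {"IT": 2, "EDT": 2, "HT": 1},
    {"IT": 2, "EDT": 4, "HT": 3},
]
```

With these tables, the row indices (2, 1, 1) select entry 1 of ST1, and the column indices (2, 2) select entry 2 of ST2.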

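Find \( H_{max} = \hbox{max} \left( {{\text{ST}}_{1} \left[ i \right].{\text{HT}}, {\text{ST}}_{2} \left[ j \right].{\text{HT}}} \right) \), which determines the SA direction, and compute \( {\text{x}}_{1}^{{\prime }} ,{\text{x}}_{2}^{{\prime }} \) as follows: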

$$ {\text{x}}_{1}^{{\prime }} = {\text{x}}_{1} {\text{ST}}_{1} \left[ i \right].{\text{MCT}}\left[ 0 \right] + {\text{x}}_{3} {\text{ST}}_{1} \left[ i \right].{\text{MCT}}\left[ 1 \right] + \ldots + {\text{x}}_{\text{r}} {\text{ST}}_{1} \left[ i \right].{\text{MCT}}\left[ {\left\lceil {\frac{\text{n}}{2}} \right\rceil - 1} \right] $$
$$ {\text{x}}_{2}^{{\prime }} = {\text{x}}_{2} {\text{ST}}_{2} \left[ j \right].{\text{MCT}}\left[ 0 \right] + {\text{x}}_{4} {\text{ST}}_{2} \left[ j \right].{\text{MCT}}\left[ 1 \right] + \ldots + {\text{x}}_{\text{c}} {\text{ST}}_{2} \left[ j \right].{\text{MCT}}\left[ {\left\lfloor {\frac{\text{n}}{2}} \right\rfloor - 1} \right] $$
(3)

Let \( {\text{ST}}_{1} \left[ i \right].MCT_{max} \) denote the maximum coefficient in \( {\text{ST}}_{1} \left[ i \right].MCT \), and likewise for \( {\text{ST}}_{2} \left[ j \right] \). The start index \( \left( {sx^{{\prime }} } \right) \), segment number (SN), segment’s first address (SFA) and value (VALUE) are then found as follows:

$$ sx^{{\prime }} = \left\{ {\begin{array}{*{20}c} {{\text{ST}}_{1} \left[ i \right].IT \times {\text{ST}}_{1} \left[ i \right].MCT_{max} , {\text{when }}\,SA \,exists \,on \,d_{1}^{{\prime }} } \\ {{\text{ST}}_{2} \left[ j \right].IT \times {\text{ST}}_{2} \left[ j \right].MCT_{max} , {\text{when }}\,SA\, exists \,on \,d_{2}^{{\prime }} } \\ \end{array} } \right. $$
(4)
$$ {\text{SN}} = \left\{ {\begin{array}{*{20}c} {{\text{x}}_{1}^{{\prime }} - sx^{{\prime }} , {\text{when }}\,SA \,exists \,on\, d_{1}^{{\prime }} } \\ {{\text{x}}_{2}^{{\prime }} - sx^{{\prime }} , {\text{when}}\, SA \,exists \,on \,d_{2}^{{\prime }} } \\ \end{array} } \right. $$
(5)
$$ SFA = \left\{ {\begin{array}{*{20}c} {{\text{ST}}_{1} \left[ i \right].AT\left[ 0 \right] + SN \times l_{2}^{{\prime }} , {\text{when }}\,SA \,exists \,on\, d_{1}^{{\prime }} } \\ {{\text{ST}}_{2} \left[ j \right].AT\left[ 0 \right] + SN \times l_{1}^{{\prime }} , {\text{when}}\, SA\, exists\, on\, d_{2}^{{\prime }} } \\ \end{array} } \right. $$
(6)
$$ {\text{VALUE}} = \left\{ {\begin{array}{*{20}c} {SFA + {\text{x}}_{2}^{{\prime }} , {\text{when}}\, SA\, exists\, on\, d_{1}^{{\prime }} } \\ {SFA + {\text{x}}_{1}^{{\prime }} , {\text{when }}\,SA \,exists \,on\, d_{2}^{{\prime }} } \\ \end{array} } \right. $$
(7)
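The address computation from the MCT-based R2I through Eqs. (4) to (7) can be sketched as below. The function `locate` and the dictionary layout are hypothetical, the entry whose history value is larger is taken to hold the SA, and a single `AT` value stands for the first address of the SA's first segment. The table values follow the worked example in this section, except the AT of the ST1 entry and \( l_{2}^{{\prime }} = 9 \), which are assumed and not used by the example.

```python
def locate(rni, st1_entry, st2_entry, l1p, l2p):
    """Compute R2I, SN, SFA and VALUE for an RnI from the selected table entries."""
    rows, cols = rni[0::2], rni[1::2]
    # R2I from the multiplicative coefficients of the selected entries
    x1p = sum(x * c for x, c in zip(rows, st1_entry["MCT"]))
    x2p = sum(x * c for x, c in zip(cols, st2_entry["MCT"]))
    # The entry with the larger history value holds the subarray.
    if st1_entry["HT"] > st2_entry["HT"]:
        entry, major, minor, seg_len = st1_entry, x1p, x2p, l2p  # SA on d1'
    else:
        entry, major, minor, seg_len = st2_entry, x2p, x1p, l1p  # SA on d2'
    sx = entry["IT"] * max(entry["MCT"])  # Eq. (4): start index
    sn = major - sx                       # Eq. (5): segment number
    sfa = entry["AT"] + sn * seg_len      # Eq. (6): segment's first address
    return {"R2I": (x1p, x2p), "SN": sn, "SFA": sfa, "VALUE": sfa + minor}  # Eq. (7)


# Entries for the worked example; "AT": 48 is an assumed value, unused here.
ST1_1 = {"IT": 2, "EDT": 1, "HT": 2, "MCT": [4, 2, 1], "AT": 48}
ST2_2 = {"IT": 2, "EDT": 4, "HT": 3, "MCT": [1, 3], "AT": 72}

result = locate((2, 2, 1, 2, 1), ST1_1, ST2_2, l1p=12, l2p=9)
```

Here `result` holds the R2I <11, 8>, SN = 2, SFA = 96 and VALUE = 107 for the index (2, 2, 1, 2, 1).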

Let \( \left( {x_{1} ,x_{2} ,x_{3} ,x_{4} ,x_{5} } \right) = \left( {2,2,1,2,1} \right) \). For the row direction, \( max_{r} = 2 \), \( Cmax\left( {2, 1,1} \right) = 1 \) and \( x_{\alpha } = x_{1} = 2,\alpha = 1 \). Select \( {\text{ST}}_{1} \) index i = 1 (\( {\text{ST}}_{1} \left[ 1 \right].{\text{IT}} = 2 \) and \( {\text{ST}}_{1} \left[ 1 \right].{\text{EDT}} = 1 \)). For the column direction, \( max_{c} = 2 \) and \( Cmax\left( {2, 2} \right) = 2 \). The candidate \( {\text{ST}}_{2} \) indices are \( j_{1} = 1 \) and \( j_{2} = 2 \), and j = 2 is selected because \( j_{2} \) has the larger history value.

Then \( {\text{x}}_{1}^{{\prime }} = 2 \times 4 + 1 \times 2 + 1 \times 1 = 11 \) and \( {\text{x}}_{2}^{{\prime }} = 2 \times 1 + 2 \times 3 = 8 \) (Eq. 1). \( H_{max} = {\text{ST}}_{2} \left[ 2 \right].{\text{HT}} \), \( sx^{{\prime }} = 2 \times 3 = 6 \) (Eq. 4), \( {\text{SN}} = 8 - 6 = 2 \) (Eq. 5), \( {\text{SFA}} = 72 + 2 \times 12 = 96 \) (Eq. 6) and \( {\text{VALUE}} = 96 + 11 = 107 \) (Eq. 7).

4 Segment Based Encoding Scheme

In the 2D SES scheme, the index along the SA direction is called the major index and the other is called the minor index. The 2D Address Table (AT) is replaced by a 1D Segment Address Table (SAT) that holds the location of each segment of an SA. For an empty segment, it stores the negation of the next available non-empty memory position. The SES representation of Fig. 2(a), in which empty cells are shaded, is shown in Fig. 2(b). Each non-empty cell is replaced by its minor index and its value. Consider the third segment of the first SA in Fig. 2(a). The minor index is \( x_{2}^{{\prime }} \), so the three tuples <\( 0,8 \)>, <\( 2,10 \)> and <\( 3,11 \)> are stored by SES. The location of tuple <\( 0,8 \)> is pointed to by SAT[0][2]. As the next segment is empty, its SAT entry stores −5. To encode a non-empty array cell, the R2I, Hmax and SN are calculated (see Sect. 3). The segment’s first non-empty cell is located through SAT[Hmax][SN], and a binary search is then performed on the minor index. For example, given the RnI \( \left( {x_{1} ,x_{2} ,x_{3} ,x_{4} , x_{5} } \right) = \left( {2, \, 2, \, 0, \, 2, \, 0} \right) \), the R2I, Hmax and SN are <8,8>, 3 and 2, respectively. As SAT[Hmax][SN] > 0, the segment is non-empty. The minor index \( x_{1}^{{\prime }} \), i.e. 8, then shows that the value is 104.
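A lookup in the encoded structure can be sketched as a SAT access followed by a binary search on the minor index. Everything in this sketch beyond the tuples <0,8>, <2,10>, <3,11> and the entry −5 is hypothetical: the function `ses_lookup`, the flat `store` list, the 1-based positions (chosen so that the sign test is unambiguous; the source does not state the base), the extra tuples and the trailing sentinel entry that marks the end of the last segment.

```python
from bisect import bisect_left


def ses_lookup(store, sat, h, sn, minor):
    """Return the value stored for (h, sn, minor), or None if the cell is empty.

    store: flat list of (minor index, value) tuples, sorted by minor index
           within each segment
    sat:   sat[h][sn] is the 1-based position of the segment's first tuple,
           or the negated next available position for an empty segment
    """
    entry = sat[h][sn]
    if entry <= 0:
        return None  # empty segment
    start = entry - 1
    # The magnitude of the next SAT entry gives where this segment ends.
    end = abs(sat[h][sn + 1]) - 1
    # (minor,) sorts just before any (minor, value), so this finds the tuple.
    pos = bisect_left(store, (minor,), start, end)
    if pos < end and store[pos][0] == minor:
        return store[pos][1]
    return None


# Toy data: segment 2 and the -5 entry follow the text, the rest is made up.
store = [(1, 5), (0, 8), (2, 10), (3, 11), (1, 17)]
sat = [[-1, 1, 2, -5, 5, -6]]  # the last entry is an assumed end sentinel
```

With this toy data, `ses_lookup(store, sat, 0, 2, 2)` returns 10, while a lookup in the empty fourth segment returns `None`.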

5 Performance Analysis

The analysis was carried out on an Intel(R) Xeon(R) E5620 @ 2.40 GHz machine with 8 processors, 32 GB RAM and 1406 MB cache memory. The program is written in C and the data size is 8 bytes. SES is compared with HSOEA [11] and EaCRS [12]. The cost of EaCRS is always higher than that of the others, as it requires nD history and column information and (n−1)D row information for data encoding. Figure 3 illustrates the storage requirements. The SES and HSOEA schemes require two parameters to encode nD data. Figure 3(a) shows the storage requirement for varying dimensions and Fig. 3(b) shows it for varying data density \( \rho = \frac{number \,of\, non - empty \,cells}{total\, number \,of\, cells}, 0 \le \rho \le 1 \). Figure 4 shows the encoding costs. HSOEA requires nD history, a 2D segment number and a 2D offset. SES outperforms the others as it requires only 2D indices to encode nD data. The range of usability \( (\upsilon ) \) of an encoding scheme is the greatest \( \rho \) for which the compression ratio \( \eta = \frac{size \,of\, compressed\, array}{size\, of\, uncompressed\, array} \) satisfies \( \eta < 1 \). SES and HSOEA reach a range of usability of \( \rho \) = 0.66, which is higher than that of the EaCRS scheme, as shown in Fig. 5. SES and HSOEA keep track of each segment’s first address, as the SA is divided into segments, whereas EaCRS does not offer segmentation. HSOEA needs nD indices and extra metrics beyond the 4th dimension. The indexing overhead is therefore lower in SES, as shown in Fig. 6.

Fig. 3. Storage requirement

Fig. 4. Encoding cost

Fig. 5. Range of usability, \( \upsilon \)

Fig. 6. Indexing overhead

6 Conclusion

The volume of data that must be handled is growing steadily. In the real world, however, the amount of effective data is very small because of sparsity. It is challenging to deal with sparsity while keeping additional costs, such as that of data compression, low. Here, we have presented a 2D encoding scheme for nD arrays. We have shown that the proposed scheme can effectively encode data up to a density of 66% while reducing the indexing and encoding costs. It can be used for big data storage and in parallel or multiprocessor environments.