An Efficient Encoding Scheme for Dynamic Multidimensional Datasets

Omar, Mehnuma Tabassum; Azharul Hasan, K. M.

doi:10.1007/978-3-319-69900-4_66

Conference paper
First Online: 01 November 2017

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10597)

Abstract

Big Data involves composite datasets of undefined volume and unspecified arrival rate [1]. The index array improves on conventional approaches at keeping pace with data velocity because it allows arbitrary expansion along the boundary of any array dimension. The major concern of large-volume applications such as “Big Data” is to accommodate data volume and high velocity for further operations. In this paper we offer a scalable encoding scheme that replaces data block allocation with segment allocation and reorganizes the n dimensions of an array into only 2 dimensions. Hence it requires only 2 indices for data encoding and offers a low indexing cost.

1 Introduction

Data volumes are progressively expanding to terabytes and petabytes and are expected to lead to exascale computing [1]. Array-based models such as the Conventional Multidimensional Array (CMA) can dominate other structures because they are easy to maintain, but the CMA is not scalable. The Index Array model [2, 3] overcomes this limitation by dynamically allocating memory at run time in the form of subarrays (SA). However, it cannot deliver the memory utilization that large data volumes demand, especially for “Big Data” applications [1], because of address space overflow. It is also quite difficult to visualize a large volume of data. The structure described in [4] enhances the data volume capacity of the Index Array by handling address space overflow, and it can also visualize a large volume of data by representing n dimensions in only 2 dimensions. The most challenging task in large-volume applications, however, is to extract useful information, because volume entails sparsity [1]. Data encoding is an effective way to preserve only those cells that are meaningful and non-empty [5]. In this paper, we propose an encoding scheme based on the Segment based Array Indexing (SAI) scheme [4]. It is a segment-based structure that encodes the 2D indices of the SAI structure. We name the proposed encoding the Segment based Encoding Scheme, denoted SES. The paper is organized as follows: Sect. 2 describes related work, Sect. 3 reviews the SAI model, Sect. 4 explains the proposed encoding scheme, Sect. 5 analyses its performance and Sect. 6 concludes.

2 Related Work

Although large volume is a much-needed property in various fields of computation, the main challenge is to extract effective information from that volume in the presence of sparsity. [6] deals with sparsity in collaborative filtering using emotion and semantic based features. [7] handles sparsity for Twitter sentiment analysis. The multidimensional indexed-array-based encoding scheme using history-offset was introduced in [8] and can also be found in [9, 10]. [11] presents an encoding scheme that suffers from indexing overhead and operates efficiently only up to the fourth dimension. [12] offers an encoding scheme whose compression ratio is not suitable for higher-dimensional arrays. Most of the index models mentioned above demand nD indexing. In this paper, we present a scalable segment-based encoding scheme that uses only 2D indices for an nD array and can therefore deliver better performance than the other schemes.

3 Segment Based Array Indexing (SAI)

The SAI scheme is a 2D representation of an nD array that allocates small segments. Consider an nD Conventional Multidimensional Array (CMA(n)) of size $ A\left[ {l_{1} ,l_{2} , \ldots ,l_{n} } \right] $, where $ l_{i} $ is the length of dimension $ d_{i} $, and let $ \langle x_{1} ,x_{2} , \ldots ,x_{n} \rangle $ be the Real nD Index (RnI). In CMA(n), the $ \left\lceil {\frac{n}{2}} \right\rceil $ odd dimensions are laid out along the row direction $ d_{1}^{\prime} $ and the remaining $ \left\lfloor {\frac{n}{2}} \right\rfloor $ even dimensions are laid out along the column direction $ d_{2}^{\prime} $. The CMA(n) is thus converted to $ A^{\prime} \left[ {l_{1}^{\prime} ,l_{2}^{\prime} } \right] $ with the Revised 2D Index (R2I) $ \langle x_{1}^{\prime} ,x_{2}^{\prime} \rangle $, where $ l_{1}^{\prime} $ and $ l_{2}^{\prime} $ are the lengths of $ d_{1}^{\prime} $ and $ d_{2}^{\prime} $ respectively. The conversion from $ \langle x_{1} ,x_{2} , \ldots ,x_{n} \rangle $ to $ \langle x_{1}^{\prime} ,x_{2}^{\prime} \rangle $ is done as follows:

$$ \begin{aligned} x_{1}^{\prime} & = x_{1} l_{3} l_{5} \ldots l_{r} + x_{3} l_{5} \ldots l_{r} + \ldots + x_{r} \\ x_{2}^{\prime} & = x_{2} l_{4} l_{6} \ldots l_{c} + x_{4} l_{6} \ldots l_{c} + \ldots + x_{c} \\ r & = \begin{cases} n - 1, & \text{if } n \text{ is even} \\ n, & \text{if } n \text{ is odd} \end{cases} ;\quad c = \begin{cases} n - 1, & \text{if } n \text{ is odd} \\ n, & \text{if } n \text{ is even} \end{cases} \end{aligned} \tag{1} $$

$$ \begin{aligned} f\left( {x_{1}^{\prime} ,x_{2}^{\prime} } \right) & = \begin{cases} x_{1}^{\prime} \times l_{2}^{\prime} + x_{2}^{\prime} , & \text{if } d_{1}^{\prime} \text{ holds the SA} \\ x_{2}^{\prime} \times l_{1}^{\prime} + x_{1}^{\prime} , & \text{if } d_{2}^{\prime} \text{ holds the SA} \end{cases} \\ l_{1}^{\prime} & = l_{1} \times l_{3} \times \ldots \times l_{r} ;\quad l_{2}^{\prime} = l_{2} \times l_{4} \times \ldots \times l_{c} \end{aligned} \tag{2} $$
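The conversion of Eqs. 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rni_to_r2i`, `r2i_lengths` and `linear_offset` are hypothetical, indexes are 0-based, and the array is taken to be the 2 × 2 × 2 × 2 × 2 example used in this section.

```python
from math import prod


def rni_to_r2i(x, lengths):
    """Fold a real n-D index (RnI) into a revised 2-D index (R2I), following Eq. 1."""

    def fold(positions):
        # Horner form of x_a*l_b*l_c*... + x_b*l_c*... + ... + x_last
        acc = 0
        for p in positions:
            acc = acc * lengths[p] + x[p]
        return acc

    n = len(x)
    # 0-based positions 0, 2, 4, ... hold d1, d3, d5, ... (row direction);
    # positions 1, 3, 5, ... hold d2, d4, d6, ... (column direction).
    return fold(range(0, n, 2)), fold(range(1, n, 2))


def r2i_lengths(lengths):
    """Lengths l1' and l2' of the two revised dimensions (Eq. 2)."""
    return prod(lengths[0::2]), prod(lengths[1::2])


def linear_offset(x1p, x2p, l1p, l2p, sa_on_d1):
    """Offset f(x1', x2') of Eq. 2; sa_on_d1 tells which direction holds the SA."""
    return x1p * l2p + x2p if sa_on_d1 else x2p * l1p + x1p


lengths = (2, 2, 2, 2, 2)
print(rni_to_r2i((1, 0, 1, 1, 0), lengths))  # (6, 1)
print(r2i_lengths(lengths))                   # (8, 4)
```

The Horner-style loop avoids recomputing the trailing products of lengths for every term of Eq. 1.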

For an extension along $ d_{i} $, the SA size (saz) is calculated as $ saz = \prod\nolimits_{j = 1, j \ne i}^{n} {l_{j} } $, where $ l_{j} $ is the length of $ d_{j} $. If the SA lies on $ d_{2}^{\prime} $, the segment size sgz is $ l_{1}^{\prime} $; otherwise it is $ l_{2}^{\prime} $. The number of segments is $ nos = \frac{saz}{sgz} $. Figure 1(a) shows a CMA(5) of size [2, 2, 2, 2, 2] represented by a SAI of size $ \left[ {l_{1}^{\prime} ,l_{2}^{\prime} } \right] $ = [8, 4]. The CMA index <1, 0, 1, 1, 0> is converted to the SAI index <6, 1>, since $ x_{1}^{\prime} = 1 \times 4 + 1 \times 2 + 0 = 6 $ and $ x_{2}^{\prime} = 0 \times 2 + 1 = 1 $. Figure 1(b) shows the segmentation of Fig. 1(a). Here, saz = 32, $ l_{2}^{\prime} = 4 $ and nos is $ \frac{32}{4} = 8 $. To attain scalability, the SAI includes five types of 2D Supplementary Tables (ST): the History Table (HT) stores the construction history; the Index Table (IT) stores the initial index of the corresponding extended dimension; the Extend Dimension Table (EDT) tracks the direction of extension as a value from 1 to n; the Multiplicative Coefficient Table (MCT) stores the coefficients of $ x_{1}^{\prime} $ or $ x_{2}^{\prime} $ (Eq. 1); and the Address Table (AT) stores the first address of the first segment of the SA.
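The sizing rules for a new SA can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `segment_layout` is hypothetical, and `lengths` is assumed to hold the dimension lengths before the extension takes place.

```python
from math import prod


def segment_layout(lengths, i):
    """Return (saz, sgz, nos) for an extension along dimension d_i (1-based).

    `lengths` are the dimension lengths l_1..l_n before the extension.
    """
    saz = prod(l for j, l in enumerate(lengths, start=1) if j != i)
    l1p = prod(lengths[0::2])  # l1': product of the odd-dimension lengths
    l2p = prod(lengths[1::2])  # l2': product of the even-dimension lengths
    # An odd d_i grows the row direction d1', so a segment spans l2' cells;
    # an even d_i grows the column direction d2', so a segment spans l1' cells.
    sgz = l2p if i % 2 == 1 else l1p
    return saz, sgz, saz // sgz


# Extending the 2 x 2 x 2 x 2 x 2 array along d2.
print(segment_layout([2, 2, 2, 2, 2], 2))  # (16, 8, 2)
```

Because sgz is the product of the lengths in the direction that does not contain $ d_{i} $, it always divides saz exactly, so the integer division loses nothing.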

Figure 2(a) shows a SAI after extensions on $ d_{2} $, $ d_{1} $ and $ d_{4} $, in that order. The bold dotted SA shows the extension on $ d_{2} $. Here, saz is 16 (i.e. $ 2^{4} $), sgz is 8 (i.e. $ l_{1}^{\prime} $) and nos is 2 (i.e. $ \frac{16}{8} $). The first address of the first segment (i.e. 32) is stored in ST₂[1].AT. The new history value is stored in ST₂[1].HT. The new index along $ d_{2} $ (i.e. 2) is stored in ST₂[1].IT, and the extended dimension $ d_{2} $ is recorded in ST₂[1].EDT (i.e. 2). To retrieve a data item, let the row indexes be $ \left( {x_{1} ,x_{3} , \ldots ,x_{r} } \right) $ and the column indexes be $ \left( {x_{2} ,x_{4} , \ldots ,x_{c} } \right) $. Let $ max $() return the maximum value and $ Cmax $() return the number of indexes that attain it. Find $ max_{r} = x_{\alpha } = max\left( {x_{1} ,x_{3} , \ldots ,x_{r} } \right) $, $ m_{r} = Cmax\left( {x_{1} ,x_{3} , \ldots ,x_{r} } \right) $ and $ max_{c} = x_{\beta } = max\left( {x_{2} ,x_{4} , \ldots ,x_{c} } \right) $, $ m_{c} = Cmax\left( {x_{2} ,x_{4} , \ldots ,x_{c} } \right) $, where $ max_{r} $ is the maximum index value in the row direction, $ \alpha $ is the dimension in the row direction whose index $ x_{\alpha } $ equals $ max_{r} $, and $ m_{r} $ is the number of indexes equal to $ max_{r} $; $ max_{c} $, $ \beta $ and $ m_{c} $ are defined likewise for the column direction. To find i (or j) from ST₁ (or ST₂), there are two cases depending on $ m_{r} $ (or $ m_{c} $):

i. If $ m_{r} = 1 $, find i such that ST₁[i].IT = $ max_{r} = x_{\alpha } $ and ST₁[i].EDT = $ \alpha $.
ii. If $ m_{r} = a > 1 $, let $ i_{1} ,i_{2} , \ldots ,i_{a} $ be the entries such that ST₁[$ i_{k} $].IT = $ max_{r} = x_{\alpha_{k} } $ and ST₁[$ i_{k} $].EDT = $ \alpha_{k} $ for $ 1 \le k \le a $, where $ \alpha_{1} , \ldots ,\alpha_{a} $ are the dimensions whose indexes equal $ max_{r} $. Among these entries, find $ h_{max} = \max \left( {{\text{ST}}_{1} \left[ {i_{1} } \right].HT,{\text{ST}}_{1} \left[ {i_{2} } \right].HT, \ldots ,{\text{ST}}_{1} \left[ {i_{a} } \right].HT} \right) $ and select the i for which $ {\text{ST}}_{1} \left[ i \right].HT = h_{max} $.
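The two cases above can be sketched as a single selection routine. The sketch makes several assumptions that the text does not state: each supplementary table is modelled as a list of dictionaries, row 0 is taken to describe the initial unextended array and is used as the fallback when no extension row matches, and the history values 1 and 2 are inferred from the extension order $ d_{2} $, $ d_{1} $, $ d_{4} $. The name `select_entry` is hypothetical.

```python
def select_entry(st, indexes, dims):
    """Pick the supplementary-table row for one direction (cases i and ii).

    st      -- list of rows, each a dict with keys "IT", "EDT" and "HT"
    indexes -- subscripts in this direction, e.g. (x1, x3, x5)
    dims    -- the matching dimension numbers, e.g. (1, 3, 5)
    """
    peak = max(indexes)
    candidates = [
        k
        for x, d in zip(indexes, dims)
        if x == peak
        for k, row in enumerate(st)
        if row["IT"] == peak and row["EDT"] == d
    ]
    if not candidates:
        return 0  # assumption: row 0 describes the initial, unextended array
    # One candidate is case i; several are case ii, resolved by largest history.
    return max(candidates, key=lambda k: st[k]["HT"])


# Table contents assumed for the array of Fig. 2(a); HT values 1 and 2 are inferred.
ST1 = [{"IT": 0, "EDT": 0, "HT": 0}, {"IT": 2, "EDT": 1, "HT": 2}]
ST2 = [
    {"IT": 0, "EDT": 0, "HT": 0},
    {"IT": 2, "EDT": 2, "HT": 1},
    {"IT": 2, "EDT": 4, "HT": 3},
]

print(select_entry(ST1, (2, 1, 1), (1, 3, 5)))  # 1
print(select_entry(ST2, (2, 2), (2, 4)))        # 2
```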

Find $ H_{max} = \max \left( {{\text{ST}}_{1} \left[ i \right].HT,{\text{ST}}_{2} \left[ j \right].HT} \right) $, which determines the SA direction, and compute $ x_{1}^{\prime} ,x_{2}^{\prime} $ as follows:

$$ \begin{aligned} x_{1}^{\prime} & = x_{1} \, {\text{ST}}_{1} \left[ i \right].{\text{MCT}}\left[ 0 \right] + x_{3} \, {\text{ST}}_{1} \left[ i \right].{\text{MCT}}\left[ 1 \right] + \ldots + x_{r} \, {\text{ST}}_{1} \left[ i \right].{\text{MCT}}\left[ {\left\lceil {\frac{n}{2}} \right\rceil - 1} \right] \\ x_{2}^{\prime} & = x_{2} \, {\text{ST}}_{2} \left[ j \right].{\text{MCT}}\left[ 0 \right] + x_{4} \, {\text{ST}}_{2} \left[ j \right].{\text{MCT}}\left[ 1 \right] + \ldots + x_{c} \, {\text{ST}}_{2} \left[ j \right].{\text{MCT}}\left[ {\left\lfloor {\frac{n}{2}} \right\rfloor - 1} \right] \end{aligned} \tag{3} $$

Let $ {\text{ST}}_{1} \left[ i \right].MCT_{max} $ denote the largest coefficient in $ {\text{ST}}_{1} \left[ i \right].{\text{MCT}} $, and likewise for $ {\text{ST}}_{2} \left[ j \right] $. The start index $ \left( {sx^{\prime} } \right) $, segment number (SN), segment’s first address (SFA) and value (VALUE) are then found as follows:

$$ sx^{\prime} = \begin{cases} {\text{ST}}_{1} \left[ i \right].IT \times {\text{ST}}_{1} \left[ i \right].MCT_{max} , & \text{when SA exists on } d_{1}^{\prime} \\ {\text{ST}}_{2} \left[ j \right].IT \times {\text{ST}}_{2} \left[ j \right].MCT_{max} , & \text{when SA exists on } d_{2}^{\prime} \end{cases} \tag{4} $$

$$ {\text{SN}} = \begin{cases} x_{1}^{\prime} - sx^{\prime} , & \text{when SA exists on } d_{1}^{\prime} \\ x_{2}^{\prime} - sx^{\prime} , & \text{when SA exists on } d_{2}^{\prime} \end{cases} \tag{5} $$

$$ {\text{SFA}} = \begin{cases} {\text{ST}}_{1} \left[ i \right].AT\left[ 0 \right] + {\text{SN}} \times l_{2}^{\prime} , & \text{when SA exists on } d_{1}^{\prime} \\ {\text{ST}}_{2} \left[ j \right].AT\left[ 0 \right] + {\text{SN}} \times l_{1}^{\prime} , & \text{when SA exists on } d_{2}^{\prime} \end{cases} \tag{6} $$

$$ {\text{VALUE}} = \begin{cases} {\text{SFA}} + x_{2}^{\prime} , & \text{when SA exists on } d_{1}^{\prime} \\ {\text{SFA}} + x_{1}^{\prime} , & \text{when SA exists on } d_{2}^{\prime} \end{cases} \tag{7} $$

Let $ \left( {x_{1} ,x_{2} ,x_{3} ,x_{4} ,x_{5} } \right) = \left( {2,2,1,2,1} \right) $. For the row direction, $ max_{r} = 2 $, $ Cmax\left( {2,1,1} \right) = 1 $ and $ x_{\alpha } = x_{1} = 2 $, $ \alpha = 1 $, so select $ {\text{ST}}_{1} $ index i = 1 ($ {\text{ST}}_{1} \left[ 1 \right].{\text{IT}} = 2 $ and $ {\text{ST}}_{1} \left[ 1 \right].{\text{EDT}} = 1 $). For the column direction, $ max_{c} = 2 $ and $ Cmax\left( {2,2} \right) = 2 $, so the $ {\text{ST}}_{2} $ indexes $ j_{1} = 1 $ and $ j_{2} = 2 $ both qualify, and j = 2 is selected because $ {\text{ST}}_{2} \left[ {j_{2} } \right] $ has the larger history value.

Then $ x_{1}^{\prime} = 2 \times 4 + 1 \times 2 + 1 \times 1 = 11 $ and $ x_{2}^{\prime} = 2 \times 1 + 2 \times 3 = 8 $ (Eq. 1). Since $ H_{max} = {\text{ST}}_{2} \left[ 2 \right].{\text{HT}} $, the SA lies on $ d_{2}^{\prime} $, so $ sx^{\prime} = 2 \times 3 = 6 $ (Eq. 4), $ {\text{SN}} = 8 - 6 = 2 $ (Eq. 5), $ {\text{SFA}} = 72 + 2 \times 12 = 96 $ (Eq. 6) and $ {\text{VALUE}} = 96 + 11 = 107 $ (Eq. 7).
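The retrieval steps of this worked example can be replayed with a short sketch. It assumes the two selected table rows are already known and models them as dictionaries. The MCT, IT and address values of the $ {\text{ST}}_{2} $ row are those used in the arithmetic above; the history value 2 and first address 48 of the $ {\text{ST}}_{1} $ row are inferred from the extension order and are not stated in the text. The name `locate` is hypothetical.

```python
def locate(x, st1_row, st2_row, sgz):
    """Return the address VALUE of cell x using Eqs. 4 to 7 of this section.

    sgz is the segment size of the selected SA: l2' when the SA lies on d1',
    l1' when it lies on d2'.
    """
    # Revised 2-D index from the stored multiplicative coefficients.
    x1p = sum(xi * c for xi, c in zip(x[0::2], st1_row["MCT"]))
    x2p = sum(xi * c for xi, c in zip(x[1::2], st2_row["MCT"]))

    # The row with the larger history gives H_max and hence the SA direction.
    on_d1 = st1_row["HT"] > st2_row["HT"]
    row, along, across = (st1_row, x1p, x2p) if on_d1 else (st2_row, x2p, x1p)

    sx = row["IT"] * max(row["MCT"])  # Eq. 4: start index of the SA
    sn = along - sx                   # Eq. 5: segment number inside the SA
    sfa = row["AT"] + sn * sgz        # Eq. 6: first address of that segment
    return sfa + across               # Eq. 7


ST1_1 = {"IT": 2, "EDT": 1, "HT": 2, "MCT": [4, 2, 1], "AT": 48}  # HT and AT inferred
ST2_2 = {"IT": 2, "EDT": 4, "HT": 3, "MCT": [1, 3], "AT": 72}

print(locate((2, 2, 1, 2, 1), ST1_1, ST2_2, sgz=12))  # 107
```

Passing the segment size explicitly keeps the sketch independent of how $ l_{1}^{\prime} $ and $ l_{2}^{\prime} $ are tracked across later extensions.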

4 Segment Based Encoding Scheme

In the 2D SES scheme, the index that follows the SA direction is called the major index and the other is called the minor index. The 2D Address Table (AT) is replaced by a 1D Segment Address Table (SAT) that holds the location of each segment of an SA. For an empty segment, it stores the negation of the next available non-empty memory position. The SES representation of Fig. 2(a), in which empty cells are shaded, is shown in Fig. 2(b). Each non-empty cell is replaced by its minor index and its value. Consider the third segment of the first SA in Fig. 2(a). The minor index is $ x_{2}^{\prime} $, so three tuples <0, 8>, <2, 10> and <3, 11> are stored by SES. The location of tuple <0, 8> is pointed to by SAT[0][2]. As the next segment is empty, its SAT entry stores −5. To encode a non-empty array cell, the R2I, H_max and SN are calculated (see Sect. 3). The segment’s first non-empty cell is located at SAT[H_max][SN], and a binary search on the minor index then finds the cell. For example, given the RnI $ \left( {x_{1} ,x_{2} ,x_{3} ,x_{4} ,x_{5} } \right) = \left( {2,2,0,2,0} \right) $, the R2I, H_max and SN are <8, 8>, 3 and 2 respectively. As SAT[H_max][SN] > 0, the segment is non-empty. The minor index $ x_{1}^{\prime} = 8 $ then shows that the value is 104.
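A lookup against the SAT can be sketched as below. The storage layout is an assumption made only for illustration: tuples are kept in one flat list with 1-based positions (so that a positive SAT entry always means a non-empty segment), a segment is taken to end where the next SAT entry begins, and the tuple at position 1 is invented to fill the space before the third segment. The three tuples of the third segment and the entry −5 come from the example above; the name `ses_lookup` is hypothetical.

```python
from bisect import bisect_left

# Hypothetical storage for the first SA (history 0) of the running example.
# Positions are 1-based so that a positive SAT entry always means "non-empty";
# the base is an assumption, as is the tuple at position 1.
CELLS = [None, (1, 1), (0, 8), (2, 10), (3, 11)]  # (minor index, value)
SAT_0 = [1, -2, 2, -5]                            # first four entries of SAT[0]


def ses_lookup(sat_row, cells, sn, minor):
    """Return the stored value for (segment sn, minor index), or None if empty."""
    start = sat_row[sn]
    if start < 0:
        return None  # empty segment: entry is the negated next available position
    # Assumed: a segment ends where the next SAT entry (by magnitude) begins.
    end = abs(sat_row[sn + 1]) if sn + 1 < len(sat_row) else len(cells)
    minors = [m for m, _ in cells[start:end]]
    k = bisect_left(minors, minor)  # binary search on the sorted minor indexes
    if k < len(minors) and minors[k] == minor:
        return cells[start + k][1]
    return None


print(ses_lookup(SAT_0, CELLS, 2, 2))  # 10
print(ses_lookup(SAT_0, CELLS, 3, 0))  # None
```

Storing the negated next position in empty entries lets the end of a non-empty segment be read from its neighbour without a separate length field, which is one plausible reading of the scheme.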

5 Performance Analysis

The analysis was carried out on an Intel(R) Xeon(R) E5620 @ 2.40 GHz machine with 8 processors, 32 GB RAM and 1406 MB cache memory. The program is written in C and the data size is 8 bytes. SES is compared with HSOEA [11] and EaCRS [12]. The cost of EaCRS is always higher than that of the others because it requires nD history and column information and (n−1)D row information for data encoding. Figure 3 illustrates the storage requirements. The SES and HSOEA schemes require two parameters to encode nD data. Figure 3(a) shows the storage requirement for varying dimensions and Fig. 3(b) shows it for varying data density $ \rho = \frac{\text{number of non-empty cells}}{\text{total number of cells}}, 0 \le \rho \le 1 $. Figure 4 shows the encoding costs. HSOEA requires nD history, 2D segment number and 2D offset. SES outperforms the others because it requires only 2D indices to encode nD data. The range of usability $ (\upsilon ) $ of an encoding scheme is the greatest $ \rho $ for which the compression ratio $ \eta = \frac{\text{size of compressed array}}{\text{size of uncompressed array}} $ remains below 1. SES and HSOEA reach a range of usability of $ \rho $ = 0.66, which is higher than that of the EaCRS scheme, as shown in Fig. 5. SES and HSOEA track each segment’s first address because the SA is divided into segments, whereas EaCRS does not offer segmentation. HSOEA needs nD indices and extra metrics beyond the fourth dimension, so the index overhead is lower in SES, as shown in Fig. 6.
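The relation between data density, compression ratio and range of usability can be illustrated with a back-of-the-envelope model. The model is an assumption, not the measured setup: each non-empty cell is taken to cost its 8-byte value plus one minor index whose width (4 bytes here) is hypothetical, and the SAT and supplementary tables are ignored. The function names are likewise hypothetical.

```python
def compression_ratio(rho, value_bytes=8, index_bytes=4):
    """Approximate eta for density rho: stored bytes over uncompressed bytes.

    Each non-empty cell stores a value plus one minor index; table overhead
    is ignored (assumption).
    """
    return rho * (value_bytes + index_bytes) / value_bytes


def range_of_usability(value_bytes=8, index_bytes=4):
    """Density at which eta reaches 1 under the same simplified model."""
    return value_bytes / (value_bytes + index_bytes)


print(compression_ratio(0.5))            # 0.75
print(round(range_of_usability(), 2))    # 0.67
```

Under these assumptions the break-even density is 2/3, which is close to the reported 0.66, although the text does not state the index width and the agreement may be coincidental.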

6 Conclusion

The volume of data that must be handled is expanding steadily. In the real world, however, the amount of effective data is very small because of sparsity. It is very challenging to deal with sparsity while keeping additional costs, such as those of data compression, low. Here, we present a 2D encoding scheme for nD arrays. We have shown that the proposed scheme can effectively encode arrays of up to 66% data density while reducing the indexing and encoding costs. It can be used in big data storage and in parallel or multiprocessor environments.

References

1. Reed, D.A., Dongarra, J.: Exascale computing and big data. Commun. ACM 58(7), 56–68 (2015). doi:10.1145/2699414
2. Rotem, D., Zhao, J.L.: Extendible arrays for statistical databases and OLAP applications. In: 8th International Conference on Scientific and Statistical Database Systems (SSDBM), pp. 108–117 (1996). doi:10.1109/SSDM.1996.506053
3. Otoo, E.J., Nimako, G., Ohene-Kwofie, D.: Chunked extendible dense arrays for scientific data storage. Parallel Comput. 39(12), 802–818 (2013). doi:10.1016/j.parco.2013.08.006
4. Omar, M.T., Azharul Hasan, K.M.: Towards an efficient maintenance of address space overflow for array based storage system. In: Proceedings of the 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (2016)
5. Hasan, K.M.A.: Compression schemes of high dimensional data for MOLAP. In: Evolving Application Domains of Data Warehousing and Mining: Trends and Solutions, Chap. 4, pp. 64–81 (2010). doi:10.4018/978-1-60566-816-1.ch004
6. Moshfeghi, Y., Piwowarski, B., Jose, J.M.: Handling data sparsity in collaborative filtering using emotion and semantic based features. In: SIGIR 2011, pp. 625–634 (2011). doi:10.1145/2009916.2010001
7. Saif, H., He, Y., Alani, H.: Alleviating data sparsity for Twitter sentiment analysis. In: 2nd Workshop on Making Sense of Microposts (#MSM2012): Big Things Come in Small Packages at the 21st International Conference on the World Wide Web (WWW 2012), 16 April 2012, Lyon, France, CEUR Workshop Proceedings, pp. 2–9 (2012). doi:10.1.1.309.6821
8. Hasan, K.M.A., Tsuji, T., Higuchi, K.: An efficient implementation for MOLAP basic data structure and its evaluation. In: Kotagiri, R., Krishna, P.Radha, Mohania, M., Nantajeewarawat, E. (eds.) DASFAA 2007. LNCS, vol. 4443, pp. 288–299. Springer, Heidelberg (2007). doi:10.1007/978-3-540-71703-4_26
9. Tsuchida, T., Tsuji, T., Higuchi, K.: Implementing vertical splitting for large scale multidimensional datasets and its evaluations. In: Cuzzocrea, A., Dayal, U. (eds.) DaWaK 2011. LNCS, vol. 6862, pp. 208–223. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23544-3_16
10. Tsuji, T., Amaki, K., Nishino, H., Higuchi, K.: History-offset implementation scheme of XML documents and its evaluations. In: 18th International Conference on Database Systems for Advanced Applications, pp. 315–330 (2013). doi:10.1007/978-3-642-37487-6_25
11. Sk, M., Masudul Ahsan, K.M., Hasan, A.: An efficient encoding scheme to handle the address space overflow for large multidimensional arrays. J. Comput. 8(5), 1136–1144 (2013). doi:10.4304/jcp.8.5.1136-1144
12. Islam, R., Hasan, K.M.A., Tsuji, T.: EaCRS: an extendible array based compression scheme for high dimensional data. In: 2nd Symposium on Information and Communication Technology (SoICT 2011), pp. 92–99 (2011). doi:10.1145/2069216.2069237

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh
Mehnuma Tabassum Omar & K. M. Azharul Hasan

Corresponding authors

Correspondence to Mehnuma Tabassum Omar or K. M. Azharul Hasan .

Editor information

Editors and Affiliations

Indian Statistical Institute, Kolkata, India
B. Uma Shankar
Indian Statistical Institute, Kolkata, India
Kuntal Ghosh
Indian Statistical Institute, Kolkata, India
Deba Prasad Mandal
Indian Statistical Institute, Kolkata, India
Shubhra Sankar Ray
The Hong Kong Polytechnic University, Hong Kong, China
David Zhang
Indian Statistical Institute, Kolkata, India
Sankar K. Pal

Omar, M.T., Azharul Hasan, K.M. (2017). An Efficient Encoding Scheme for Dynamic Multidimensional Datasets. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_66

DOI: https://doi.org/10.1007/978-3-319-69900-4_66
Published: 01 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4
eBook Packages: Computer Science, Computer Science (R0)
