1 Introduction

Commercial organizations make business decisions by analyzing their performance using historical business transactions data. The data is usually spread across multiple disparate databases, as offices of an organization are spread across many locations in the world. Since the decision making queries are analytical in nature, processing such queries is time consuming as the required data is spread across multiple disparate data sources. Processing these analytical queries requires the data to be retrieved from multiple data sources, either using the on-demand approach or the in-advance approach for querying (Widom 1995). Data warehousing is based on the later approach, where data is retrieved in advance from multiple disparate sources and accumulated in a central repository, referred to as a data warehouse, within the organization. The purpose of a data warehouse is to answer analytical queries in order to facilitate management decisions. A data warehouse storing time variant and non-volatile data is always available for querying; even when remote data sources are inaccessible (Inmon 2003; Kimball and Ross 2002).

A data warehouse is designed with the purpose of supporting decision making (Inmon 2003). Since decision making queries are long and complex, processing these queries against a data warehouse is time consuming. Data warehouse addresses this problem by pre-computing and storing the relevant and the required data in it. This data is stored in the form of materialized views (Roussopoulos 1997). For an n-dimensional data set, the possible number of views that can be stored is 2n. Though materializing all possible views would result in the fastest query response time, it becomes infeasible to store all the possible views, when the value of N is high, due to storae space constraints. Thus, for higher values of N, subsets of views are to be selected that minimize the analytical query response time subject to the resource constraints (Chirkova et al. 2001; Yousri et al. 2005). This is referred to as the view selection, which is a constrained combinatorial optimization NP-Complete problem (Karloff and Mihail 1999).

1.1 View selection

View selection is key to the effective and efficient design of a data warehouse. Data warehouse are designed to answer decision making queries. These queries take a long time for processing when posed against a continuously growing data warehouse. A mechanism that would enable answering these queries efficiently needs to be devised. One such mechanism is the selection of a subset of views, which have the potential for answering most future queries, and materialize them so that most future analytical queries are answered using these materialized views. These materialized views, which conform to the storage space constraint, are considerably smaller when compared to a data warehouse and, thus, would process the analytical queries in an efficient manner resulting in reduced response times. The challenge lies in selecting such materialized views. These cannot be selected arbitrarily, as the set of views may become redundant and may not have the potential of answering analytical queries posed in future thereby making them an unnecessary space overhead. The objective should be to select those views that have the potential of answering future analytical queries. View selection aims to select views that reduce the response times for analytical queries while conforming to the limited resources available for materialization (Chirkova et al. 2001). Several view selection algorithms exist, most of these are either empirical (Agrawal et al. 2000; Aouiche and Darmont 2009; Baralis et al. 1997; Golfarelli and Rizzi 2000; Lehner et al. 1996; Li et al. 2005; Lin et al. 2007; Luo 2007; Teschke and Ulbrich 1997; Vijay Kumar et al. 2010b; Vijay Kumar and Devi 2012, 2013) or based on heuristic (Encinas and Montano 2007; Gupta and Mumick 2005; Gupta et al. 1997; Haider and Vijay Kumar 2011, 2017; Harinarayan et al. 1996; Shah et al. 2006; Valluri et al. 2002; Vijay Kumar 2013; Vijay Kumar and Ghoshal 2009; Vijay Kumar et al. 2010a, 2011; Vijay Kumar and Haider 2010, 2011a, b, 2012, 2015).

Several view selection frameworks, like lattice (Harinarayan et al. 1996), Cube (Shukla et al. 1998), AND-OR view graph (Gupta 1997), Multi-View Processing Plan (Yang et al. 1997), exist for which the view selection problem has been studied. In this paper, a Multidimensional lattice framework is considered for view selection.

In a multi-dimensional lattice framework (Harinarayan et al. 1996), the number of views in the lattice are exponential in regard to the number of dimensions in a star schema from which the lattice have been arrived at. That is, for a star schema comprising a fact table surrounded by n dimensional tables, the corresponding multidimensional lattice would comprise 2n views. Consider a 3-dimensional lattice shown in Fig. 1, with dimensions P, Q and R, wherein the nodes represent the views grouped by the respective dimensions, the view index is give alongside the view name and the size of the views is given alongside the view node. All the views in the lattice depend on the root view PQR, which is based on the grouping of all the dimensions P, Q and R. That is, queries on any of the views in the lattice can be answered PQR. Further, queries on a view can be answered by a view, above it, directly or indirectly connected through edge(s), i.e. there is a direct or indirect dependency between them.

Fig. 1
figure 1

3-Dimensional lattice along with View Size

For higher dimensions, the number of views in the multidimensional lattice varies exponentially in terms of the number of dimensions and, thus, it becomes infeasible to select an optimal set of views that complies with the storage space constraint. Alternately, a relevant subset of views can be selected from the multidimensional lattice. This being an NP-Complete problem (Karloff and Mihail 1999) can be appropriately addressed using metaheuristic algorithms that aim to achieve an optimal trade-off between exploration and exploitation. Several metaheuristic view selection algorithms exist, most of which are randomized (Kalnis et al. 2002; Lee and Hammer 2001; Theodoratos et al. 2001; Vijay Kumar and Kumar 2012a, c, 2015), evolutionary based (Horng et al. 1999; Kumar and Vijay Kumar 2018b; Lin and Kuo 2004; Vijay Kumar and Kumar 2012b, 2013, 2014; Wang and Zhang 2005; Yu et al. 2003; Zhang et al. 2001; Zhou et al. 2012) or swarm based (Arun and Vijay Kumar 2015a, b, 2017a, b; Sun and Wang 2009; Vijay Kumar and Arun 2016, 2017; Kumar and Vijay Kumar 2017, 2018a). In (Vijay Kumar and Kumar 2012a, 2012b, 2012c), the materialized view selection (MVS) problem is formulated as a single objective optimization problem with the objective to minimize the total view evaluation cost (TVEC). However, this problem was expressed as a bi-objective MVS problem in (Prakash and Vijay Kumar 2019a) and was addressed using VEGA in (Prakash and Vijay Kumar 2019a), MOGA in (Prakash and Vijay Kumar 2020) and SPEA-2 in (Prakash and Vijay Kumar 2019b). Among these, the SPEA-2 based MVS algorithm performed comparatively better. In this paper, a Non-dominated Sorting Genetic Algorithm (NSGA-II) (Deb et al. 2002) has been used to address this bi-objective optimization problem.

1.2 Organization of the paper

This paper is organized as follows: Sect. 2 discusses MVS using NSGA-II followed by an illustrative example in Sect. 3. Experimental results are given in Sect. 4. The conclusions are given in Sect. 5.

2 MVS using NSGA-II

The MVS problem was addressed in (Vijay Kumar and Kumar, 2012a, 2012b, 2012c) by minimizing the TVEC, which is defined below (Vijay Kumar and Kumar, 2012a, b, c):

$$ TVEC(V_{Top - K} ) = \sum\limits_{{i = 1 \wedge SM_{{V_{i} }} = 1}}^{N} {Size(V_{i} )} + \sum\limits_{{i = 1 \wedge SM_{{V_{i} }} = 0}}^{N} {SizeSMA(V_{i} )} $$

where, \({\text{Size}}\left({\text{V}}_{\text{i}}\right)\) and \({\text{SizeSMA}}\left({\text{V}}_{\text{i}}\right)\) are the size of the ith view Vi and the size of the smallest materialized ancestor of the ith view Vi respectively, \({\text{SM}}_{{\text{v}}_{\text{i}}}\) is Materialized Status of the ith view Vi, having value 1, if materialized, and 0, if not materialized and N is the number of views in the lattice.

The above MVS problem was expressed, as a bi-objective optimization problem with the objectives being (Prakash and Vijay Kumar 2019a):

$$ MinimizeC_{MV} = \sum\limits_{{i = 1 \wedge SM_{{V_{i} }} }}^{N} {Size(V_{i} )} $$
$$ MinimizeC_{NMV} = \sum\limits_{{i = 0 \wedge SM_{{V_{i} }} }}^{N} {Size(V_{i} )} $$

Further, since these costs were shown to conflict with each other, as decrease in CMV would result in an increase in the CNMV and vice versa, Pareto based multi-objective evolutionary algorithms like MOGA in (Prakash and Vijay Kumar 2020) and SPEA-2 in (Prakash and Vijay Kumar 2019b) have already been used to solve this problem. In this paper, Pareto based technique NSGA-II (Deb et al. 2002) has been adapted to address this bi-objective MVS problem. NSGA-II is briefly discussed next.

2.1 NSGA-II

Srinivas and Deb proposed a first non-dominated sorting genetic algorithm (NSGA) for multi-objective optimization in (Srinivas and Deb 1994). NSGA had limitations like high computational complexity and the aprior specification of the sharing parameter (Deb et al. 2002; Deb 2014). In 2002, Deb et al. proposed “a fast and elitist multi-objective genetic algorithm: NSGA-II.” The NSGA-II used a fast non-dominated sorting and diversity preserving mechanism along with elite preservation, which addressed the limitations of NSGA (Deb et al. 2002). NSGA-II uses an elitist based sorting algorithm, which does not require the sharing parameter to be chosen aprior. NSGA-II is one of the most popular and widely used multi-objective evolutionary algorithms for multi-criteria optimization. It generates non-dominated and diversely spread solutions on the Pareto optimal front using rank and crowding distance (Deb et al. 2002). NSGA-II has been shown to generate better trade-off solutions between the conflicting objectives of a multi-objective optimization problem. In this paper, the bi-objective MVS problem has been addressed using NSGA-II. Accordingly, an NSGA-II based MVS algorithm (MVSNSGA-II) is proposed and is discussed next.

2.2 Algorithm MVSNSGA-II

The algorithm MVSNSGA-II takes a lattice of views L, size of views in L, the number of the Top-K views in the population N, the number of objective functions |F|, crossover probability CP and mutation probability MP, as input, and outputs the Top-K view vectors in the first Pareto optimal front. The algorithm MVSNSGA-II is given in Fig. 2.

Fig. 2
figure 2

Algorithm MVSNSGA-II

The Top-K views in the population, which are denoted as a set of K view indexes in L, is hereafter referred to as Top-K view-vector (Top-KVVs). Each element of a Top-KVV is a view index in L. An example of Top-KVV of 5 indexes (K = 5) is given below:

figure a

First, the population, PTKV of Top-KVVs is randomly generated, in which each Top-KVV is a permutation of K distinct indexes from the lattice L and N is the number of Top-KVVs in the population PTKV. Next, CMV and CNMV for each of the Top-KVVs in PTKV is computed. Using these computed values, a non-dominated sorting (Deb et al. 2002) is performed on the Top-KVVs in the population PTKV. For each Top-KVV TKV, a set containing the number of Top-KVVs dominated by it, i.e. TKVD, and the number of Top-KVVs dominating it, i.e. DTKV, is computed. Initially, TKVD and DTKV for the Top-KVV TKV are assigned an empty set and zero respectively. Thereafter, for each Top-KVV TKV’ in PTKV, CMV and CNMV values of TKV' and TKV are compared. If TKV dominates TKV', TKV´ is added to TKVD. Otherwise, increment the value of DTKV. After computation of TKVD and DTKV for each Top-KVV in PTKV, the Top-KVVs having a DTKV value equal to ‘0′ are assigned rank ‘one’ and are added to the first Pareto front. Next, DTKV’s values of Top-KVVs TKVD in the first Pareto front is decremented by ‘1′. Thereafter, the Top-KVVs having DTKV’s value as ‘zero’ are assigned rank ‘two’ and are added to the second Pareto front. For the remaining Top-KVVs, ranks and Pareto fronts are computed in a similar manner. Next, the crowding distance CD of each Top-KVV is computed. The Top-KVVs in each Pareto front are sorted based on the CMV and CNMVvalues. The crowding distance (CD) for each Top-KVV in the boundary is assigned a value ‘∞’. The crowding distance (CD), for the remaining Top-KVVs is computed as given below:

$$ CD(TKV_{j}) = CD(TKV_{j} ) + \frac{{(TKV_{j + 1} )_{F} - (TKV_{j - 1} )_{F} }}{\max (F) - \min (F)} $$

where CD(TKVj) is the crowding distance of the jth Top-K view-vector, (TKVj)F is the Fth objective function value for the jth Top-KVVs, max(F) and min(F) are the maximum and minimum values respectively on the Fth objective in the corresponding Pareto front.

Next, the mating population of Top-KVVs is selected from PTKV using the crowded comparison operator (Deb et al. 2002) method. In this method, from amongst two Top-KVVs, the higher ranked Top-KVV is selected. In case both the Top-KVVs have the same rank, the Top-KVV with a higher CD value is selected. Next, modified crossover (Davis 1985) and random mutation (Goldberg 1989) with probability CP and MP respectively are performed on the Top-KVVs in the mating population to produce the offspring population of Top-KVVs QTKV. Thereafter, the two populations PTKV and QTKV are combined to form an intermediate population ITKV of size 2 N. Non-dominated sorting is performed on the Top-KVVs in ITKV based on their ranks and within ranks on their CDs. Finally, the top N Top-KVVs are selected to form the new population PTKV.

Step 2 to Step 8 is executed for MaxG generations and thereafter, Top-KVVs in the first Pareto front are produced, as output.

An example illustrating the computation of non-dominated Top-KVVs from a three dimensional lattice of Fig. 1 using the algorithm MVSNSGA-II is given next.

3 An example

Suppose Top-4 views are to be selected from the three dimensional lattice shown in Fig. 1. The randomly generated initial population PTKV of eight Top-4VVs is given in Fig. 3.

Fig. 3
figure 3

Initial population PT4V

For each Top-4VV in the population, the CMV and CNMV values are computed. As an example, the computation of an objective fitness value for T4V1, i.e.[4, 2, 5, 6], in the population PT4V is given below:

$$ \begin{aligned} C_{MV} = & \sum\limits_{{i = 1 \wedge SM_{{V_{i} }} = 1}}^{N} {Size(V_{i} )} \\ = & \sum\limits_{{i = 1 \wedge SM_{{V_{i} }} = 1}}^{8} {Size(V_{i} ) = } Size(V_{1} ) + Size(V_{4} ) + Size(V_{2} ) + Size(V_{5} ) + Size(V_{6} ) \\ = & 54 + 50 + 48 + 23 + 19 \\ = & 194 \\ \end{aligned} $$
$$ \begin{aligned} C_{NMV} = & \sum\limits_{{i = 1 \wedge SM_{{V_{i} }} = 0}}^{N} {SizeSMA(V_{i} )} \\ = & \sum\limits_{{i = 1 \wedge \frac{n!}{{r!\left( {n - r} \right)!}}SM_{{V_{i} }} = 0}}^{8} {SizeSMA(V_{i} )} = SizeSMA(V_{3} ) + SizeSMA(V_{7} ) + SizeSMA(V_{8} ) \\ = & Size(V_{1} ) + Size(V_{4} ) + Size(V_{6} ) \\ = & 54 + 50 + 19 \\ = & 123 \\ \end{aligned} $$

In a similar manner, the values of CMV and CNMV are computed for the remaining Top-4VVs in PT4V and is given in Fig. 4.

Fig. 4
figure 4

CMV, CNMV, T4VD, DT4V and Pareto Front of T4V in PT4V

Next, non-dominated sorting is performed on the Top-4VVs in PT4V. The set T4VDi, containing Top-4VVs dominated by the ith Top-4VV T4Vi, and DT4Vi, which is the number of Top-4VVs that dominates the ith Top-4VV T4Vi, is computed. These values are given in Fig. 4.

The Top-4VVs T4V2, T4V4, and T4V8 are not dominated by any other view-vectors thus DT4V2 = DT4V4 = DT4V8 = 0. Therefore, these Top-4VVs are assigned rank one and become the members of the first Pareto front F1. Next, for computing the Top-4VVs to be assigned to the second Pareto front, the value of DT4Vi of the remaining Top-4VVs, i.e., T4V1, T4V3, T4V5, T4V6, and T4V7, which are not in the first Pareto front, is decremented by one. After decrementing DT4Vi, the value of DT4Vi becomes zero for the Top-4VVs T4V5, T4V6, and T4V7. These Top-4VVs are assigned rank two and become the members of the second Pareto front F2. In a similar manner, other non-dominated Pareto fronts are computed and are given in Fig. 4.

The Top-4VVs of each Pareto front are sorted on CMV and CNMV. The sorted Top-4VVs Pareto front wise for CMV and CNMV are given in Fig. 5. In each Pareto front, the boundary Top-4VVs, i.e. those having the minimum and maximum values in the Pareto front, are assigned a crowding distance (CD) value ‘∞’. That is, T4V1, T4V2, T4V3, T4V6, T4V7and T4V8 are assigned a value ‘∞’. The CD values are computed for the remaining Top-4VVs using the sorted order of the Top-4VVs in the respective front.

Fig. 5
figure 5

T4V in PT4V sorted frontwise on CMV and CNMV

The computations of crowding distance (CD) for Top-4VVs T4V4 is given below:

Initially the CD value for each Top-4VV T4V4 is set to zero, i.e. CD [T4V4] = 0.

$$ CD[T4V_{4} ] = CD[T4V_{4} ] + \frac{{C_{MV} [T4V_{8} ] - C_{MV} [T4V_{2} ]}}{{\max [C_{MV} ] - \min [C_{MV} ]}} = 0 + \frac{215 - 185}{{235 - 185}} = \frac{30}{{50}} = 0.60 $$
$$ CD[T4V_{4} ] = CD[T4V_{4} ] + \frac{{C_{MV} [T4V_{2} ] - C_{MV} [T4V_{8} ]}}{{\max [C_{MV} ] - \min [C_{MV} ]}} = 0.60 + \frac{127 - 107}{{131 - 107}} = 0.60 + 0.83 = 1.43 $$

In a similar manner, the CD value of T4V5 is computed as 0.83 and is given in Fig. 6.

Fig. 6
figure 6

CD values of T4V in PT4V

The crowded comparison operator is used to select the mating pool of Top-4VVs from PT4V. This selection is given in Fig. 7.

Fig. 7
figure 7

Mating Pool selection using Crowded Comparison Operator

The offspring population QTKV of Top-4VVs is computed by applying a single point modified crossover, with CP = 0.8, and random mutation, with MP = 0.1, on the mating pool of Top-4VVs and is given in Fig. 8.

Fig. 8
figure 8

Modified Cyclic Crossover and Random Mutation operation on the Selected T4V

Next, Top-4VVs in PT4V is combined with the offspring population of the Top-4VVs in QT4V to compute an intermediate population of Top-4VVs IT4V. Thereafter, non-dominated sorting is performedon the Top-4VVs in IT4V to compute PT4V for the next generation. The Top-4VVs in the first Pareto front and the population PT4V for next generation, after completion of the first iteration, is given in Fig. 9.

Fig. 9
figure 9

Population PT4V' for the next Generation

Experimental based comparison of MVSNSGA-II with SPEA-2 based MVS algorithm (MVSSPEA-2)(Prakash and Vijay Kumar 2019b) is discussed next.

4 Experimental results

The algorithm MVSNSGA-II and MVSSPEA-2 were implemented using MATLAB 8.0 in a Windows 7 platform on a PC with Intel Core i5 processor having 4 GB RAM. First, graphs were plotted to ascertain the crossover (CP) and mutation (MP) probabilities for which MVSNSGA-II is able select the Top-10 views with comparatively lower TVEC. The four combinations of (Pc, Pm) considered are (CP = 0.6, MP = 0.05), (CP = 0.6, MP = 0.10), (CP = 0.8, MP = 0.05) and (CP = 0.8, MP = 0.10). The graphs depicting the TVEC versus generations for selecting Top-10 views for dimensions 5 to 10 were plotted and are shown in Figs. 10, 11, 12, 13, 14 and 15 respectively. After observing these graphs, It can be said that MVSNSGA-II selects better quality views, having a comparatively lower TVEC, for the combination (CP = 0.8, MP = 0.10) for all dimensions. Therefore, CP = 0.8 and MP = 0.10 are considered for MVSNSGA-II while comparing it with algorithm MVSMOGA.

Fig. 10
figure 10

MVSNSGA-IITVEC versus Generations for 5 dimensions

Fig. 11
figure 11

MVSNSGA-II − TVEC versus Generations for 6 dimensions

Fig. 12
figure 12

MVSNSGA-II − TVEC versus Generations for 7 dimensions

Fig. 13
figure 13

MVSNSGA-II − TVEC versus Generations for 8 dimensions

Fig. 14
figure 14

MVSNSGA-II − TVEC versus Generations for 9 dimensions

Fig. 15
figure 15

MVSNSGA-II − TVEC versus Generations for 10 dimensions

Next, MVSNSGA-II(CP = 0.8, MP = 0.10) was compared with MVSSPEA-2 (CP = 0.8, MP = 0.10) (Prakash & Vijay Kumar, 2019b) in terms of the CMV and the CNMV of the Top-10 views in the first Pareto front, selected by MVSNSGA-II and MVSSPEA-2 after 200 generations. The corresponding graphs for dimensions 5 to 10 are shown in Figs. 16, 17, 18, 19, 20 and 21 respectively. After observing these graphs, it can be said that MVSNSGA-II was able to select almost similar number of non-dominated Top-10 views having a comparatively better spread on the Pareto optimal front. This was further substantiated by comparing the trade-off between CMV and CNMV of the Top-10 views selected by the two algorithms using C-measure (Zitzler et al. 1999, 2000, 2003). The values of the C-measure for the selection of the Top-10 views by MVSNSGA-II and MVSSPEA-2, for dimensions 5 to 10, is given in Fig. 22. It can be observed that, for all dimensions, the C(MVSNSGA-II, MVSSPEA-2) has a comparatively higher value than C(MVSSPEA-2, MVSNSGA-II) thereby implying that the Top-10 views selected by MVSNSGA-II have a comparatively wider spread than those selected using MVSSPEA-2. As a result, it can be inferred that MVSNSGA-II selects views that achieve a comparatively better trade-off between CMV and CNMV.

Fig. 16
figure 16

MVSNSGA-II vs. MVSSPEA-2 − CMV versus CNMV (after 200 Generations) for Dimension = 5

Fig. 17
figure 17

MVSNSGA-II vs. MVSSPEA-2 − CMV versus CNMV (after 200 Generations) for Dimension = 6

Fig. 18
figure 18

MVSNSGA-II vs. MVSSPEA-2 − CMV versus CNMV (after 200 Generations) for Dimension = 7

Fig. 19
figure 19

MVSNSGA-II vs. MVSSPEA-2 − CMV versus CNMV (after 200 Generations) for Dimension = 8

Fig. 20
figure 20

MVSNSGA-II vs. MVSSPEA-2 − MV versus CNMV (after 200 Generations) for Dimension = 9

Fig. 21
figure 21

MVSNSGA-II vs. MVSSPEA-2 − CMV versus CNMV (after 200 Generations) for Dimension = 10

Fig. 22
figure 22

Comparison of MVSNSGA-II and MVSSPEA-2 using C-measure

5 Conclusions

This paper focusses on the use of Pareto based Multi-objective genetic algorithm NSGA-II to solve the bi-objective MVS problem. Accordingly, MVSNSGA-II that minimizes the two costs CMV and CNMV simultaneously to select the Top-K views, from a multi-dimensional lattice, for materialization is proposed. Experimental based comparisons of MVSNSGA-II is carried out with the existing SPEA-2 based materialized view selection algorithm MVSSPEA-2 in order to compare the quality, with respect to CMV and CNMV, of the non-dominated Top-K views selected by them. The experimental results show that the proposed algorithm MVSNSGA-II was able to select the Top-K views that achieved a comparatively better trade-off between CMV and CNMV. Further, it was observed that the Top-K views selected by MVSNSGA-II had a comparatively better spread on the Pareto optimal front. Thus, it can be stated that the proposed algorithm MVSNSGA-II is able to select comparatively better quality Top-K views. These selected views, if materialized, would minimize the analytical queries response time thereby resulting in effectual decision-making.