Reconciling requirement-driven data warehouses with data sources via multidimensional normal forms

https://doi.org/10.1016/j.datak.2007.04.004Get rights and content

Abstract

Successful data warehouse (DW) design needs to be based upon a requirement analysis phase in order to adequately represent the information needs of DW users. Moreover, since the DW integrates the information provided by data sources, it is also crucial to take these sources into account throughout the development process to obtain a consistent reconciliation of data sources and information needs. In this paper, we start by summarizing our approach to specify user requirements for data warehouses and to obtain a conceptual multidimensional model capturing these requirements. Then, we make use of the multidimensional normal forms to define a set of Query/View/Transformation (QVT) relations to assure that the conceptual multidimensional model obtained from user requirements agrees with the available data sources that will populate the DW. Thus, we propose a hybrid approach to develop DWs, i.e., we firstly obtain the conceptual multidimensional model of the DW from user requirements and then we verify and enforce its correctness against data sources by using a set of QVT relations based on multidimensional normal forms. Finally, we provide some snapshots of the CASE tool we have used to implement our QVT relations.

Introduction

A data warehouse (DW) is commonly described as an integrated collection of historical data in support of decision making that structures information into facts and dimensions based on multidimensional (MD) modeling [8], [9]. Since the DW integrates several data sources, the development of the underlying MD model has been traditionally guided by an analysis of these data sources [8]. In fact, several approaches have been defined in order to derive a conceptual MD model from given data sources [6], [3], [7]. Considering these data-driven approaches, MNFs (multidimensional normal forms) have been developed [11] to reason, in a rigorous manner, about the quality (faithfulness, completeness, avoidance of redundancies, summarizability) of a conceptual MD model derived from operational data sources.

Nevertheless, in these data-driven approaches the requirement analysis phase is overlooked, thus resulting in an MD model in which the user needs and expectations may not be satisfied [33]. To overcome this problem, several approaches [33], [27], [18], [5] advocate a requirement-driven DW design process. However, hardly any of these approaches formally considers the data sources in the early stages of the development. Therefore, the correctness of the MD model with respect to the data sources cannot be assured and the DW repository cannot be properly populated from these data sources.

In order to reconcile these two points of view (data-driven and requirement-driven), a Model Driven Architecture (MDA) [23] framework for the development of DWs has been described in [17], [15]. Within this approach a conceptual MD model of the DW repository is developed from user requirements. This initial MD model must be then reconciled with the data sources in order to assure its correctness, which is the major contribution of the present paper. The following step is to apply model transformations to obtain several logical models as a base for the implementation of the DW repository. Within this development framework, up to now we have defined (i) a framework to specify user requirements for DWs [18], (ii) a UML (Unified Modeling Language) [26] profile that contains the necessary stereotypes in order to carry out the conceptual MD modeling successfully [13], (iii) a set of guidelines for deriving a conceptual MD model from user requirements [18], and (iv) a set of QVT (Query/View/Transformation) relations in order to automatically derive several logical representations of the conceptual MD model [17], [14], [15]. An overview of this overall framework, including the reconciliation of data sources and MD conceptual model by means of MNFs, is shown in Fig. 1.

In the short version of this paper [16], we presented a set of QVT relations in order to check the correctness of the conceptual MD model against the available data sources within our DW development framework (see Fig. 1). These QVT relations are based on MNFs proposed in [11]. The QVT language allows us to easily integrate this approach in our MDA framework [17], [15], while MNFs enable us to formalize the relationship between the data sources and the conceptual MD model of the DW repository.

In this paper, we extend the short version [16] as follows: (i) We provide a new section describing our approach for specifying user requirements for DWs. (ii) We now provide the complete set of the QVT relations in order to check the correctness of the conceptual MD model against the available data sources. (iii) We include a new aspect concerning on how to apply these QVT relations. (iv) We extend the case study, with more details and figures that clarify how to use our approach in real world scenarios. Finally, (v) we provide a new section explaining how we have implemented our QVT relations in a CASE tool in order to facilitate its application in real world case studies.

The motivation of our approach is as follows: since the DW integrates the information provided by source databases, it is important to check (in early stages of the development) if the requirement-driven conceptual MD model agrees with the available data sources in order to assure that (i) the DW repository will be properly populated from data sources, (ii) the analysis potential provided by the data sources is captured by the conceptual MD model, (iii) redundancies are avoided, and (iv) optional dimension levels, i.e., levels allowing NULL values, are controlled via specialization/generalization to enable context-sensitive summarizability and to avoid inconsistent queries.

To illustrate these benefits, consider the following running example, which is inspired by the sample scenario of [10]. We assume that the conceptual MD model for the banking domain shown in Fig. 2 has been derived from user requirements without taking data sources into account, e.g., according to the proposal presented in [18], which is briefly described in Section 3.1. The notation of Fig. 2 is based on our UML profile for MD modeling presented in [13] (see Section 4.2 for details). The figure models account facts which are composed of several measures (balance, turnover, interest, and customer age) and described by dimensions organization, product, time, and customer. For the sake of clarity, we only focus on the customer dimension.

Every customer is described in terms of a unique identification number, a name, and a date of birth. Every customer lives in a city which is described with a name and a population. Moreover, customers may be associated with job, gender, industry branch, and contact person. Finally, a city belongs to (Rolls-upTo) exactly one region and exactly one district, while a region belongs to exactly one state. This model represents a geographical classification where every region falls into exactly one state, while districts and states appear to be unrelated. From a conceptual perspective, this classification seems reasonable.

However, the data sources shown in Fig. 3 (using part of the relational CWM metamodel explained in Section 4.1) provide geographical information where every district falls into exactly one state, while regions and states are unrelated. Consequently, (i) the source information concerning regions and states cannot be represented faithfully under the MD model and (ii) potential for roll-up queries from level district to level state is not represented, i.e., analysis potential is lost.

Moreover, the MD model does not represent the structural information that industry branches and contact persons are assigned only to company customers while job and gender are only applicable to private customers, which poses challenges for summarizability and complicates querying (see [11], [12]). Finally, while it certainly makes sense to analyze the age structure of customers, the measure age is not specific to accounts but only to customers. Thus, this measure should be moved to a different fact schema. To summarize, based on schema information for the data sources, the conceptual MD model shown in Fig. 2 should be improved in a number of ways to obtain the “better” model shown in Fig. 4. Indeed, in this paper we show how to apply QVT relations, which are derived from MNFs, to obtain the model shown in Fig. 4 from the model shown in Fig. 2 by taking data sources shown in Fig. 3 into account.

The remainder of this paper is structured as follows: Related work is put into perspective next, before necessary background concerning requirement analysis in DWs, QVT and MNFs is collected in Section 3. Our approach is presented in Section 4 by describing our data source model as well as our conceptual MD model, and defining QVT relations based on MNFs. The application of sample QVT relations, as well as their implementation and execution is illustrated in Section 5. The paper ends with conclusions and suggestions for future work in Section 6.

Section snippets

Related work

In this section, we briefly describe the most relevant approaches for both data-driven and requirement-driven DW development.

Concerning data-driven approaches, in [3], the authors present the Multidimensional Model, a logical model for MD databases. The authors also propose a general design method, aimed at building an MD schema starting from an operational database described by an Entity-Relationship (ER) schema.

In [6], the authors propose the Dimensional-Fact Model (DFM), a particular

Background

In this section, we first outline our approach for (i) defining information requirements in DWs, and (ii) obtaining a conceptual MD model from them. Afterwards, we provide a brief overview of the building blocks of our approach, namely QVT and multidimensional normal forms.

Checking correctness of the conceptual MD model

In this section, we present our approach to check the correctness of a conceptual MD model with respect to the source databases. To this end, we present a set of QVT relations based on MNFs and obtain their inherent desirable design objectives: The resulting conceptual MD model faithfully represents the data sources and captures their analysis potential completely, redundancies are avoided, and NULL values are controlled to allow context-sensitive summarizability and avoid contradictory

Case study

In this section, we show how our QVT relations are properly applied. Furthermore, we explain how to implement and execute these defined QVT relations.

Conclusions and future work

In this paper, we have presented an approach to assure the correctness of a conceptual MD model of the DW repository according to the data sources that will populate this repository. This approach is outlined as follows: We first obtain the conceptual MD model of the DW from user requirements and then we verify and enforce its correctness against data sources by using a set of QVT relations based on MNFs. By using MNFs, we can assure that the conceptual MD model satisfies certain desirable

Acknowledgements

We thank all the reviewers of this paper for their detailed and useful comments. This work has been partially supported by the METASIGN (TIN2004-00779) project from the Spanish Ministry of Education and Science, and by the DADS (PBC-05-012-2) project from the Castilla-La Mancha Ministry of Education and Science (Spain). Jose-Norberto Mazón is funded by the Spanish Ministry of Education and Science under a FPU Grant (AP2005-1360).

Jose-Norberto Mazón is a Ph.D. student from the University of Alicante (Spain). He currently enjoys a research grant from the Spanish Ministry of Education and Science. He received a Master in Computer Science in 2004 from the University of Alicante. He has published several papers about data warehouses in national and international workshops and conferences, such as DAWAK, ER, DOLAP, BNCOD, JISBD and so on. His research interests are: database modeling, conceptual design of data warehouses,

References (36)

  • J. Lechtenbörger et al.

    Multidimensional normal forms for data warehouse design

    Inf. Syst.

    (2003)
  • S. Luján-Mora et al.

    A UML profile for multidimensional modeling in data warehouses

    Data Knowl. Eng.

    (2006)
  • P.A. Bernstein

    Synthesizing third normal form relations from functional dependencies

    ACM Trans. Database Syst.

    (1976)
  • Borland Together,...
  • L. Cabibbo et al.

    A logical approach to multidimensional databases

  • Eclipse,...
  • P. Giorgini, S. Rizzi, M. Garzetti, Goal-oriented requirement analysis for data warehouse design, in: DOLAP,...
  • M. Golfarelli et al.

    The dimensional fact model: a conceptual model for data warehouses

    Int. J. Cooper. Inf. Syst.

    (1998)
  • B. Hüsemann, J. Lechtenbörger, G. Vossen, Conceptual data warehouse modeling, in: M.A. Jeusfeld, H. Shu, M. Staudt, G....
  • W. Inmon

    Building the Data Warehouse

    (2002)
  • R. Kimball et al.

    The Data Warehouse Toolkit

    (2002)
  • J. Lechtenbörger, Data warehouse schema design, Ph.D. Thesis, University of Münster, Germany,...
  • W. Lehner et al.

    Normal forms for multidimensional databases

  • J.-N. Mazón et al.

    Applying transformations to model driven data warehouses

  • J.-N. Mazón, J. Trujillo, An MDA approach for the development of data warehouses, Decis. Support Syst., in press,...
  • J.-N. Mazón et al.

    A set of QVT relations to assure the correctness of data warehouses by using multidimensional normal forms

  • J.-N. Mazón, J. Trujillo, M. Serrano, M. Piattini, Applying MDA to the development of data warehouses, in: DOLAP,...
  • J.-N. Mazón et al.

    Designing data warehouses: from business requirement analysis to multidimensional modeling

  • Cited by (79)

    • Cost-benefit analysis of data warehouse design methodologies

      2017, Information Systems
      Citation Excerpt :

      For this reason, the methodologies compared here present a complexity value higher than that measured using traditional, simple methodologies. In particular, the methodologies chosen for the case study are: (a) Graph-based Hybrid Multidimensional Model−GrHyMM [4], (b) UML for data warehouse−UMLDW [10], (c) Multidimensional Design By Example−MDBE [3], (d) Phipps & Davis Methodology−PDM [6], (e) Goal-oriented Requirement Analysis for Data Warehouse Design−GRAnD [8], and (f) Goal/Question/Metric-based Methodology−GQM [9]. Since the methodologies solely differ from the early steps to multidimensional modeling and, moreover, the logical and the physical designs can be considered as a common, standard step, we only provide in Fig. 2 the different steps.

    • Prioritising data items for business analytics: Framework and application to human resources

      2016, European Journal of Operational Research
      Citation Excerpt :

      Giorgini, Rizzi, and Garzetti (2008), Maté, Trujillo, and Franch (2014) and Mazón and Trujillo (2009) applied the widely used i* system modelling language to structure the actions of organisation's stakeholders (as data suppliers) and the goal-oriented decision making of managers (as users in demand of data), which allows gaining an overview of potentially useful data items for the BI system. Maté and Trujillo (2012) and Mazón, Trujillo, and Lechtenbörger (2007) also aimed at linking the data supply from existing databases with data requirements expressed by end users by employing advanced modelling languages. All mentioned models are concerned with (i) exploring data items that are potentially useful for the BI system and (ii) the structuring of the data items from a BI system architecture viewpoint.

    • Enrichment of the phenotypic and genotypic Data Warehouse analysis using Question Answering systems to facilitate the decision making process in cereal breeding programs

      2015, Ecological Informatics
      Citation Excerpt :

      Afterwards, the data warehouse is built in order to support queries from the presented approach. Therefore, we can ensure that the query posed on the DW node will return the correct data required by the decision maker (Mazón and Trujillo, 2008; Mazón et al., 2007). The paper is structured as follows.

    View all citing articles on Scopus

    Jose-Norberto Mazón is a Ph.D. student from the University of Alicante (Spain). He currently enjoys a research grant from the Spanish Ministry of Education and Science. He received a Master in Computer Science in 2004 from the University of Alicante. He has published several papers about data warehouses in national and international workshops and conferences, such as DAWAK, ER, DOLAP, BNCOD, JISBD and so on. His research interests are: database modeling, conceptual design of data warehouses, multidimensional databases, and model driven development. Contact him at [email protected].

    Juan Trujillo is an associated professor at the Computer Science School at the University of Alicante, Spain. Trujillo received a Ph.D. in Computer Science from the University of Alicante (Spain) in 2001. His research interests include database modeling, data warehouses, conceptual design of data warehouses, multidimensional databases, data warehouse security and quality, mining data warehouses, OLAP, as well as object-oriented analysis and design with UML. He has published many papers in high quality international conferences such as ER, UML, ADBIS, CAiSE, WAIM or DAWAK. He has also published papers in highly cited international journals such as IEEE Computer, Decision Support Systems (DSS), Data and Knowledge Engineering (DKE) or Information Systems (IS). Dr. Trujillo has served as a Program Committee member of several workshops and conferences such as ER, DOLAP, DAWAK, DSS, JISBD and SCI and has also spent some time as a reviewer of several journals such as JDM, KAIS, ISOFT and JODS. He has been Program Chair of DOLAP’05 and BP-UML’05, and Program Co-chair of DAWAK’05, DAWAK’06, BP-UML’06 and FP-UML’07. His e-mail is [email protected].

    Jens Lechtenbörger obtained his Ph.D. in computer science from the University of Münster, Germany, with a thesis on “Data Warehouse Schema Design” for which he received a dissertation award at the German database conference BTW 2003. Currently, he is a tenured lecturer at the Department of Information Systems at the University of Münster. His publications appear in major international conferences and journals, for which he also serves as reviewer. His current research interest include all aspects of database systems, in particular data warehouse modeling, and e-learning. His e-mail is [email protected].

    View full text