Reconciling requirement-driven data warehouses with data sources via multidimensional normal forms

doi:10.1016/j.datak.2007.04.004

Data & Knowledge Engineering

Volume 63, Issue 3, December 2007, Pages 725-751

https://doi.org/10.1016/j.datak.2007.04.004 Get rights and content

Abstract

Successful data warehouse (DW) design needs to be based upon a requirement analysis phase in order to adequately represent the information needs of DW users. Moreover, since the DW integrates the information provided by data sources, it is also crucial to take these sources into account throughout the development process to obtain a consistent reconciliation of data sources and information needs. In this paper, we start by summarizing our approach to specify user requirements for data warehouses and to obtain a conceptual multidimensional model capturing these requirements. Then, we make use of the multidimensional normal forms to define a set of Query/View/Transformation (QVT) relations to assure that the conceptual multidimensional model obtained from user requirements agrees with the available data sources that will populate the DW. Thus, we propose a hybrid approach to develop DWs, i.e., we firstly obtain the conceptual multidimensional model of the DW from user requirements and then we verify and enforce its correctness against data sources by using a set of QVT relations based on multidimensional normal forms. Finally, we provide some snapshots of the CASE tool we have used to implement our QVT relations.

Introduction

A data warehouse (DW) is commonly described as an integrated collection of historical data in support of decision making that structures information into facts and dimensions based on multidimensional (MD) modeling [8], [9]. Since the DW integrates several data sources, the development of the underlying MD model has been traditionally guided by an analysis of these data sources [8]. In fact, several approaches have been defined in order to derive a conceptual MD model from given data sources [6], [3], [7]. Considering these data-driven approaches, MNFs (multidimensional normal forms) have been developed [11] to reason, in a rigorous manner, about the quality (faithfulness, completeness, avoidance of redundancies, summarizability) of a conceptual MD model derived from operational data sources.

Nevertheless, in these data-driven approaches the requirement analysis phase is overlooked, thus resulting in an MD model in which the user needs and expectations may not be satisfied [33]. To overcome this problem, several approaches [33], [27], [18], [5] advocate a requirement-driven DW design process. However, hardly any of these approaches formally considers the data sources in the early stages of the development. Therefore, the correctness of the MD model with respect to the data sources cannot be assured and the DW repository cannot be properly populated from these data sources.

In order to reconcile these two points of view (data-driven and requirement-driven), a Model Driven Architecture (MDA) [23] framework for the development of DWs has been described in [17], [15]. Within this approach a conceptual MD model of the DW repository is developed from user requirements. This initial MD model must be then reconciled with the data sources in order to assure its correctness, which is the major contribution of the present paper. The following step is to apply model transformations to obtain several logical models as a base for the implementation of the DW repository. Within this development framework, up to now we have defined (i) a framework to specify user requirements for DWs [18], (ii) a UML (Unified Modeling Language) [26] profile that contains the necessary stereotypes in order to carry out the conceptual MD modeling successfully [13], (iii) a set of guidelines for deriving a conceptual MD model from user requirements [18], and (iv) a set of QVT (Query/View/Transformation) relations in order to automatically derive several logical representations of the conceptual MD model [17], [14], [15]. An overview of this overall framework, including the reconciliation of data sources and MD conceptual model by means of MNFs, is shown in Fig. 1.

In the short version of this paper [16], we presented a set of QVT relations in order to check the correctness of the conceptual MD model against the available data sources within our DW development framework (see Fig. 1). These QVT relations are based on MNFs proposed in [11]. The QVT language allows us to easily integrate this approach in our MDA framework [17], [15], while MNFs enable us to formalize the relationship between the data sources and the conceptual MD model of the DW repository.

In this paper, we extend the short version [16] as follows: (i) We provide a new section describing our approach for specifying user requirements for DWs. (ii) We now provide the complete set of the QVT relations in order to check the correctness of the conceptual MD model against the available data sources. (iii) We include a new aspect concerning on how to apply these QVT relations. (iv) We extend the case study, with more details and figures that clarify how to use our approach in real world scenarios. Finally, (v) we provide a new section explaining how we have implemented our QVT relations in a CASE tool in order to facilitate its application in real world case studies.

The motivation of our approach is as follows: since the DW integrates the information provided by source databases, it is important to check (in early stages of the development) if the requirement-driven conceptual MD model agrees with the available data sources in order to assure that (i) the DW repository will be properly populated from data sources, (ii) the analysis potential provided by the data sources is captured by the conceptual MD model, (iii) redundancies are avoided, and (iv) optional dimension levels, i.e., levels allowing NULL values, are controlled via specialization/generalization to enable context-sensitive summarizability and to avoid inconsistent queries.

To illustrate these benefits, consider the following running example, which is inspired by the sample scenario of [10]. We assume that the conceptual MD model for the banking domain shown in Fig. 2 has been derived from user requirements without taking data sources into account, e.g., according to the proposal presented in [18], which is briefly described in Section 3.1. The notation of Fig. 2 is based on our UML profile for MD modeling presented in [13] (see Section 4.2 for details). The figure models account facts which are composed of several measures (balance, turnover, interest, and customer age) and described by dimensions organization, product, time, and customer. For the sake of clarity, we only focus on the customer dimension.

Every customer is described in terms of a unique identification number, a name, and a date of birth. Every customer lives in a city which is described with a name and a population. Moreover, customers may be associated with job, gender, industry branch, and contact person. Finally, a city belongs to (Rolls-upTo) exactly one region and exactly one district, while a region belongs to exactly one state. This model represents a geographical classification where every region falls into exactly one state, while districts and states appear to be unrelated. From a conceptual perspective, this classification seems reasonable.

However, the data sources shown in Fig. 3 (using part of the relational CWM metamodel explained in Section 4.1) provide geographical information where every district falls into exactly one state, while regions and states are unrelated. Consequently, (i) the source information concerning regions and states cannot be represented faithfully under the MD model and (ii) potential for roll-up queries from level district to level state is not represented, i.e., analysis potential is lost.

Moreover, the MD model does not represent the structural information that industry branches and contact persons are assigned only to company customers while job and gender are only applicable to private customers, which poses challenges for summarizability and complicates querying (see [11], [12]). Finally, while it certainly makes sense to analyze the age structure of customers, the measure age is not specific to accounts but only to customers. Thus, this measure should be moved to a different fact schema. To summarize, based on schema information for the data sources, the conceptual MD model shown in Fig. 2 should be improved in a number of ways to obtain the “better” model shown in Fig. 4. Indeed, in this paper we show how to apply QVT relations, which are derived from MNFs, to obtain the model shown in Fig. 4 from the model shown in Fig. 2 by taking data sources shown in Fig. 3 into account.

The remainder of this paper is structured as follows: Related work is put into perspective next, before necessary background concerning requirement analysis in DWs, QVT and MNFs is collected in Section 3. Our approach is presented in Section 4 by describing our data source model as well as our conceptual MD model, and defining QVT relations based on MNFs. The application of sample QVT relations, as well as their implementation and execution is illustrated in Section 5. The paper ends with conclusions and suggestions for future work in Section 6.

Section snippets

Related work

In this section, we briefly describe the most relevant approaches for both data-driven and requirement-driven DW development.

Concerning data-driven approaches, in [3], the authors present the Multidimensional Model, a logical model for MD databases. The authors also propose a general design method, aimed at building an MD schema starting from an operational database described by an Entity-Relationship (ER) schema.

In [6], the authors propose the Dimensional-Fact Model (DFM), a particular

Background

In this section, we first outline our approach for (i) defining information requirements in DWs, and (ii) obtaining a conceptual MD model from them. Afterwards, we provide a brief overview of the building blocks of our approach, namely QVT and multidimensional normal forms.

Checking correctness of the conceptual MD model

In this section, we present our approach to check the correctness of a conceptual MD model with respect to the source databases. To this end, we present a set of QVT relations based on MNFs and obtain their inherent desirable design objectives: The resulting conceptual MD model faithfully represents the data sources and captures their analysis potential completely, redundancies are avoided, and NULL values are controlled to allow context-sensitive summarizability and avoid contradictory

Case study

In this section, we show how our QVT relations are properly applied. Furthermore, we explain how to implement and execute these defined QVT relations.

Conclusions and future work

In this paper, we have presented an approach to assure the correctness of a conceptual MD model of the DW repository according to the data sources that will populate this repository. This approach is outlined as follows: We first obtain the conceptual MD model of the DW from user requirements and then we verify and enforce its correctness against data sources by using a set of QVT relations based on MNFs. By using MNFs, we can assure that the conceptual MD model satisfies certain desirable

Acknowledgements

We thank all the reviewers of this paper for their detailed and useful comments. This work has been partially supported by the METASIGN (TIN2004-00779) project from the Spanish Ministry of Education and Science, and by the DADS (PBC-05-012-2) project from the Castilla-La Mancha Ministry of Education and Science (Spain). Jose-Norberto Mazón is funded by the Spanish Ministry of Education and Science under a FPU Grant (AP2005-1360).

Jose-Norberto Mazón is a Ph.D. student from the University of Alicante (Spain). He currently enjoys a research grant from the Spanish Ministry of Education and Science. He received a Master in Computer Science in 2004 from the University of Alicante. He has published several papers about data warehouses in national and international workshops and conferences, such as DAWAK, ER, DOLAP, BNCOD, JISBD and so on. His research interests are: database modeling, conceptual design of data warehouses,

References (36)

J. Lechtenbörger et al.
Multidimensional normal forms for data warehouse design
Inf. Syst.
(2003)
S. Luján-Mora et al.
A UML profile for multidimensional modeling in data warehouses
Data Knowl. Eng.
(2006)
P.A. Bernstein
Synthesizing third normal form relations from functional dependencies
ACM Trans. Database Syst.
(1976)
Borland Together,...
L. Cabibbo et al.
A logical approach to multidimensional databases
Eclipse,...
P. Giorgini, S. Rizzi, M. Garzetti, Goal-oriented requirement analysis for data warehouse design, in: DOLAP,...
M. Golfarelli et al.
The dimensional fact model: a conceptual model for data warehouses
Int. J. Cooper. Inf. Syst.
(1998)
B. Hüsemann, J. Lechtenbörger, G. Vossen, Conceptual data warehouse modeling, in: M.A. Jeusfeld, H. Shu, M. Staudt, G....
W. Inmon
Building the Data Warehouse
(2002)

R. Kimball et al.

The Data Warehouse Toolkit

(2002)

J. Lechtenbörger, Data warehouse schema design, Ph.D. Thesis, University of Münster, Germany,...

W. Lehner et al.

Normal forms for multidimensional databases

J.-N. Mazón et al.

Applying transformations to model driven data warehouses

J.-N. Mazón, J. Trujillo, An MDA approach for the development of data warehouses, Decis. Support Syst., in press,...

J.-N. Mazón et al.

A set of QVT relations to assure the correctness of data warehouses by using multidimensional normal forms

J.-N. Mazón, J. Trujillo, M. Serrano, M. Piattini, Applying MDA to the development of data warehouses, in: DOLAP,...

J.-N. Mazón et al.

Designing data warehouses: from business requirement analysis to multidimensional modeling

Cited by (79)

Specification and derivation of key performance indicators for business analytics: A semantic approach
2017, Data and Knowledge Engineering
Key Performance Indicators (KPI) measure the performance of an enterprise relative to its objectives thereby enabling corrective action where there are deviations. In current practice, KPIs are manually integrated within dashboards and scorecards used by decision makers. This practice entails various shortcomings. First, KPIs are not related to their business objectives and strategy. Consequently, decision makers often obtain a scattered view of the business status and business concerns. Second, while KPIs are defined by decision makers, their implementation is performed by IT specialists. This often results in discrepancies that are difficult to identify. In this paper, we propose an approach that provides decision makers with an integrated view of strategic business objectives and conceptual data warehouse KPIs. The main benefit of our proposal is that it links strategic business models to the data for monitoring and assessing them. In our proposal, KPIs are defined using a modeling language where decision makers specify KPIs using business terminology, but can also perform quick modifications and even navigate data while maintaining a strategic view. This enables monitoring and what-if analysis, thereby helping analysts to compare expectations with reported results.
Cost-benefit analysis of data warehouse design methodologies
2017, Information Systems
Citation Excerpt :
For this reason, the methodologies compared here present a complexity value higher than that measured using traditional, simple methodologies. In particular, the methodologies chosen for the case study are: (a) Graph-based Hybrid Multidimensional Model−GrHyMM [4], (b) UML for data warehouse−UMLDW [10], (c) Multidimensional Design By Example−MDBE [3], (d) Phipps & Davis Methodology−PDM [6], (e) Goal-oriented Requirement Analysis for Data Warehouse Design−GRAnD [8], and (f) Goal/Question/Metric-based Methodology−GQM [9]. Since the methodologies solely differ from the early steps to multidimensional modeling and, moreover, the logical and the physical designs can be considered as a common, standard step, we only provide in Fig. 2 the different steps.
Methodologies for data warehouse design are increasing more and more in last years, and each of them proposes a different point of view. Among all the methodologies present in literature, the promising ones are the hybrid methodologies—because they represent the only way to ensure a multidimensional schema to be both consistent with data sources and adherent to user business goals—and those able to support the designer by providing some kind of automation. However, the results obtainable by the methodologies can differ substantially in terms of schema quality and required efforts. In this paper, we provide metrics for evaluating the quality of multidimensional schemata in reference to the effort spent in the design process and the automation degree of the methodology. As a case study, we apply our evaluation to the major emerging hybrid methodologies for data warehouse schema design.
Prioritising data items for business analytics: Framework and application to human resources
2016, European Journal of Operational Research
Citation Excerpt :
Giorgini, Rizzi, and Garzetti (2008), Maté, Trujillo, and Franch (2014) and Mazón and Trujillo (2009) applied the widely used i* system modelling language to structure the actions of organisation's stakeholders (as data suppliers) and the goal-oriented decision making of managers (as users in demand of data), which allows gaining an overview of potentially useful data items for the BI system. Maté and Trujillo (2012) and Mazón, Trujillo, and Lechtenbörger (2007) also aimed at linking the data supply from existing databases with data requirements expressed by end users by employing advanced modelling languages. All mentioned models are concerned with (i) exploring data items that are potentially useful for the BI system and (ii) the structuring of the data items from a BI system architecture viewpoint.
The popularity of business intelligence (BI) systems to support business analytics has tremendously increased in the last decade. The determination of data items that should be stored in the BI system is vital to ensure the success of an organisation's business analytic strategy. Expanding conventional BI systems often leads to high costs of internally generating, cleansing and maintaining new data items whilst the additional data storage costs are in many cases of minor concern – what is a conceptual difference to big data systems. Thus, potential additional insights resulting from a new data item in the BI system need to be balanced with the often high costs of data creation. While the literature acknowledges this decision problem, no model-based approach to inform this decision has hitherto been proposed. The present research describes a prescriptive framework to prioritise data items for business analytics and applies it to human resources. To achieve this goal, the proposed framework captures core business activities in a comprehensive process map and assesses their relative importance and possible data support with multi-criteria decision analysis.
Enrichment of the phenotypic and genotypic Data Warehouse analysis using Question Answering systems to facilitate the decision making process in cereal breeding programs
2015, Ecological Informatics
Citation Excerpt :
Afterwards, the data warehouse is built in order to support queries from the presented approach. Therefore, we can ensure that the query posed on the DW node will return the correct data required by the decision maker (Mazón and Trujillo, 2008; Mazón et al., 2007). The paper is structured as follows.
Currently there are an overwhelming number of scientific publications in Life Sciences, especially in Genetics and Biotechnology. This huge amount of information is structured in corporate Data Warehouses (DWs) or in Biological Databases (e.g. UniProt, RCSB Protein Data Bank, CEREALAB or GenBank), whose main drawback is its cost of updating that makes it obsolete easily. However, these Databases are the main tool for enterprises when they want to update their internal information, for example when a plant breeder enterprise needs to enrich its genetic information (internal structured Database) with recently discovered genes related to specific phenotypic traits (external unstructured data) in order to choose the desired parentals for breeding programs.
In this paper, we propose to complement the internal information with external data from the Web using Question Answering (QA) techniques. We go a step further by providing a complete framework for integrating unstructured and structured information by combining traditional Databases and DW architectures with QA systems. The great advantage of our framework is that decision makers can compare instantaneously internal data with external data from competitors, thereby allowing taking quick strategic decisions based on richer data.
Tracing conceptual models' evolution in data warehouses by using the model driven architecture
2014, Computer Standards and Interfaces
Developing a data warehouse is an ongoing task where new requirements are constantly being added. A widely accepted approach for developing data warehouses is the hybrid approach, where requirements and data sources must be accommodated to a reconciliated data warehouse model. During this process, relationships between conceptual elements specified by user requirements and those supplied by the data sources are lost, since no traceability mechanisms are included. As a result, the designer wastes additional time and effort to update the data warehouse whenever user requirements or data sources change. In this paper, we propose an approach to preserve traceability at conceptual level for data warehouses. Our approach includes a set of traces and their formalization, in order to relate the multidimensional elements specified by user requirements with the concepts extracted from data sources. Therefore, we can easily identify how changes should be incorporated into the data warehouse, and derive it according to the new configuration. In order to minimize the effort required, we define a set of general Query/View/Transformation rules to automate the derivation of traces along with data warehouse elements. Finally, we describe a CASE tool that supports our approach and provide a detailed case study to show the applicability of the proposal.
Requirements-driven data warehouse design based on enhanced pivot tables
2021, Requirements Engineering

View all citing articles on Scopus

Juan Trujillo is an associated professor at the Computer Science School at the University of Alicante, Spain. Trujillo received a Ph.D. in Computer Science from the University of Alicante (Spain) in 2001. His research interests include database modeling, data warehouses, conceptual design of data warehouses, multidimensional databases, data warehouse security and quality, mining data warehouses, OLAP, as well as object-oriented analysis and design with UML. He has published many papers in high quality international conferences such as ER, UML, ADBIS, CAiSE, WAIM or DAWAK. He has also published papers in highly cited international journals such as IEEE Computer, Decision Support Systems (DSS), Data and Knowledge Engineering (DKE) or Information Systems (IS). Dr. Trujillo has served as a Program Committee member of several workshops and conferences such as ER, DOLAP, DAWAK, DSS, JISBD and SCI and has also spent some time as a reviewer of several journals such as JDM, KAIS, ISOFT and JODS. He has been Program Chair of DOLAP’05 and BP-UML’05, and Program Co-chair of DAWAK’05, DAWAK’06, BP-UML’06 and FP-UML’07. His e-mail is [email protected].

Jens Lechtenbörger obtained his Ph.D. in computer science from the University of Münster, Germany, with a thesis on “Data Warehouse Schema Design” for which he received a dissertation award at the German database conference BTW 2003. Currently, he is a tenured lecturer at the Department of Information Systems at the University of Münster. His publications appear in major international conferences and journals, for which he also serves as reviewer. His current research interest include all aspects of database systems, in particular data warehouse modeling, and e-learning. His e-mail is [email protected].

View full text

Reconciling requirement-driven data warehouses with data sources via multidimensional normal forms

Abstract

Introduction

Section snippets

Related work

Background

Checking correctness of the conceptual MD model

Case study

Conclusions and future work

Acknowledgements

Inf. Syst.

Data Knowl. Eng.

Synthesizing third normal form relations from functional dependencies

ACM Trans. Database Syst.

A logical approach to multidimensional databases

The dimensional fact model: a conceptual model for data warehouses

Int. J. Cooper. Inf. Syst.

Building the Data Warehouse

The Data Warehouse Toolkit

Normal forms for multidimensional databases

Applying transformations to model driven data warehouses

A set of QVT relations to assure the correctness of data warehouses by using multidimensional normal forms

Designing data warehouses: from business requirement analysis to multidimensional modeling