Abstract
Data warehouse (DW) quality depends on its data models (conceptual, logical, and physical). Multidimensional (MD) modeling is widely recognized as the backbone of data modeling for DWs. Recently, some authors have proposed a set of structural metrics to assess the quality of MD conceptual models. They found a significant relationship between these metrics and the understandability of DW conceptual schemas using correlation analysis techniques such as Spearman’s and Pearson’s. However, advanced statistical and machine learning methods have not been used to predict the effect of each metric on understandability. In this paper, we focus on predicting the effect of structural metrics on the understandability of conceptual schemas using (i) a statistical method (logistic regression analysis, both univariate and multivariate) and (ii) machine learning methods (Decision Trees and the Naive Bayesian Classifier), and (iii) we compare the performance of these statistical and machine learning methods. The results show that some of the metrics individually have a significant effect on the understandability of MD conceptual schemas. Further, a few of the metrics have a significant combined effect on understandability. The results also show that the Naive Bayesian Classifier performs better as a prediction method than logistic regression analysis and Decision Trees.
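To make the univariate analysis mentioned above more concrete, the following is a minimal sketch of fitting a logistic regression of a binary understandability label on a single structural metric. It is not the authors' tooling or data: the metric values, the labels, and the helper names `fit_univariate_logistic` and `predict_proba` are hypothetical, and the fit uses plain gradient descent rather than any particular statistics package.

```python
import math

def fit_univariate_logistic(x, y, lr=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(intercept + slope * x) by batch gradient descent."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    z = [(v - mean) / std for v in x]  # standardize so the step size is stable
    b0 = b1 = 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * zi)))
            g0 += p - yi          # gradient of the log-loss w.r.t. b0
            g1 += (p - yi) * zi   # gradient of the log-loss w.r.t. b1
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    # Map the coefficients back to the original (unstandardized) metric scale.
    slope = b1 / std
    intercept = b0 - b1 * mean / std
    return intercept, slope

def predict_proba(intercept, slope, value):
    """Predicted probability that a schema with this metric value is understandable."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * value)))

# Hypothetical data: one structural metric per schema and a 0/1 label
# (1 = judged understandable). These values are illustrative only.
metric = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
understandable = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

intercept, slope = fit_univariate_logistic(metric, understandable)
odds_ratio = math.exp(slope)  # change in odds per one-unit increase of the metric
print(f"slope={slope:.3f}, odds ratio={odds_ratio:.3f}")
```

In this toy setup a negative slope (odds ratio below 1) would mean that larger metric values go with lower odds of a schema being understandable. A real analysis would also test the coefficient for statistical significance, which this sketch leaves out.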






Cite this article
Kumar, M., Gosain, A. & Singh, Y. Empirical validation of structural metrics for predicting understandability of conceptual schemas for data warehouse. Int J Syst Assur Eng Manag 5, 291–306 (2014). https://doi.org/10.1007/s13198-013-0159-4
DOI: https://doi.org/10.1007/s13198-013-0159-4