Elsevier

Decision Support Systems

Volume 129, February 2020, 113173

Multivariate data quality assessment based on rotated factor scores and confidence ellipsoids

https://doi.org/10.1016/j.dss.2019.113173

Highlights

  • A novel multivariate data quality assessment was proposed.

  • Strategies based on rotated factor scores and confidence ellipsoids were proposed.

  • An experimental application to verify the method in a real case was performed.

  • The results showed that the method favors the correlated data quality evaluation.

  • The method proved to be a better option when compared with other approaches.

Abstract

This study explores the nature of the correlation in data to estimate the quality of data used in decision-making processes. The main contribution of this research is a new multivariate method for the repeatability and reproducibility study, based on factor scores rotated by the varimax strategy, which effectively identifies poor-quality data that lead to measurement errors. In addition, a new decision support method based on confidence ellipsoids is developed. The efficiency of the proposed method was demonstrated using metallographic measurements of the geometric characteristics of the resistance spot welding process, and it was compared with other consolidated techniques: the analysis of variance, the weighted principal components method, and factor analysis without rotation. The proposed method provided a better interpretation of the latent information, reducing the dimensionality of the data and separating the analyzed quality attributes into clusters. One response group was classified as acceptable, and the other as marginal. These results were verified by the confidence ellipsoids, in which the proposed method obeyed the Bonferroni bilateral limits: the factors showed superior discriminatory power with non-overlapping ellipsoids, which avoids confounding and favors better data quality analysis for multicriteria decision-making. Compared with the other approaches, the proposed method gave more reliable and robust results, free of deficiencies such as inversion of the groupings, neglect of the variance-covariance structure, and neglect of the variability attributed to the data within the measurement system.

Introduction

Improvements in industrial processes aimed at cost reduction and quality improvement [1] are widely discussed. Researchers seek to introduce innovative methodologies based on mathematical modeling to maximize efficiency and improve decision-making in these processes. Among these proposals is the study by McHaney and Douglas [2], who developed a multivariate regression metamodel of a decision support system (DSS) for daily resource allocation in an industry. Gomes et al. [3] combined artificial neural network (ANN) modeling with a genetic algorithm for damage detection in carbon fiber reinforced polymer (CFRP) aeronautical plates, aiming to create a DSS that supports more precise decisions on the coupling of sensors in commercial aircraft. Gaudencio et al. [4] used a fuzzy decision-making strategy together with the multivariate mean square error approach to identify optimal parameters in robust estimators applied to AISI 12L14 free-machining steel turning.

However, focusing all efforts exclusively on improving the process may not yield a satisfactory result, as variability can often be attributed to the measurement process [5], which may compromise the quality of the data analyzed by the decision maker. According to Moges et al. [6], among the many factors that can affect the decision-making process, data quality is the most critical: poor-quality data may lead to poor decision-making. They therefore highlight data quality as one of the most crucial problems in many industries. Heinrich and Klier [7] state that data quality assessment has been extensively discussed in the literature of fields in which high-quality data is required for business or decision-making processes. Furthermore, Timmerman and Bronselaer [8] note that data quality is of great interest for scientific research. Several works in the literature address the data quality issue [6,7,8,9,10]. The importance of data quality has created the need for appropriate metrics for its evaluation [7], leading to the development of different measurement approaches [8]. One of the most effective ways to analyze variability and uncertainty in data quality is measurement system assessment (MSA). MSA helps determine the ability of the system to analyze the total variability, and it provides reliable aid for decision-making in cases where the special cause of variability is associated with the measurement system and the common cause is variability attributed to the process itself.

According to Mast and Trip [11], several statistical methodologies, such as six sigma, aim to improve data quality and, consequently, decision-making. However, because their applicability often depends on process data, data quality is a crucial parameter to analyze. Variability in the data intended for decision-making is therefore diagnosed and identified to avoid wrong decisions: if the quality of the data used in the decision-making process is not high, the decision maker may reach an unsatisfactory or even erroneous conclusion. Woodall and Borror [12] state that the best approach to analyzing the capability of measurement systems is the gage repeatability and reproducibility (GR&R) study. This strategy analyzes the variability within and between systems, and verifies the consistency of the analysis by the same operator across several different parts (or decision-making units, DMUs). It also considers the variability among several operators in the measurements. Replicated measurements are evaluated as well, to verify the consistency and variability of an instrument or an operator, thereby prioritizing the diagnosis of data quality. In this way, repeatability can be considered the variation within the system, and reproducibility the variation caused by the measurements between the analyzed systems [5].

Among the methods used in GR&R studies, several authors emphasize the analysis of variance (ANOVA) as a widely used technique [5,13]. ANOVA splits the variance of the measurement system into two components, namely repeatability and reproducibility. Several studies apply the GR&R strategy to conduct attribute, crossed, expanded, and nested analyses using the univariate ANOVA approach. Univariate techniques are widely used in several applications, specifically in the economic and health-oriented segments. However, when analyzing several datasets of the same segment, it is necessary to estimate the correlation between the data and to take their variance-covariance structure into account.
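As an illustration of how ANOVA separates repeatability from reproducibility, the following is a minimal sketch for a single response. It assumes a balanced crossed design (every operator measures every part the same number of times) with random part and operator effects; the function name `grr_anova`, the array layout, and the simulated data are hypothetical and are not taken from the study.

```python
import numpy as np


def grr_anova(y):
    """Variance components of a balanced crossed GR&R study.

    y has shape (parts, operators, replicates). Assumed layout, not the
    authors' data format.
    """
    p, o, r = y.shape
    grand = y.mean()
    part_means = y.mean(axis=(1, 2))
    oper_means = y.mean(axis=(0, 2))
    cell_means = y.mean(axis=2)

    # Mean squares of the two-way crossed ANOVA with interaction.
    ms_part = o * r * np.sum((part_means - grand) ** 2) / (p - 1)
    ms_oper = p * r * np.sum((oper_means - grand) ** 2) / (o - 1)
    inter = cell_means - part_means[:, None] - oper_means[None, :] + grand
    ms_int = r * np.sum(inter ** 2) / ((p - 1) * (o - 1))
    ms_err = np.sum((y - cell_means[:, :, None]) ** 2) / (p * o * (r - 1))

    # Method-of-moments estimates; negative estimates are truncated at zero.
    repeatability = ms_err
    var_int = max((ms_int - ms_err) / r, 0.0)
    var_oper = max((ms_oper - ms_int) / (p * r), 0.0)
    var_part = max((ms_part - ms_int) / (o * r), 0.0)
    reproducibility = var_oper + var_int
    gage = repeatability + reproducibility
    total = gage + var_part
    return {
        "repeatability": repeatability,
        "reproducibility": reproducibility,
        "gage": gage,
        "part": var_part,
        "pct_grr": 100.0 * np.sqrt(gage / total),
    }


# Simulated example: 10 parts, 3 operators, 3 replicates (made-up numbers).
rng = np.random.default_rng(0)
parts = rng.normal(0.0, 2.0, size=(10, 1, 1))
operators = rng.normal(0.0, 0.3, size=(1, 3, 1))
noise = rng.normal(0.0, 0.2, size=(10, 3, 3))
print(grr_anova(10.0 + parts + operators + noise))
```

The sketch treats one characteristic at a time, which is exactly the limitation raised above: running it separately on correlated responses ignores their variance-covariance structure.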

Many decision-making processes involve a considerable number of critical-to-quality (CTQ) characteristics, and evaluating them univariately can result in inaccurate analysis and unsatisfactory practical conclusions [14]. A Type I error may occur when statistical control is performed in a univariate manner (i.e., separately) to track a multivariate situation. Therefore, when the data have multiple correlations, multivariate strategies are more appropriate. Many authors have used multivariate approaches in several applications [15,16,17,18,19,20]. However, multivariate techniques for MSA, specifically for the GR&R study, have not been explored to a great extent in the literature. Among the published methods are the work of Majeske [21], who proposed the multivariate analysis of variance (MANOVA), and that of Wang and Yang [22], who applied principal component analysis (PCA) to this problem. Both strategies focused on GR&R studies. In view of these methods, Peruchi et al. [23] proposed applying the weighted principal component (WPC) strategy (originally proposed by Liao [24] for multiobjective optimization) to MSA. According to that study, the WPC approach is superior to the other techniques used in GR&R studies (both univariate and multivariate) because it presented more robust results and achieved confidence intervals with greater precision. Finally, Almeida et al. [25] proposed a single vector approach called weighted factor scores (WF), which uses unrotated factor scores weighted by their respective eigenvalues.
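To make the idea of collapsing correlated responses into a single weighted vector concrete, here is a rough sketch of a weighted principal component score. It assumes that the components are extracted from the correlation matrix and that each component score is weighted by its eigenvalue; the exact weighting used in the cited works is not given in this text, so the function `weighted_pc_scores` and that choice of weights are assumptions for illustration only.

```python
import numpy as np


def weighted_pc_scores(X):
    """Collapse correlated columns of X into one weighted PC score vector.

    Assumption: eigenvalue weights on correlation-matrix components.
    """
    # Standardize so that the analysis uses the correlation structure.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)

    # eigh returns ascending eigenvalues; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    pcs = Z @ eigvecs          # principal component scores, one column each
    wpc = pcs @ eigvals        # weighted sum of the scores (assumed weights)
    return wpc, pcs, eigvals


# Simulated example: four correlated responses driven by one latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 1))
X = latent @ np.array([[1.0, 0.8, 0.6, 0.4]]) + 0.3 * rng.normal(size=(40, 4))
wpc, pcs, eigvals = weighted_pc_scores(X)
print(eigvals / eigvals.sum())  # share of variance explained per component
```

The resulting vector `wpc` could then be fed to a univariate GR&R analysis. Note that eigenvector signs are arbitrary, so the sign of each component score (and hence of the weighted sum) depends on the numerical routine.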

Among the research that employs multivariate methods in GR&R studies, Hamada [26] and Scagliarini [27] applied the MANOVA method to simulated and literature data, respectively. Peruchi et al. [28] used a weighted approach based on MANOVA for the steel turning process. Flynn et al. [29] used MANOVA and PCA in military applications, and Almeida et al. [14] used WPC to evaluate the variability and quality of measurement instruments in the spot welding process.

This indicates the potential for further studies and proposals to estimate the quality of massive correlated data used in decision-making processes. In small or large-scale numerical procedures that can be measured (such as multicriteria decision-making processes), uncertainty should be considered when collecting the data, which may have a significant level of correlation. This justifies the use of multivariate techniques together with strategies such as the analytic hierarchy process (AHP), a widely used approach applied in recent studies [30,31,32,33]. To the best of our knowledge, no study to date has conducted a multivariate GR&R study using the factor analysis (FA) technique with a rotated axes approach to estimate data quality for decision-making. Almeida et al. [14] initially suggested the use of FA with rotated factor scores applied to MSA to enhance data quality analysis; however, no other work has pursued this proposal.

According to Rencher [34], the FA method seeks to reduce the repetition of information between variables by using a smaller number of latent variables. This makes it an important method for treating data whose variance-covariance structure reflects the presence of correlation. However, the factor loadings estimated through extraction methods do not always allow the factors to be interpreted; that is, it is not always possible to clearly identify which factor a given observable variable is associated with. According to Johnson and Wichern [35], in a desirable loading pattern, each variable has a high loading on a single common factor and small to moderate loadings on the remaining common factors. This ideal structure is not always obtained in practice, so rotating the original factor loadings is a standard practice. According to Costello and Osborne [36], the purpose of rotating the original factor loadings is to obtain a simpler, clearer, and more easily interpretable data structure that avoids the confounding of variables.
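The effect of rotation can be sketched with the widely used SVD-based iteration for the raw varimax criterion. This is a generic textbook-style implementation, not the authors' code; the function names, the tolerance, and the toy loading matrix are assumptions chosen for illustration.

```python
import numpy as np


def varimax_criterion(L):
    """Raw varimax criterion: sum over factors of the variance of squared loadings."""
    return np.sum(np.var(L ** 2, axis=0))


def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the criterion with respect to the rotation.
        grad = loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag(np.sum(L ** 2, axis=0)))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt              # closest orthogonal matrix to the gradient
        d_new = s.sum()
        if d_new < d * (1.0 + tol):
            break
        d = d_new
    return loadings @ R, R


# Toy example: a clean two-factor pattern hidden by a 30 degree rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]])
theta = np.deg2rad(30.0)
mix = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix
rotated, R = varimax(unrotated)
print(np.round(unrotated, 3))
print(np.round(rotated, 3))
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged; only the way that variance is distributed across factors changes, which is what makes the rotated pattern easier to read.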

Given the importance of obtaining a quality correlated dataset for multicriteria decision-making applications, this paper proposes a multivariate measurement system assessment method that evaluates the repeatability and reproducibility of the measurement process. To assist in assessing data quality for decision-making processes, the proposed method, which is based on rotated factor scores, seeks to improve the analysis of quality attributes in multicriteria processes. The method allows proper analysis of the multicorrelated structure of the data. It interprets the latent data through the rotation of the axes, thereby reducing the dimensionality of the analyzed data and separating them into clusters. In addition, we evaluated the results using confidence ellipsoids and analyzed the variability using non-overlapping multivariate confidence intervals and Bonferroni bilateral limits. This approach favors the compilation of useful information from the combined raw data, allowing robust decision-making based on the quality of massive correlated data. To assess its performance, the proposed method is applied to a real process: the measurement of the geometric characteristics of the resistance spot welding (RSW) process (indentation depth, penetration, nugget width, and fusion zone). To demonstrate its efficiency, the results are compared with those of other methods found in the literature (ANOVA and WPC) and with factor analysis conducted on the same data with unrotated scores.
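The roles of a confidence ellipsoid and of Bonferroni two-sided limits can be illustrated with the standard results for the mean vector of a multivariate normal sample. The sketch below is an assumption about how such limits are typically computed, not a reproduction of the paper's procedure; the function names and the simulated score groups are hypothetical, and SciPy is assumed to be available for the F and t quantiles.

```python
import numpy as np
from scipy import stats


def mean_confidence_region(X, alpha=0.05):
    """Confidence ellipsoid and Bonferroni limits for the mean of X (n x p)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Ellipsoid: n (mu - xbar)' S^-1 (mu - xbar) <= c2
    c2 = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)
    # Bonferroni two-sided limits: alpha is split across the p components.
    t_crit = stats.t.ppf(1 - alpha / (2 * p), n - 1)
    half = t_crit * np.sqrt(np.diag(S) / n)
    bonferroni = np.column_stack([xbar - half, xbar + half])
    return xbar, S, c2, bonferroni


def inside_ellipsoid(mu, xbar, S, c2, n):
    """True if the candidate mean mu lies inside the confidence ellipsoid."""
    d = np.asarray(mu) - xbar
    return bool(n * d @ np.linalg.solve(S, d) <= c2)


# Two simulated groups of two-dimensional scores (made-up numbers).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
group_a = rng.multivariate_normal([0.0, 0.0], cov, size=30)
group_b = rng.multivariate_normal([3.0, 3.0], cov, size=30)

xa, Sa, ca, bonf_a = mean_confidence_region(group_a)
xb, Sb, cb, bonf_b = mean_confidence_region(group_b)
print("Bonferroni limits, group A:\n", np.round(bonf_a, 3))
print("Mean of B inside ellipsoid of A:", inside_ellipsoid(xb, xa, Sa, ca, len(group_a)))
```

Checking whether one group's mean falls inside the other's ellipsoid is only a quick indication of separation; a full test for non-overlapping ellipsoids would compare the two regions themselves.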

This paper is organized as follows: Section 2 presents the theoretical background on the importance of data quality in decision-making processes and discusses in detail the techniques used in related studies (GR&R and multivariate approaches). Section 3 presents the proposed method, along with the steps and equations required for its application. Section 4 applies the method to the evaluation of the geometric characteristics of the spot welding process and compares it with other methods. Finally, Section 5 presents the conclusions of the study.

Section snippets

Data quality in DSSs

Data quality is regarded as the most important factor to be considered in decision-making processes [7] and is a study area in various research segments. The results of decision-making processes are affected by many different factors, among which data quality is critical [6]. Given the constant increase in the number of studies focused on this topic [6], it is important to properly analyze and assess data quality.

According to Wang and Strong [37], data quality can be defined as the

Multivariate data quality assessment based on rotated factor scores and confidence ellipsoids

Given the importance of ensuring appropriate data quality in DSSs, it is necessary to properly assess the data quality, so that the decision-making process is not skewed by the use of data of insufficient quality. To evaluate the data quality, one has to evaluate the variability of the considered dataset. For this purpose, the repeatability and reproducibility study can be applied, which comprises several approaches to ensure the reliability of the analysis. In multicriteria decision-making, one should

Resistance spot welding (RSW)

To demonstrate the application of the proposed method, an experimental study was conducted on a real process widely used in industry, resistance spot welding (RSW). The quality attributes of RSW can be checked through specific tests [48]. In the end product, it is possible to verify the geometrical characteristics that define spot weld quality, such as indentation depth (ID), penetration (P), nugget width (NW), and fusion zone (FZ) [49]. These characteristics

Conclusions

This study aimed to propose a new decision support strategy for data quality assessment via a multivariate measurement system using the gage repeatability and reproducibility study (GR&R). Thus, the present study focuses on a new multivariate method based on rotated factor scores and confidence ellipsoids. Furthermore, it seeks to improve the decision support process by analyzing the quality of massive correlated data. To confirm the efficiency of the proposed method, it was applied to the

Acknowledgments

The authors would like to express their gratitude to the following Brazilian institutes: FAPEMIG (project number APQ-00385-18), CAPES, and CNPq (project number 303586/2015-0 and 409318/2017-5) for their support to this research.

References (51)

  • C.-J. Lu et al.

    Sales forecasting for computer wholesalers: a comparison of multivariate adaptive regression splines and artificial neural networks

    Decis. Support. Syst.

    (2012)
  • P.A. Jokinen

    Visualization of multivariate processes using principal component analysis and nonlinear inverse modelling

    Decis. Support. Syst.

    (1994)
  • R.S. Peruchi et al.

    A new multivariate Gage R&R method for correlated characteristics

    Int. J. Prod. Econ.

    (2013)
  • R.S. Peruchi et al.

    Weighted approach for multivariate analysis of variance in measurement system analysis

    Precis. Eng.

    (2014)
  • Y. Wang et al.

    Complex chemical process operation evaluations using a novel analytic hierarchy process model integrating deep residual network with principal component analysis

    Chemom. Intell. Lab. Syst.

    (2019)
  • T. Cai et al.

    In vitro evaluation by PCA and AHP of potential antidiabetic properties of lactic acid bacteria isolated from traditional fermented food

    LWT

    (2019)
  • J. Lee et al.

    A hybrid approach of goal programming for weapon systems selection

    Comput. Ind. Eng.

    (2010)
  • M. Ghasemaghaei et al.

    A macro model of online information quality perceptions: a review and synthesis of the literature

    Comput. Hum. Behav.

    (2016)
  • C. Grange et al.

    With a little help from my friends: cultivating serendipity in online shopping environments

    Inf. Manag.

    (2019)
  • A. Al-Refaie et al.

    Evaluating measurement and process capabilities by GR&R with four quality measures

    Measurement

    (2010)
  • Y.-R. Shiau

    Decision support for off-line gage evaluation and improving on-line gage usage

    J. Manuf. Syst.

    (2001)
  • S. Darwish et al.

    Micro-hardness of spot welded (B.S. 1050) commercial aluminium as correlated with welding variables and strength attributes

    J. Mater. Process. Technol.

    (1999)
  • F.A. de Almeida et al.

    A weighted mean square error approach to the robust optimization of the surface roughness in an AISI 12L14 free-machining steel-turning process

    Stroj. Vestnik/Journal Mech. Eng.

    (2018)
  • G.F. Gomes et al.

    Optimized damage identification in CFRP plates by reduced mode shapes and GA-ANN methods

    Eng. Struct.

    (2019)
  • J. Helena et al.

    A multiobjective optimization model for machining quality in the AISI 12L14 steel turning process using fuzzy multivariate mean square error

    Precis. Eng.

    (2019)

Fabrício Alves de Almeida received his bachelor's degree in economics from the Faculty of Economic Sciences Southern Minas Gerais and his master's degree in industrial engineering from the Federal University of Itajubá (UNIFEI), where he is currently a PhD student in industrial engineering. He is also a professor at the Faculty of Economic Sciences Southern Minas Gerais and a member of the research group of the Nucleus of Manufacturing Optimization and Innovation Technology, with an h-index of 6. His research areas are multivariate statistical analysis, multiobjective optimization, quality engineering, industrial economics, and design of experiments.

Rodrigo Reis Leite received his bachelor's degree in industrial engineering from the Federal University of São João Del-Rei and his master's degree in industrial engineering from the Federal University of Itajubá (UNIFEI), where he is currently a PhD student in industrial engineering. He is a member of the Teaching, Research and Extension Group on Quality and Product and of the Nucleus of Manufacturing Optimization and Innovation Technology. He is interested in the control, modeling, and optimization of processes with the aid of statistical quality control (CEQ), planning and analysis of experiments (DOE), multivariate statistics, and multi-objective optimization.

Guilherme Ferreira Gomes is a professor at the Mechanical Engineering Institute of the Federal University of Itajubá (UNIFEI). He received his bachelor's and master's degrees in mechanical engineering from UNIFEI and a master's degree in industrial and mechanical engineering from École Nationale d'Ingénieurs de Metz. He has experience in mechanical engineering, working mainly on the following topics: structural analysis, materials resistance, numerical simulation, numerical methods, finite element method, optimization methods, artificial neural networks, and engineering materials (metallic and composite). Metrics since 2017: 85 citations and an h-index of 6.

Anderson Paulo de Paiva has been a professor at the Industrial Engineering and Management Institute of the Federal University of Itajubá since 2005. He has 67 journal papers indexed in Web of Science and 67 in Scopus, and an h-index of 9. He received his bachelor's degree in mechanical engineering from Centro Universitário do Sul de Minas in 1996, his master's degree in industrial engineering from the Federal University of Itajubá (UNIFEI) in 2004, and his doctoral degree in mechanical engineering from UNIFEI in 2006. His main research areas are multiobjective optimization, design of experiments, and multivariate statistics.

José Henrique de Freitas Gomes is a Mechanical Industrial Engineer from the Federal University of Itajubá (2003–2007), with a master's degree (2008–2010) and a PhD (2010–2013) in Industrial Engineering from the same institution. He is currently Associate Professor I at the Institute of Industrial and Management Engineering (IEPG) of the Federal University of Itajubá, in the undergraduate courses in Production Engineering and Administration and the postgraduate course in Production Engineering. He works in the research lines of modeling, analysis, and optimization of production systems and manufacturing operations. His main research areas are the application and improvement of multi-objective methods, statistical methods, mathematical programming methods, and materials management.
