Gower distance-based multivariate control charts for a mixture of continuous and categorical variables
Introduction
Statistical process control (SPC) tools are widely used to monitor and improve output quality in the manufacturing and service industries (Woodall, 2000, Woodall and Montgomery, 1999). Control charts, which are based on solid statistical theory, are the most widely used tool in SPC (Montgomery, 2005). Their main purpose is to detect any assignable changes that affect output quality. Monitoring statistics and control limits are the two major components in the construction of a control chart. Monitoring statistics, plotted on a control chart, are established as a function of the observations. Control limits are generally determined from the probability distribution of the monitoring statistics with a user-specified false alarm rate. An out-of-control signal for a monitored process is issued when the corresponding monitoring statistic exceeds (or falls below) the control limit.
Control charts can be divided into univariate and multivariate charts based on the number of quality characteristics that they monitor. Univariate charts monitor a single quality characteristic, and multivariate charts monitor a number of quality characteristics simultaneously. The most widely used multivariate control chart is Hotelling’s T² control chart. Its monitoring statistic is the covariance-scaled distance between an observation and the mean vector estimated from in-control observations. The control limit of a Hotelling’s T² control chart is proportional to a percentile of the F-distribution, assuming that the data follow a multivariate normal distribution (Hotelling, 1947). This distributional assumption restricts the applicability of Hotelling’s T² control charts in situations in which the data are nonnormally distributed.
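For reference, the statistic and limit described above are commonly written as follows. The symbols are assumptions introduced here for illustration, not notation taken from the source: x is a new observation with m quality characteristics, x̄ and S are the sample mean vector and sample covariance matrix estimated from n in-control observations, and F with subscripts α, m, n−m denotes the upper α percentage point of the F-distribution, where α is the false alarm rate. The limit shown is the standard form for monitoring future individual observations.

```latex
T^2 = (\mathbf{x} - \bar{\mathbf{x}})^{\top} \mathbf{S}^{-1} (\mathbf{x} - \bar{\mathbf{x}}),
\qquad
\mathrm{UCL} = \frac{m(n+1)(n-1)}{n(n-m)} \, F_{\alpha;\, m,\, n-m}
```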
To address this problem, many distribution-free control charts have been proposed (Bakir, 2006, Chakraborti et al., 2001, Liu, 1995, Liu et al., 2004, Phaladiganon et al., 2011, Qiu, 2008, Qiu and Hawkins, 2001, Qiu and Hawkins, 2003, Sukchotrat et al., 2009, Sun and Tsung, 2003, Tuerhong et al., 2014, Yang et al., 2011). A comprehensive review of univariate distribution-free control charts can be found in Chakraborti et al. (2001). For the multivariate case, Liu (1995) developed a multivariate nonparametric control chart that uses the concept of data depth. To improve the location detection capability of this data depth-based chart, Liu et al. (2004) later proposed a nonparametric multivariate data depth moving average control chart. However, both of these data depth methods carry a high computational load, which makes them less efficient for many modern processes that involve many quality characteristics (Ning & Tsung, 2012). Qiu and Hawkins developed distribution-free, rank-based multivariate cumulative sum procedures to handle nonnormally distributed process data (Qiu and Hawkins, 2001, Qiu and Hawkins, 2003). However, their methods assume that the distribution of the in-control data is known. Recently, several other useful nonparametric multivariate control charts based on the sign test have been proposed (Das, 2009, Zou and Tsung, 2011, Zou et al., 2012).
Further, some studies have integrated data mining algorithms with control chart techniques. Sun and Tsung (2003) introduced a kernel-based multivariate control chart that uses support vector data description to handle nonnormally distributed processes. He and Wang (2007) presented a multivariate control chart based on a k nearest neighbor algorithm. To lower the computational cost and improve the detection of out-of-control signals, Cui, Li, and Wang (2008) proposed an improved version of kernel principal component analysis-based multivariate control charts. Sukchotrat et al. (2009) proposed a K² control chart based on a k nearest neighbor data description. Stefatos and Hamza (2009) proposed a multivariate control chart based on a robust covariance matrix and principal component analysis. Yu and Xi (2009) proposed an on-line monitoring approach based on a neural network ensemble technique. El-Midany, El-Baz, and Abd-Elwahed (2010) proposed a control scheme using artificial neural networks. Bush, Chongfuangprinya, Chen, Sukchotrat, and Kim (2010) developed nonparametric multivariate control charts using a linkage ranking algorithm. Phaladiganon et al. (2011) proposed a bootstrap-based multivariate T² control chart for situations in which the distribution of the observed data is nonnormal or unknown. Kim, Jitpitaklert, Park, and Hwang (2012) proposed control charts for multivariate and autocorrelated processes that use various data mining algorithms. Verdier and Ferreira (2011) proposed an adaptive Mahalanobis distance-based multivariate control chart. Their approach showed good performance with data that have a local structure. Recently, Tuerhong, Kim, Kang, and Cho (2014) proposed a distribution-free multivariate control chart based on a hybrid novelty score.
All of the aforementioned approaches are designed for processes characterized by continuous quality characteristics. However, in some modern industries the data contain both continuous and categorical variables. In service industries, for example, the credit card transaction dataset described in Prodromidis and Stolfo (1999) contains a mixture of 30 continuous and categorical variables, designed to detect fraudulent transactions. To the best of our knowledge, only a few efforts have been made to develop multivariate nonparametric control charts for mixture data. In one such effort, Hwang, Runger, and Eugene (2007) proposed a multivariate control chart using artificial contrasts that converts the monitoring problem into a supervised classification problem. The basic idea behind their approach is to generate out-of-control data from a uniform distribution and create labels (classes) to build classification models. In another approach, Hu, Runger, and Eugene (2007) simulated artificial out-of-control data from a nonuniform distribution to detect mean shifts in more specific directions. Hu and Runger (2010) proposed an exponentially weighted moving average version of the approach in Hwang et al. (2007) to improve detection capability. Deng, Runger, and Eugene (2012) proposed system monitoring with real-time contrasts. Unlike the other approaches, which rely on artificial contrasts, Deng et al.’s approach builds a new classifier for each new observation, which enables on-line monitoring. One advantage of these contrast-based control charts (Hu and Runger, 2010, Hu et al., 2007, Hwang et al., 2007, Deng et al., 2012) is that they can treat mixture data. However, unlike conventional control charts, their construction relies on supervised classification methods that necessarily require out-of-control data as well as in-control data.
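The artificial contrast idea described above (generate out-of-control data from a uniform distribution, then label both groups for a classifier) can be sketched as follows. This is a minimal illustration, not the procedure of the cited authors: the function name `make_artificial_contrast`, the choice to sample continuous variables uniformly between the observed minimum and maximum, and the choice to sample categorical variables uniformly over the observed levels are all assumptions made here.

```python
import random
from typing import List, Sequence, Tuple


def make_artificial_contrast(
    in_control: Sequence[Tuple],
    is_categorical: Sequence[bool],
    n_artificial: int,
    seed: int = 0,
) -> Tuple[List[Tuple], List[int]]:
    """Return (data, labels): label 0 = in-control, label 1 = artificial.

    Hypothetical helper. The sampling ranges used here are an assumption,
    not a detail taken from the source.
    """
    rng = random.Random(seed)
    columns = list(zip(*in_control))  # one tuple of values per variable
    artificial = []
    for _ in range(n_artificial):
        row = []
        for col, cat in zip(columns, is_categorical):
            if cat:
                # Uniform over the levels seen in the in-control data.
                row.append(rng.choice(sorted(set(col))))
            else:
                # Uniform over the observed range of the variable.
                row.append(rng.uniform(min(col), max(col)))
        artificial.append(tuple(row))
    data = list(in_control) + artificial
    labels = [0] * len(in_control) + [1] * n_artificial
    return data, labels
```

The labeled set returned here would then be passed to any supervised classifier, which is the step that makes these charts depend on a classification method.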
Recently, Ning and Tsung (2012) proposed a density-based control chart that uses a local outlier factor and showed that their approach can efficiently handle processes characterized by a mixture of continuous and categorical variables. However, the simulation study presented to demonstrate the usefulness of their approach has limitations, especially with data that have a large number of categorical variables.
In the present study, we propose nonparametric multivariate control charts based on the Gower distance to handle a mixture of continuous and categorical data. In the proposed Gower distance-based control chart, the monitoring statistic is the value of the Gower distance, and the control limits can be calculated by a bootstrap percentile method.
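A bootstrap percentile control limit can be sketched as follows. This shows one common form of the method, assumed here rather than taken from the source: resample the in-control monitoring statistics with replacement, take the 100(1 − α)th percentile of each resample, and average those percentiles. The names `percentile` and `bootstrap_control_limit`, the linear-interpolation percentile rule, and the defaults for `alpha`, `n_boot`, and `seed` are all illustrative.

```python
import random
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile by linear interpolation between order statistics, q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def bootstrap_control_limit(
    stats: Sequence[float],
    alpha: float = 0.01,
    n_boot: int = 1000,
    seed: int = 0,
) -> float:
    """Average of the (1 - alpha) percentiles over n_boot bootstrap resamples.

    `stats` holds the monitoring statistics of the in-control observations.
    """
    rng = random.Random(seed)
    n = len(stats)
    limits: List[float] = []
    for _ in range(n_boot):
        # Draw n values with replacement from the in-control statistics.
        resample = [stats[rng.randrange(n)] for _ in range(n)]
        limits.append(percentile(resample, 1 - alpha))
    return sum(limits) / n_boot
```

A new observation would be flagged as out of control when its monitoring statistic exceeds the returned limit. No distributional assumption on the statistic is needed, which is what makes the approach suitable for a distance-based chart.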
The rest of the paper is organized as follows. In Section 2, we describe the proposed Gower distance-based multivariate control chart in terms of its monitoring statistics and control limits. Section 3 presents a simulation study that examines the performance of the proposed control chart and compares it with existing ones under various scenarios. In Section 4, we use real data to demonstrate the feasibility and effectiveness of the proposed control charts. Finally, Section 5 contains concluding remarks and topics for future study.
Gower’s dissimilarity coefficient
Let q be the number of dimensions and let x be a mixture observation characterized by p categorical variables and q–p continuous variables. Thus, the vector x can be partitioned into two sub-vectors, one containing the p categorical variables and the other containing the q–p continuous variables. Gower’s dissimilarity coefficient is the weighted average of the distances calculated for each variable after scaling each variable to the [0, 1] range.
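The weighted average described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gower_distance` and its parameters are hypothetical, a categorical variable is assumed to contribute 0 when the two levels match and 1 otherwise, and a continuous variable is assumed to contribute the absolute difference divided by that variable's range in the in-control data.

```python
from typing import Optional, Sequence


def gower_distance(
    a: Sequence,
    b: Sequence,
    is_categorical: Sequence[bool],
    ranges: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of per-variable distances, each lying in [0, 1].

    `ranges[j]` is the range (max - min) of continuous variable j; the entry
    is ignored for categorical variables. Weights are assumed positive.
    """
    if weights is None:
        weights = [1.0] * len(a)
    total = 0.0
    weight_sum = 0.0
    for x, y, cat, r, w in zip(a, b, is_categorical, ranges, weights):
        if cat:
            d = 0.0 if x == y else 1.0          # simple matching
        else:
            d = abs(x - y) / r if r > 0 else 0.0  # range-scaled difference
        total += w * d
        weight_sum += w
    return total / weight_sum
```

For example, with one categorical and two continuous variables, `gower_distance(("red", 2.0, 10.0), ("blue", 4.0, 10.0), (True, False, False), (0.0, 8.0, 5.0))` averages the per-variable distances 1, 0.25, and 0.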
Simulation set up
A simulation study was conducted to evaluate the performance of the proposed control chart and compare it to existing control charts under various scenarios. The Euclidean distance-based local K² control charts utilize a k nearest neighbor data description (Sukchotrat et al., 2009). For example, for each observation, one can find its k nearest neighbors using the Euclidean distance in the training data X; then the average value of the k nearest neighbor distances is used as the monitoring statistic
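The k nearest neighbor monitoring statistic described above can be sketched as follows. The names `euclidean` and `knn_statistic` are hypothetical, and the `distance` parameter is an assumption added here so that another dissimilarity (for instance a Gower-type distance) could be substituted for the Euclidean one.

```python
import math
from typing import Callable, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_statistic(
    obs: Sequence[float],
    training: Sequence[Sequence[float]],
    k: int,
    distance: Callable[[Sequence, Sequence], float] = euclidean,
) -> float:
    """Average distance from `obs` to its k nearest neighbors in `training`."""
    dists = sorted(distance(obs, t) for t in training)
    return sum(dists[:k]) / k
```

Note that when the statistic is computed for an observation that is itself a row of the training data, this sketch counts the zero self-distance as a neighbor; excluding that row first is the usual remedy.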
Real case study
We demonstrated the effectiveness and applicability of the proposed Gower distance-based control charts with a German credit dataset from the machine learning repository (Frank et al., 2010). This dataset contains 1000 observations, each characterized by 13 categorical and seven continuous variables. Of the 1000 observations, 700 are considered as in control (i.e., good customers) and 300 as out of control (i.e., bad customers). The categorical variables include the checking account status
Conclusions
We have proposed multivariate control charts based on the Gower distance that can efficiently handle processes characterized by high-dimensional mixture data. Simulation results revealed that the proposed Gower distance-based control charts outperformed the existing ones. This is especially true of the proposed local Gower distance-based chart, which consistently yielded better results than the others as the number of categorical variables increased. Moreover, we used a real dataset
Acknowledgement
The authors thank the editor and the referees, whose comments helped improve the presentation of this paper. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (2013007724) and the Ministry of Knowledge Economy in Korea under the IT R&D Infrastructure Program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-(B1110-1101-0002)).
References (37)
- Cui et al. (2008). Improved kernel principal component analysis for fault detection. Expert Systems with Applications.
- Kim et al. (2012). Data mining model-based control charts for multivariate and autocorrelated processes. Expert Systems with Applications.
- Stefatos and Hamza (2009). Fault detection using robust multivariate control chart. Expert Systems with Applications.
- Yang et al. (2011). A new nonparametric EWMA sign control chart. Expert Systems with Applications.
- Yu and Xi (2009). A neural network ensemble-based model for on-line monitoring and diagnosis of out-of-control signals in multivariate manufacturing processes. Expert Systems with Applications.
- Bakir (2006). Distribution-free quality control charts based on signed-rank-like statistics. Communications in Statistics – Theory and Methods.
- Bush et al. (2010). Nonparametric multivariate control charts based on a linkage ranking algorithm. Quality and Reliability Engineering International.
- Chakraborti et al. (2001). Nonparametric control chart: an overview and some results. Journal of Quality Technology.
- Das (2009). A new multivariate non-parametric control chart based on sign test. Quality Technology and Quantitative Management.
- Deng et al. (2012). System monitoring with real-time contrasts. Journal of Quality Technology.