Gower distance-based multivariate control charts for a mixture of continuous and categorical variables

https://doi.org/10.1016/j.eswa.2013.08.068

Highlights

  • We propose new nonparametric multivariate process monitoring techniques.

  • Proposed control charts can efficiently handle mixed data.

  • Integration of Gower’s dissimilarity coefficient and Hotelling’s T² control charts.

  • We examine the performance under various simulation and real scenarios.

  • Performance of the method improves as the number of categorical variables increases.

Abstract

Processes characterized by high-dimensional and mixture data challenge traditional statistical process control charts. In this study, we propose a multivariate control chart based on the Gower distance that can handle a mixture of continuous and categorical data. An extensive simulation study was conducted to examine the properties of the proposed control chart under various scenarios and to compare it with some existing multivariate control charts. The simulation results revealed that the proposed control chart outperformed the existing charts as the number of categorical variables increased. Furthermore, we demonstrated the applicability and effectiveness of the proposed control charts through a real case study.

Introduction

Statistical process control (SPC) tools are widely used in monitoring and improving output quality in the manufacturing and service industries (Woodall, 2000; Woodall and Montgomery, 1999). Control charts, which are based on solid statistical theory, are the most widely used tool in SPC (Montgomery, 2005). Their main purpose is to detect any assignable changes that affect output quality. Monitoring statistics and control limits are the two major components in the construction of a control chart. Monitoring statistics, plotted on a control chart, are computed as a function of the observations. Control limits are generally determined from the probability distribution of the monitoring statistics and a user-specified false alarm rate. An out-of-control signal for a monitored process is issued when the corresponding monitoring statistic exceeds (or falls below) the control limit.

Control charts can be divided into univariate and multivariate charts based on the number of quality characteristics that they monitor. Univariate charts monitor a single quality characteristic, and multivariate charts monitor a number of quality characteristics simultaneously. The most widely used multivariate control chart is Hotelling’s T² control chart. Its monitoring statistic is the covariance-scaled distance between an observation and the mean estimated from in-control observations. The control limit of a Hotelling’s T² control chart is proportional to a percentile of the F-distribution, assuming that the data follow a multivariate normal distribution (Hotelling, 1947). This distributional assumption restricts the applicability of Hotelling’s T² control charts in situations in which the data are not normally distributed.
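For reference, the textbook form of the statistic and limit described above can be written as follows. This is a sketch under stated assumptions, not a formula quoted from the paper: the symbols n (number of in-control observations), q (number of monitored variables), the sample mean vector and sample covariance matrix S, and the false alarm rate α are introduced here for illustration, and the limit shown is the usual one for monitoring a future individual observation under multivariate normality.

```latex
T^2 = (\mathbf{x} - \bar{\mathbf{x}})^{T}\, \mathbf{S}^{-1}\, (\mathbf{x} - \bar{\mathbf{x}}),
\qquad
\mathrm{UCL} = \frac{q\,(n+1)(n-1)}{n\,(n-q)}\; F_{\alpha;\, q,\, n-q}
```

Here F_{α; q, n−q} denotes the upper α percentile of the F-distribution with q and n − q degrees of freedom; an observation signals when its T² value exceeds UCL.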

To address this problem, many distribution-free control charts have been proposed (Bakir, 2006, Chakraborti et al., 2001, Liu, 1995, Liu et al., 2004, Phaladiganon et al., 2011, Qiu, 2008, Qiu and Hawkins, 2001, Qiu and Hawkins, 2003, Sukchotrat et al., 2009, Sun and Tsung, 2003, Tuerhong et al., 2014, Yang et al., 2011). A comprehensive review of univariate distribution-free control charts can be found in Chakraborti et al. (2001). For the multivariate case, Liu (1995) developed a multivariate nonparametric control chart that uses the concept of data depth. To improve the location detection capability of this data depth-based chart, Liu et al. (2004) later proposed a nonparametric multivariate data depth moving average control chart. However, both of these data depth methods carry a high computational load, which makes them less efficient for many modern processes that involve many quality characteristics (Ning & Tsung, 2012). Qiu and Hawkins developed distribution-free rank-based multivariate cumulative sum procedures to handle nonnormally distributed process data (Qiu and Hawkins, 2001, Qiu and Hawkins, 2003). However, their methods assume that the distribution of the in-control data is known. Recently, several other useful nonparametric multivariate control charts based on the sign test have been proposed (Das, 2009, Zou and Tsung, 2011, Zou et al., 2012).

Further, some studies have integrated data mining algorithms with control chart techniques. Sun and Tsung (2003) introduced a kernel-based multivariate control chart that uses support vector data description to handle nonnormally distributed processes. He and Wang (2007) presented a multivariate control chart based on a k nearest neighbor algorithm. To lower computational cost and improve the detection of out-of-control signals, Cui, Li, and Wang (2008) proposed an improved version of kernel principal component analysis-based multivariate control charts. Sukchotrat et al. (2009) proposed a K² control chart based on a k nearest neighbor data description. Stefatos and Hamza (2009) proposed a multivariate control chart based on a robust covariance matrix and principal component analysis. Yu and Xi (2009) proposed an on-line monitoring approach based on a neural network ensemble technique. El-Midany, El-Baz, and Abd-Elwahed (2010) proposed a control scheme using artificial neural networks. Bush, Chongfuangprinya, Chen, Sukchotrat, and Kim (2010) developed a nonparametric multivariate control chart using a linkage ranking algorithm. Phaladiganon et al. (2011) proposed a bootstrap-based multivariate T² control chart for situations in which the distribution of the observed data is nonnormal or unknown. Kim, Jitpitaklert, Park, and Hwang (2012) proposed control charts for multivariate and autocorrelated processes that use various data mining algorithms. Verdier and Ferreira (2011) proposed an adaptive Mahalanobis distance-based multivariate control chart, which showed good performance with data that have a local structure. Recently, Tuerhong, Kim, Kang, and Cho (2014) proposed a distribution-free multivariate control chart based on a hybrid novelty score.

All of the aforementioned approaches are designed for processes characterized by continuous quality characteristics. However, in some modern industries the data contain both continuous and categorical variables. In service industries, for example, the credit card transaction dataset described in Prodromidis and Stolfo (1999), which was designed to detect fraudulent transactions, contains a mixture of 30 continuous and categorical variables. To the best of our knowledge, only a few efforts have been made to develop multivariate nonparametric control charts for mixture data. In one such effort, Hwang, Runger, and Eugene (2007) proposed a multivariate control chart using artificial contrasts that converts the monitoring problem into a supervised classification problem. The basic idea behind their approach is to generate out-of-control data from a uniform distribution and create labels (classes) to build classification models. In another approach, Hu, Runger, and Eugene (2007) simulated artificial out-of-control data from a nonuniform distribution to detect mean shifts in more specific directions. Hu and Runger (2010) proposed an exponentially weighted moving average version of the approach in Hwang et al. (2007) to improve detection capability. Deng, Runger, and Eugene (2012) proposed system monitoring with real-time contrasts. Unlike the other approaches, which rely on artificial contrasts, the approach of Deng et al. builds a new classifier for each new observation, which enables on-line monitoring. One advantage of these contrast-based control charts (Hu and Runger, 2010, Hu et al., 2007, Hwang et al., 2007, Deng et al., 2012) is that they can treat mixture data. However, unlike conventional control charts, their construction relies on supervised classification methods that necessarily require out-of-control data as well as in-control data.

Recently, Ning and Tsung (2012) proposed a density-based control chart that uses a local outlier factor and showed that their approach can efficiently handle processes characterized by a mixture of continuous and categorical variables. However, the simulation study presented to demonstrate the usefulness of their approach has limitations, especially for data that have a large number of categorical variables.

In the present study, we propose nonparametric multivariate control charts based on the Gower distance to handle a mixture of continuous and categorical data. In the proposed Gower distance-based control chart, the monitoring statistic is the value of the Gower distance, and the control limits can be calculated by a bootstrap percentile method.
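As an illustration of how a bootstrap percentile control limit might be computed, the sketch below resamples the in-control monitoring statistics with replacement, takes the 100(1 − α)th percentile of each resample, and averages those percentiles. This is one common variant and is an assumption here, since the text above does not spell out the exact procedure; the function names, the default values of `alpha` and `n_boot`, and the linear-interpolation quantile are all hypothetical choices.

```python
import random


def percentile(values, prob):
    """Empirical quantile using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = prob * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def bootstrap_control_limit(stats, alpha=0.01, n_boot=1000, seed=0):
    """Average of the (1 - alpha) percentiles over n_boot bootstrap resamples.

    stats: monitoring statistics computed from in-control observations.
    alpha: user-specified false alarm rate (assumed default).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(stats)
    limits = []
    for _ in range(n_boot):
        # Resample the in-control statistics with replacement.
        resample = [stats[rng.randrange(n)] for _ in range(n)]
        limits.append(percentile(resample, 1.0 - alpha))
    return sum(limits) / n_boot
```

A new observation would then be flagged as out of control when its monitoring statistic exceeds the returned limit.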

The rest of the paper is organized as follows. In Section 2, we describe the proposed Gower distance-based multivariate control chart in terms of its monitoring statistics and control limits. Section 3 presents a simulation study that examines the performance of the proposed control chart and compares it with existing ones under various scenarios. In Section 4, we use real data to demonstrate the feasibility and effectiveness of the proposed control charts. Finally, Section 5 contains concluding remarks and topics for future study.

Section snippets

Gower’s dissimilarity coefficient

Let $q$ be the number of dimensions and $x = (x_1, \ldots, x_p, x_{p+1}, \ldots, x_q)$ be a mixture observation characterized by $p$ categorical variables and $q - p$ continuous variables. Thus, the vector $x$ can be rewritten as follows:

$x = (z_1, \ldots, z_p, c_1, \ldots, c_{q-p})^T = (z^T, c^T)^T$

where $z^T$ and $c^T$ represent the subvectors of $x$ containing the $p$ categorical variables and the $q - p$ continuous variables, respectively. Gower’s dissimilarity coefficient is the weighted average of the distances calculated for each variable after scaling each variable to a [0, 1]
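A minimal sketch of a Gower-type dissimilarity between two mixed observations is given below. It assumes equal weights for all variables, simple matching (0 or 1) for the categorical variables, and range scaling for the continuous variables using ranges taken from in-control data; these choices, the function name, and the argument layout are illustrative assumptions rather than the paper's exact definition.

```python
def gower_distance(a, b, n_categorical, ranges):
    """Gower-type dissimilarity between two mixed observations a and b.

    The first n_categorical entries of each observation are categorical and
    the remaining entries are continuous. ranges[k] is the (max - min) range
    of the k-th continuous variable, assumed to come from in-control data.
    All variables receive equal weight (an assumption of this sketch).
    """
    q = len(a)
    total = 0.0
    for k in range(q):
        if k < n_categorical:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if a[k] == b[k] else 1.0
        else:
            r = ranges[k - n_categorical]
            # Range-scaled absolute difference, clipped to [0, 1] in case a
            # new observation falls outside the in-control range.
            total += min(abs(a[k] - b[k]) / r, 1.0) if r > 0 else 0.0
    return total / q
```

For example, with two categorical and two continuous variables, `gower_distance(("red", "A", 1.0, 10.0), ("red", "B", 3.0, 10.0), 2, [4.0, 20.0])` averages the per-variable contributions 0, 1, 0.5 and 0.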

Simulation set up

A simulation study was conducted to evaluate the performance of the proposed control chart and compare it to existing control charts under various scenarios. The Euclidean distance-based local K² control charts utilize a k nearest neighbor data description (Sukchotrat et al., 2009). For example, for each observation, one can find its k nearest neighbors using the Euclidean distance in the training data X; then the average value of the k nearest neighbor distances is used as the monitoring statistic
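The k nearest neighbor monitoring statistic described above can be sketched as follows. This is an illustrative version using only the Python standard library; the function name and the brute-force search are assumptions, and the observation `x` is treated as a new point that is not itself part of the training data.

```python
import math


def knn_average_distance(x, training, k):
    """Average Euclidean distance from x to its k nearest training points.

    x: a new observation (sequence of continuous values).
    training: in-control observations, each with the same length as x.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training points")
    # Brute-force search: compute all distances, then keep the k smallest.
    distances = sorted(math.dist(x, t) for t in training)
    return sum(distances[:k]) / k
```

Replacing `math.dist` with a mixed-data dissimilarity would give a distance-based statistic for mixture data in the same spirit, although the exact construction used in the paper is not shown in this excerpt.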

Real case study

We demonstrated the effectiveness and applicability of the proposed Gower distance-based control charts with a German credit dataset from the UCI machine learning repository (Frank et al., 2010). This dataset contains 1000 observations, each characterized by 13 categorical and seven continuous variables. Of the 1000 observations, 700 are considered in control (i.e., good customers) and 300 out of control (i.e., bad customers). The categorical variables include the checking account status

Conclusions

We have proposed multivariate control charts based on the Gower distance that can efficiently handle processes characterized by high-dimensional and mixture data. Simulation results revealed that the proposed Gower distance-based control charts outperformed the existing ones. This is especially true of the proposed local Gower distance-based chart, which consistently yielded better results than the others as the number of categorical variables increased. Moreover, we used a real dataset

Acknowledgement

The authors thank the editor and the referees, whose comments helped improve the presentation of this paper. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (2013007724) and the Ministry of Knowledge Economy in Korea under the IT R&D Infrastructure Program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-(B1110-1101-0002)).

References (37)

  • B. Efron et al. (1993). An introduction to the bootstrap.
  • T.T. El-Midany et al. (2010). A proposed framework for control chart pattern recognition in multivariate process using artificial neural networks. Expert Systems with Applications.
  • B.S. Everitt et al. (2011). Cluster analysis.
  • Frank, A., & Asuncion, A. (2010). UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA:...
  • Q.P. He et al. (2007). Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes. IEEE Transactions on Semiconductor Manufacturing.
  • H. Hotelling. Multivariate Quality Control.
  • J. Hu et al. (2010). Time-based detection of changes to multivariate patterns. Annals of Operations Research.
  • J. Hu et al. (2007). Tuned artificial contrasts to detect signals. International Journal of Production Research.