
Information Fusion

Volume 27, January 2016, Pages 161-169

Outlier elimination using granular box regression

https://doi.org/10.1016/j.inffus.2015.04.001

Highlights

  • We employ granular box regression to eliminate the outliers.

  • We propose penalty schemes, on instances or boxes, to configure granular boxes.

  • We investigate the performance in terms of regression analysis and box configuration.

  • It offers better linear models for data sets with high and low rates of outliers.

  • The penalty scheme on instances improves regression in 72% and box configuration in 99% of the tested cases.

Abstract

A regression method should fit a curve to a data set irrespective of outliers. This paper modifies granular box regression approaches to deal with data sets containing outliers. Each approach incorporates a three-stage procedure: granular box configuration, outlier elimination, and linear regression analysis. The first stage investigates two objective functions, each applying a different penalty scheme, on boxes or on instances. The second stage investigates two methods of outlier elimination, after which linear regression is performed in the third stage. The performance of the proposed granular box regressions is investigated in terms of volume of boxes, insensitivity of boxes to outliers, elapsed time for box configuration, and regression error. The proposed approach offers a better linear model, with smaller error, on the given data sets containing a variety of outlier rates. The investigation shows the superiority of applying the penalty scheme on instances.

Introduction

Simplification and abstraction help us understand data and trace its general pattern or trend. While the term "abstraction" is associated with studies in artificial intelligence, the term "granularity" is its synonym in soft computing studies [1], [2], [3], [4]. Granularity often aims at reducing the complexity of data that drives up processing cost, mostly where uncertainty is involved. Certain practical needs therefore motivate studies on granularity, namely clarity, low-cost approximation, and tolerance of uncertainty. One application of these practices, through understanding the data, is identifying or eliminating anomalies, known as outliers. Granulated data can benefit computation in three ways. (i) To understand the data: by reducing its complexity to a smaller representation known as granules. When a method performs an estimation based on granular data, the outcome should, as a requisite, be more accurate than one based on the original complex data. (ii) To reduce the cost of data analysis: by avoiding complex tools that must run through the data for insight discovery [2], [5], [6], [7]; thus, a non-expert in data mining can also make sense of it. (iii) To increase the power of estimation and the capability of dealing with uncertainty: a datum, as part of a granule, does not represent only a single observation but, rather, a group of data.

However, besides the advantages noted above, granulation requires methods that sharpen the transparency of data. Granular box regression analysis [8], [9] carries this out by detecting the outliers in the data. It finds the correlation between the dependent and independent variables using hyper-dimensional interval numbers known as boxes. In granular box regression (GBR), every instance in the data set affects the size and coordinates of the boxes; as a result, the approach becomes sensitive to outliers. To resolve this, we propose variations of granular box regression based on the subset of data they process, and we then investigate their performance in the presence of outliers in a data set. There are two motivations for the proposed variations: first, to simplify a data set containing numerous data points, thus helping a non-expert by clarifying the relationship between dependent and independent variables; second, to study the performance of each variation of granular box regression in the presence of outliers.

An outlier is an anomalous object that is atypical of the remaining data. It deviates from other objects so markedly that it arouses suspicion of having been generated by a different mechanism [10]. Outliers are treated either as a disturbance to be eliminated, as in noise reduction [10], [11], [12], [13], [14], or as a target of detection, as in crime detection [15], [16], [17], [18], [19], [20], [21], [22], [23]. This paper focuses on the former view. Different approaches [10], [11], [15], [24] have studied outliers; this paper differentiates itself by applying a granular box approach and eliminating the outliers. To measure the goodness of the applied approach, either the boxes themselves or the relationships between boxes can indicate the quality of the box configuration on the data. In the case of box measurement, the approach should minimize the overall volume of the boxes to reduce the complexity of the data, as intended; in the case of measuring the relationships of boxes, an approach should build a coherent relationship similar to the true function by regression analysis. A possible approach to the former issue is to employ a genetic algorithm (GA) [25], [26], [27] to find the optimal volumes gradually. Performing the box configuration with a GA builds the relationships between boxes so as to reduce the complexity of the data and exclude the outliers. As a result, the configured boxes represent a simplified version of the original data.
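To make the GA idea concrete, the following is a minimal sketch of the kind of fitness function such a search could minimize: total box volume plus a penalty for instances left outside every box, in the spirit of the instance-based penalty scheme. The flat chromosome layout, the helper names, and the weight `lam` are our assumptions for illustration, not the paper's exact formulation.

```python
def fitness(chromosome, data, n_boxes, dim, lam=10.0):
    """chromosome: flat list [lo_1..lo_d, hi_1..hi_d] repeated per box."""
    size = 2 * dim
    boxes = [chromosome[i * size:(i + 1) * size] for i in range(n_boxes)]

    def covers(box, point):
        # A point is covered if it lies inside the box in every dimension.
        lows, highs = box[:dim], box[dim:]
        return all(lo <= x <= hi for lo, x, hi in zip(lows, point, highs))

    def volume(box):
        # Product of side lengths; degenerate sides contribute zero.
        lows, highs = box[:dim], box[dim:]
        v = 1.0
        for lo, hi in zip(lows, highs):
            v *= max(hi - lo, 0.0)
        return v

    total_volume = sum(volume(b) for b in boxes)
    # Instance-based penalty: count points not covered by any box.
    uncovered = sum(1 for p in data if not any(covers(b, p) for b in boxes))
    return total_volume + lam * uncovered  # smaller is fitter

data = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(fitness([0, 0, 2, 2], data, n_boxes=1, dim=2))  # 14.0: volume 4 + one uncovered point
```

A GA would evolve chromosomes against this objective, trading tighter boxes against coverage of the data; the weight `lam` controls how strongly stray instances are penalized.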

This paper is organized into seven sections. Section 2 reviews granular box regression and explains its notions. Section 3 explains the three stages of the proposed framework to configure the boxes, eliminate the outliers, and fit the curve, where box-based penalization (BP) and instance-based penalization (IP) are proposed for box configuration, and clean- and candidate-based methods are proposed for the elimination of outliers. Section 5 reveals the results on six data sets, each with two rates of outliers. Section 6 then gives detailed analyses in three parts, investigating the regression analysis and box configuration with respect to the effect of the dimensionality of the data and the rate of outliers on each method. Section 7 concludes the achieved results and addresses future work for each method.

Section snippets

Granular box regression

A regression approach should eliminate outliers prior to fitting a model to the data; otherwise, it may fit the outlier data and produce a wrong interpretation. Granular box regression (GBR) is an inclusive approach that detects outliers in every dimension of the data. Compared with classical regression analysis (CRA), which operates only on the response variable [27], [28], [29], GBR also operates on the predictors to detect outliers. Where the CRA approach minimizes the summation of distances between the
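The box notion above can be sketched as one closed interval per dimension, covering predictors and response alike. This is an illustrative reading of the hyper-dimensional interval numbers described in the snippet; the class and method names are ours, not the authors'.

```python
class Box:
    """A box as one closed interval [low, high] per dimension."""

    def __init__(self, lows, highs):
        # lows[d] <= highs[d] gives the box's interval in dimension d
        self.lows, self.highs = list(lows), list(highs)

    def contains(self, point):
        # Covered only if the point falls inside the interval in every
        # dimension (predictors and response alike), unlike CRA, which
        # looks at the response variable only.
        return all(lo <= x <= hi
                   for lo, x, hi in zip(self.lows, point, self.highs))

    def volume(self):
        # Product of side lengths; smaller total volume means a tighter,
        # less complex granulation of the data.
        v = 1.0
        for lo, hi in zip(self.lows, self.highs):
            v *= hi - lo
        return v

box = Box([0.0, 0.0], [2.0, 1.0])
print(box.contains([1.0, 0.5]))  # True: inside in both dimensions
print(box.contains([1.0, 3.0]))  # False: outside in the second dimension
print(box.volume())              # 2.0
```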

Proposed granular box regression

An overall view of our proposed GBR is shown in Fig. 2. It illustrates the idea of the penalty scheme, where the box configuration represents the simplification of the data by clean instances or candidates of boxes. Concretely, Fig. 3 shows the procedure of performing GBR in three stages: (i) apply a granular box configuration, (ii) exclude the outliers based on the dispersion of data in each box, and (iii) apply linear regression analysis on the results of the second stage as the remaining data
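The three stages can be sketched as follows, assuming the boxes have already been configured (stage i). Stage (ii) keeps only instances covered by some box, and stage (iii) fits ordinary least squares on what remains. The helper names and the single-predictor restriction are our simplifications for illustration.

```python
def in_any_box(point, boxes):
    # boxes: list of (lows, highs) pairs, one interval per dimension
    return any(all(lo <= x <= hi for lo, x, hi in zip(lows, point, highs))
               for lows, highs in boxes)

def gbr_pipeline(data, boxes):
    # Stage (ii): outlier elimination via box coverage
    clean = [p for p in data if in_any_box(p, boxes)]
    # Stage (iii): simple one-predictor least squares on the clean data
    # (assumes at least two clean points with distinct x values)
    xs = [p[0] for p in clean]
    ys = [p[1] for p in clean]
    n = len(clean)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Points on y = 2x, plus one outlier at (5, 100) outside every box
data = [(0, 0), (1, 2), (2, 4), (5, 100)]
boxes = [((-1, -1), (3, 5))]
slope, intercept = gbr_pipeline(data, boxes)
print(round(slope, 6), round(intercept, 6))  # 2.0 0.0
```

Because the outlier falls outside the configured box, the fitted line recovers the underlying trend exactly; fitting all four points instead would tilt the line toward (5, 100).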

Preparation and computing goodness of model

We investigate the performance of all variations of granular box regression given in Table 1. We conducted 100 runs to report the average and standard deviation for each configuration. As given in Table 2, we used the following six data sets: micro-economic data for the Customer Purchase Index (CPI) of Germany, the artificial data set used in [9], two data sets generated by Eqs. (10) and (11), and the servo and combined cycle power plant (CCPP) data sets. We generated 1000 instances to produce synthetic

Results

To produce the results of the proposed GBRs, we performed linear regression analysis on the clean data and on the candidates of three configured boxes. The results are averaged over 100 runs in terms of residual error. Table 4 gives the results based on rates of outliers over each data set. The comparisons investigate six methods against their equivalent S approach, as each of the three GBRs performs two methods, based either on clean data or on candidates of boxes. Table 4, Table 5 reveal that IP, BP and P

Analyses of proposed granular box regression

This section provides analyses of the proposed GBRs based on 12 variations of six data sets, i.e., the artificial data set [9], two data sets generated by Eqs. (10) and (11), CPI, servo, and CCPP. Regression analysis and box configuration yield the results with the following measurements: (i) for regression: residual error, rate of outliers, and statistical analyses; and (ii) for box configuration: elapsed time, volume of boxes, and standard deviation of box volumes.

Conclusion

This paper investigated the insensitivity of granular box regressions. They configure granular boxes on the data and then perform regression analysis on subsets of the granulated data. We modified two approaches for granular box configuration: the first penalizes a box that does not contain the required number of data points; the second penalizes instances not confined by any box. We then performed outlier elimination by applying two methods to keep the major trend of the data

Acknowledgements

The Universiti Teknologi Malaysia (UTM) and Ministry of Education Malaysia under Research University Grants 00M19, 02G71 and 4F550 are hereby acknowledged for some of the facilities that were utilized during the course of this research work.

References (53)

  • J.J. Buckley et al.

    Linear and non-linear fuzzy regression: evolutionary algorithm solutions

    Fuzzy Sets Syst.

    (2000)
  • P. Tüfekci

    Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods

    Int. J. Electr. Power Energy Syst.

    (2014)
  • S. Ekinci et al.

    Predictions of oil/chemical tanker main design parameters using computational intelligence techniques

    Appl. Soft Comput.

    (2011)
  • D. Hernández-Lobato et al.

    Empirical analysis and evaluation of approximate techniques for pruning regression bagging ensembles

    Neurocomputing

    (2011)
  • J.R. Hobbs, Granularity, in: Proceedings of the Ninth International Joint Conference on Artificial Intelligence,...
  • Y. Yao, Human-inspired granular computing, in: Novel Developments in Granular Computing: Applications for Advanced...
  • Y. Yao

    Interpreting concept learning in cognitive informatics and granular computing

    IEEE Trans. Syst. Man Cybern., Part B: Cybern.

    (2009)
  • Y. Yao, J. Luo, Top-down progressive computing, in: Rough Sets and Knowledge Technology, Springer, 2011, pp....
  • G. Peters

    Granular box regression

    IEEE Trans. Fuzzy Syst.

    (2011)
  • D.M. Hawkins

    Identification of Outliers

    (1980)
  • V.J. Hodge et al.

    A survey of outlier detection methodologies

    Artif. Intell. Rev.

    (2004)
  • J. Fox

    Regression Diagnostics: An Introduction

    (1991)
  • V. Barnett et al.

    Outliers in Statistical Data

    (1994)
  • N. Devarakonda, S. Subhani, S.A.H. Basha, Outliers detection in regression analysis using partial least square...
  • A. Dastanpour, S. Ibrahim, R. Mashinchi, Using genetic algorithm to supporting artificial neural network for intrusion...
  • V. Chandola et al.

    Anomaly detection: a survey

    ACM Comput. Surv. (CSUR)

    (2009)