Interval joint robust regression method

doi:10.1016/j.neucom.2021.08.129

Neurocomputing

Volume 465, 20 November 2021, Pages 265-288

Highlights

• The paper provides a robust regression method for interval-valued variables.
• The objective function of the method considers the full interval information.
• The computation of the sum of squared errors uses exponential-type kernel functions.
• Outliers have a small weight for both center and radius parameter estimates.
• Applications on synthetic and real data sets corroborate the proposed method.

Abstract

Interval-valued data are needed to manage either the uncertainty related to measurements or the variability inherent to the description of complex objects representing groups of individuals. A number of regression methods suitable for interval variables describing the variability of complex objects are already available. However, less attention has been given to methods that simultaneously take into account the full interval information and are resistant to interval outlier observations, even though atypical observations are frequent in interval-valued data sets. This paper proposes a new robust linear regression method for interval variables, where the presence of outliers in either the center or the radius penalizes both the center and the radius regression models. Moreover, interval observations with outliers in both center and radius are penalized more than observations with outliers only in the center (or only in the radius). In addition, this paper provides a suitable iterative algorithm to estimate the parameters of the proposed method. The algorithm estimates the parameters of the center (or of the radius) model taking into account information from both the center and the radius. The convergence and time complexity of the iterative algorithm are also presented. Finally, the performance of the new method is compared with some previous robust regression approaches and evaluated on synthetic and real interval-valued data sets.

Introduction

Progress in information technologies has provided tools to collect, store and manage big and complex data sets in many areas such as Biology, Meteorology and Telecommunications. Complex data arise in the case of unstructured and multi-source data (numerical, textual, image and social network data). Analysing these massive and complex data sets is still a challenge.

The aggregation, fusion and summarization of these big and complex data sets into a concise number of groups can be of interest for the sake of simplification or because the elements of interest are not the individual records. For example, in a large database of phone calls, one may be interested in the set of calls made by a person rather than in a particular call.

The description of these groups of individuals needs to take into account their internal variability, and cannot be properly achieved by the usual single-valued variables (quantitative or qualitative). Symbolic Data Analysis (SDA) [3], [11] provides new variable types, such as interval-valued, set-valued and histogram-valued variables, that are capable of taking into account the internal variability inherent to the description of a group of individuals rather than a single individual.

Regression analysis provides interesting tools to establish and understand the relationship between a dependent variable and a set of predictor variables. A wide class of regression models has been proposed for real-valued (single-valued) data in the last decades, representing an important research field with new methods in constant development. In this paper we focus on regression analysis of interval-valued data, where each variable takes an interval as its value and the objects are described by vectors of intervals. Interval-valued data capture either the imprecision or uncertainty of measurements or the internal variability present in the description of groups of individuals. Interval-valued data arise in situations such as the recording of interval temperatures at meteorological stations, daily interval stock prices, etc. In this paper, we are concerned with interval-valued data that represent internal variability.
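The center and radius of an interval, which are referred to throughout this paper, can be illustrated with a short sketch. It assumes the usual definitions (center as the midpoint, radius as half the width); the function names and the temperature values are hypothetical and serve only as an illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower, upper), with lower <= upper


def to_center_radius(interval: Interval) -> Tuple[float, float]:
    """Return (center, radius): midpoint and half-width of the interval."""
    lower, upper = interval
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return (lower + upper) / 2.0, (upper - lower) / 2.0


def to_bounds(center: float, radius: float) -> Interval:
    """Inverse mapping: rebuild (lower, upper) from center and radius."""
    return center - radius, center + radius


# Hypothetical daily temperature intervals (min, max) in degrees Celsius.
temperatures: List[Interval] = [(12.0, 21.0), (9.5, 18.5), (14.0, 25.0)]
centers_radii = [to_center_radius(t) for t in temperatures]
```

Both representations carry the same information, so a regression method may work either on (lower, upper) or on (center, radius) pairs.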

Fuzzy-valued data are the preferred way of managing interval data that represent imprecision or uncertainty. Regression methods for fuzzy-valued variables were developed mainly according to two approaches: using linear and non-linear programming (see, for example, [5], [30], [31], [33], [47]) or using the least squares method (see, for example, [6], [7], [9], [13], [14], [15], [16], [28]).

Concerning interval data representing the internal variability of groups of individuals, a number of regression methods have been proposed in the framework of SDA. Billard and Diday [2] treated the linear regression model for interval-valued variables as a least squares problem on the centers of the intervals. Other authors (e.g., [35], [43], [45], [46]) proposed linear regression methods for interval-valued variables fitted on both the center and the range of the intervals. In addition, some works (e.g., [20], [21], [22], [36]) considered constrained regression methods that aim to guarantee that there is no inversion of the bounds of the predicted interval of the dependent variable. Regression methods that take into account a probabilistic support for the response variable were also considered [1], [34], [38]. They are able to provide inference techniques for the parameter estimates. Finally, Refs. [10], [18], [25], [27], [48] propose semi-parametric, non-parametric and quantile function regression methods.
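The idea of fitting one model on the centers and another on the ranges can be sketched as follows. This is a minimal illustration with a single predictor and ordinary least squares, assuming the half-range (radius) is used for the second model; it is not the implementation of any of the cited methods, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Line = Tuple[float, float]  # (intercept, slope)


def ols_line(x: Sequence[float], y: Sequence[float]) -> Line:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_center_and_radius(
    x_int: List[Interval], y_int: List[Interval]
) -> Tuple[Line, Line]:
    """Fit two independent lines: one on centers, one on radii."""
    xc = [(a + b) / 2.0 for a, b in x_int]
    xr = [(b - a) / 2.0 for a, b in x_int]
    yc = [(a + b) / 2.0 for a, b in y_int]
    yr = [(b - a) / 2.0 for a, b in y_int]
    return ols_line(xc, yc), ols_line(xr, yr)


def predict_interval(models: Tuple[Line, Line], x: Interval) -> Interval:
    """Predict (lower, upper) as predicted center -/+ predicted radius."""
    (c0, c1), (r0, r1) = models
    a, b = x
    center = c0 + c1 * (a + b) / 2.0
    radius = r0 + r1 * (b - a) / 2.0
    # Without constraints the predicted radius may be negative, which
    # would invert the bounds; this sketch does not guard against it.
    return center - radius, center + radius


# Hypothetical toy data: centers follow 1 + 2*xc, radii 0.5 + 0.5*xr.
x_data = [(0.0, 2.0), (2.0, 6.0), (4.0, 10.0)]
y_data = [(2.0, 4.0), (7.5, 10.5), (13.0, 17.0)]
models = fit_center_and_radius(x_data, y_data)
prediction = predict_interval(models, (1.0, 5.0))
```

The comment in `predict_interval` points at the bound inversion problem that the constrained methods mentioned above are designed to avoid.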

Least squares estimates for regression methods can provide biased parameter estimates due to the presence of outliers, i.e., observations that do not come from the same data-generating process as the rest of the data. Robust regression methods are designed not to be overly affected by outliers and leverage points [23], [29], [41], [42], [50]. Despite their importance, less attention has been given to robust regression models for interval-valued data, and relatively few works have been proposed. Domingues et al. [12] proposed an approach that considers a Gaussian distribution on the ranges and a Student-t distribution on the centers of the intervals. Fagundes et al. [17] consider two independent robust regressions on the centers and ranges of the intervals, in which a Tukey's bi-weight function penalizes the outliers. Later, Fagundes et al. [19] provided a quantile regression approach to interval-valued data.
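For readers unfamiliar with Tukey's bi-weight function, the following sketch shows the standard textbook weight function. The tuning constant 4.685 is the conventional default for standardized residuals; whether Ref. [17] uses this exact constant is not stated here, so treat it as an assumption.

```python
def tukey_biweight_weight(residual: float, scale: float, c: float = 4.685) -> float:
    """Tukey bi-weight: (1 - (u/c)^2)^2 for |u| < c, and 0 otherwise.

    `residual / scale` is the standardized residual u; `c` is the tuning
    constant (4.685 is a common default, assumed here for illustration).
    """
    u = residual / scale
    if abs(u) >= c:
        return 0.0  # observations far from the fit are discarded entirely
    return (1.0 - (u / c) ** 2) ** 2
```

A residual of zero receives weight 1, the weight decreases smoothly as the standardized residual grows, and it is exactly 0 beyond the cut-off `c`.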

Recently, Ref. [37] proposed the so-called iETKRR method, a robust regression method for interval-valued variables that considers exponential-type kernel functions. An iterative algorithm minimizes a suitable objective function that penalizes outliers in the centers and/or radii (or in the lower and/or upper boundaries) through the use of exponential-type kernel functions, in such a way that the weights computed for the outlier observations are as small as possible. This method performed better than (or as well as) the previous methods mentioned above. In that method, center outliers are penalized only in the center regression and radius outliers only in the radius regression. The same holds for the lower and upper boundaries.
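The way an exponential-type kernel can down-weight outliers may be sketched with an iteratively reweighted least squares loop. This is a minimal illustration for a single real-valued predictor, assuming a Gaussian kernel weight exp(-(y - ŷ)² / (2γ²)) and a user-supplied width `gamma2`; it is not the iETKRR implementation of Ref. [37], and the function names and data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def weighted_line(
    x: Sequence[float], y: Sequence[float], w: Sequence[float]
) -> Tuple[float, float]:
    """Weighted least squares fit of y = intercept + slope * x."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def kernel_robust_line(
    x: Sequence[float],
    y: Sequence[float],
    gamma2: float,
    max_iter: int = 100,
    tol: float = 1e-10,
) -> Tuple[Tuple[float, float], List[float]]:
    """Alternate between Gaussian-kernel weights and a weighted fit."""
    weights = [1.0] * len(x)
    b0, b1 = weighted_line(x, y, weights)  # start from ordinary least squares
    for _ in range(max_iter):
        # Large residuals give weights close to zero.
        weights = [
            math.exp(-((yi - (b0 + b1 * xi)) ** 2) / (2.0 * gamma2))
            for xi, yi in zip(x, y)
        ]
        new_b0, new_b1 = weighted_line(x, y, weights)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return (b0, b1), weights


# Hypothetical data on the line y = 1 + 2x with one gross outlier at x = 9.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x]
y[9] = 100.0
(b0, b1), weights = kernel_robust_line(x, y, gamma2=4.0)
```

The result depends strongly on the width `gamma2`: if it is too small, all weights can vanish, and if it is too large, the fit approaches ordinary least squares. This is one reason why the choice of width hyper-parameter estimator matters for methods of this type.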

Despite its usefulness, the approach of Ref. [37] is not able to take into account simultaneously the interrelations between the centers and radii (or the interrelations between lower and upper boundaries). Indeed, the approach of Ref. [37] manages the intervals by splitting them into centers and radii (or into lower and upper boundaries) to fit two independent regression models, i.e., the center (or the lower boundary) regression model takes into account only the information of the centers (or of the lower boundaries) and the radius (or upper boundary) regression model takes into account only the information of the radii (or of the upper boundaries).

The proposed method, hereafter named iJRR (interval joint robust regression), is able to take into account the interrelations between the centers and radii (or the interrelations between lower and upper boundaries). To the best of our knowledge, the proposed approach is one of the few regression methods (and certainly the first robust regression method) that takes into account the full interval information in the fitted regression models.

The iJRR method is based on a suitable objective function of two terms that jointly takes into consideration the information provided either by the center and the radius or by the lower and upper boundaries of the intervals. Consequently, the iJRR method fits two regression models, either on the center and the range or on the lower and upper boundaries of the intervals.

The main novelties of this paper are the following:

• A suitable objective function for interval-valued data that considers the full interval information provided by the center and the radius (or the full interval information provided by the lower and upper boundaries) of the intervals;
• An iterative algorithm that provides parameter estimates for the center (or for the radius) regression model taking into account the full interval information (center plus radius). The same occurs in the lower (or in the upper) boundary regression model;
• The parameter estimates for the center and range regression models (or the lower and upper regression models) take into account the full interval information due to the objective function that is optimized;
• The objective function of the iJRR method allows an interval-valued outlier observation to penalize both regression equations (either center and range or lower and upper boundary models) through a weight computed with the use of exponential-type kernel functions;
• Interval observations with outliers in both center and range (or similarly, in both lower and upper boundaries) are penalized more in the parameter estimation algorithm than observations with outliers only in the center or only in the range (or similarly, only in the lower boundary or only in the upper boundary);
• Two variants of the iJRR method are considered, aiming to provide more flexibility and robustness to the method for different types of outliers. In the first variant, the same width hyper-parameter is used to smooth the difference between the observed responses and the predictions. In the second variant, different width hyper-parameters are used for the same aim;
• A new hyper-parameter estimator based on the covariance definition for interval-valued variables, proposed by Ref. [26], is considered in the new approach.
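The qualitative behaviour described in the list above (an outlier in either component lowers a single shared weight, and an outlier in both lowers it further) can be illustrated with a toy weight. The iJRR objective function is not reproduced in this text, so the product of two Gaussian kernels below is purely an illustrative assumption, not the authors' formula; the function names and width values are hypothetical.

```python
import math


def gaussian_kernel(residual: float, gamma2: float) -> float:
    """Exponential-type (Gaussian) kernel of a residual with width gamma2."""
    return math.exp(-(residual ** 2) / (2.0 * gamma2))


def joint_weight(
    center_residual: float,
    radius_residual: float,
    gamma2_center: float,
    gamma2_radius: float,
) -> float:
    """Hypothetical joint weight shared by the center and radius models.

    Assumption for illustration only: the product of a kernel on the
    center residual and a kernel on the radius residual.
    """
    return gaussian_kernel(center_residual, gamma2_center) * gaussian_kernel(
        radius_residual, gamma2_radius
    )


# Same width for both components (in the spirit of the first variant).
w_clean = joint_weight(0.0, 0.0, 1.0, 1.0)
w_center_outlier = joint_weight(3.0, 0.0, 1.0, 1.0)
w_both_outliers = joint_weight(3.0, 3.0, 1.0, 1.0)

# Different widths per component (in the spirit of the second variant).
w_both_wide_radius = joint_weight(3.0, 3.0, 1.0, 9.0)
```

Under this assumption the observation with outlying center and radius receives the smallest weight, and using a wider kernel for the radius softens the penalty contributed by the radius residual.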

The proposed iJRR method, as well as the previous robust methods of Refs. [12], [17], [19], [37], will be evaluated with synthetic data sets in terms of the bias and mean squared error (MSE) of the parameter estimates, taking into account X-space outliers, Y-space outliers, leverage points, different sample sizes and percentages of outliers in the sample, in a Monte Carlo simulation framework with 10,000 replications. The performance of the proposed iJRR method will also be evaluated on real interval-valued data sets.
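The bias and MSE of a parameter estimate over Monte Carlo replications can be computed as in the sketch below. It is a simplified illustration with a plain least squares slope on clean real-valued data and only 500 replications, not the experimental design of the paper; the data-generating settings and function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bias_and_mse(estimates: Sequence[float], true_value: float) -> Tuple[float, float]:
    """Bias = mean(estimate) - truth; MSE = mean((estimate - truth)^2)."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, mse


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def simulate_slope_estimates(n_rep: int, seed: int = 0) -> List[float]:
    """Repeatedly draw y = 1 + 2x + N(0, 1) noise and estimate the slope."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    x = [float(i) for i in range(20)]
    estimates = []
    for _ in range(n_rep):
        y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
        estimates.append(ols_slope(x, y))
    return estimates


slope_bias, slope_mse = bias_and_mse(simulate_slope_estimates(500), true_value=2.0)
```

In a robustness study, the same two summaries would be computed for each method after contaminating a chosen percentage of the simulated observations with outliers.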

This work is organized as follows. Section 2 presents the iJRR method and the corresponding parameter estimation algorithm. Section 3 provides the Monte Carlo experiments that compare the iJRR method with the previous robust regression methods of Refs. [12], [17], [19], [37]. Section 4 shows applications with real interval-valued data sets. Section 5 provides the final remarks and conclusions.

Section snippets

The interval joint robust regression method

This section introduces the iJRR method considering two different variants and presents its parameter estimation algorithm as well as its time complexity and the convergence properties.

Monte Carlo experiments

In this section, the iJRR method will be compared with the previous robust regression methods for interval-valued data, namely iETKRR [37], IRR [17], QIR [19] and SSLR [12], with synthetic data sets in the framework of a Monte Carlo scheme. Prior to the comparison study, we will evaluate the influence of the different width hyper-parameter estimators in the variants iJRR.1 and iJRR.2 of the proposed method. The regression methods were implemented in the R language [40] and all the experiments

Application to real interval-valued data sets

This section considers the application of the proposed method to real interval-valued data sets, aiming to highlight its usefulness in comparison with the robust methods SSLR [12], IRR [17], QIR [19] and iETKRR [37]. The non-robust CRM method [35] is also applied to these real data sets to highlight the need for robust methods when the data set is contaminated by outliers.

To assess the robustness of each method, the percentage of change of the parameter estimates when the outliers are

Concluding remarks

This paper proposed a new linear robust regression method for interval-valued data. The interval joint robust regression (iJRR) method takes into account the full interval information on the fitted regression models, i.e., iJRR is able to take into account either the interrelations between the centers and radii or the interrelations between lower and upper boundaries of the intervals. Besides, the iJRR method is resistant (robust) to the presence of interval outlier observations.

In the proposed

CRediT authorship contribution statement

Francisco de A.T. de Carvalho: Conceptualization, Methodology, Writing - review & editing. Eufrásio de A. Lima Neto: Conceptualization, Methodology, Writing - review & editing. Ullysses da N. Rosendo: Software, Data curation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors are grateful to the anonymous referees and the Associate Editor for their careful revision, valuable suggestions, and comments which improved this paper. The first author would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico – CNPq (311164/2020-0) for their financial support.

References (50)

F.A.T. de Carvalho et al., A robust regression method based on exponential-type kernel functions, Neurocomputing (2017)
P. Diamond, Fuzzy least squares, Information Sciences (1988)
S. Dias et al., Off the beaten track: A new linear model for interval data, European Journal of Operational Research (2017)
M.A.O. Domingues et al., A robust method for linear regression of symbolic interval data, Pattern Recognition Letters (2010)
R.A.A. Fagundes et al., Interval kernel regression, Neurocomputing (2014)
P. Hao et al., Constrained center and range joint model for interval-valued symbolic data regression, Computational Statistics and Data Analysis (2017)
L. Maciel et al., A fuzzy inference system modeling approach for interval-valued symbolic data forecasting, Knowledge-Based Systems (2019)
M. Modarres et al., Fuzzy linear regression models with least square errors, Applied Mathematics and Computation (2005)
E.A. Lima Neto et al., Centre and range method for fitting a linear regression model to symbolic interval data, Computational Statistics and Data Analysis (2008)
E.A. Lima Neto et al., Constrained linear regression models for symbolic interval-valued variable, Computational Statistics and Data Analysis (2010)
E.A. Lima Neto et al., An exponential-type kernel robust regression model for interval-valued variables, Information Sciences (2018)
B.A. Pimentel et al., A weighted multivariate fuzzy c-means method in interval-valued scientific production data, Expert Systems with Applications (2014)
W.J.F. Silva et al., Polygonal data analysis: A new framework in symbolic data analysis, Knowledge-Based Systems (2019)
L.C. Souza et al., A parametrized approach for linear regression of interval data, Knowledge-Based Systems (2017)
J. Ahn et al., A resampling approach for interval-valued data regression, Statistical Analysis and Data Mining (2012)
L. Billard et al., From the statistics of data to the statistics of knowledge: Symbolic data analysis, J. Amer. Statist. Assoc. (2003)
B. Caputo, K. Sim, F. Furesjo, A. Smola, Appearance-based object recognition using SVMs: which kernel should I use? in:...
E. Ping-Teng Chang et al., A generalized fuzzy weighted least-squares regression, Fuzzy Sets and Systems (1996)
Seung Hoe Choi et al., Least absolute deviation estimator in fuzzy regression, Soft Computing (2008)
R. Coppi, P. D'Urso, P. Giordani, A. Santoro, Least squares estimation of a linear regression model with LR fuzzy...
P. D'Urso, Linear regression analysis for fuzzy/crisp input and fuzzy/crisp output data, Computational Statistics & Data...
P. D'Urso, T. Gastaldi, A least-squares approach to fuzzy linear regression analysis, Computational Statistics & Data...
P. D'Urso, R. Massari, Weighted least squares and least median squares estimation for the fuzzy linear regression...

Francisco de A.T. de Carvalho received the Ph.D. degree in Computer Science in 1992 from Institut National de Recherche en Informatique et en Automatique (INRIA) and Université Paris-IX Dauphine, France. From 1992 to 1998, he was a lecturer at the Statistics Department at Universidade Federal de Pernambuco, Brazil. He joined the Center of Informatics at Universidade Federal de Pernambuco in 1999, where he is currently Full Professor. He has held visiting posts in several leading universities and research centers in Europe. With main research interests in symbolic data analysis, clustering analysis and machine learning, he has authored over 200 technical papers in international journals and conferences. He has served as Coordinator (2005–2009) of the post-graduate program of computer science of the CIn/UFPE. He has been involved in program committees of many Brazilian and international conferences. He has also served as a reviewer for many international journals and conferences. He was a member of the council (2009–2013) of the International Association for Statistical Computing (IASC). He was a member of the council (2017–2020) of the Latin American Regional Section – LARS of the IASC. He is within the top 2% of scientists in the world in the field of Artificial Intelligence and Image Processing throughout his career and in the year 2019 according to a study by Plos Biology (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000918).

Eufrásio de A. Lima Neto is an Associate Professor in the Department of Statistics and faculty member of the Graduate Program in Computational and Mathematical Modelling at the Federal University of Paraíba. He has Bachelor's and Master's degrees in Statistics and a Ph.D. in Computer Science (Machine Learning). His main research interests are Statistical Modeling, Regression, Generalized Linear Models, Robust Regression, Clusterwise Regression, Machine Learning, Symbolic Data Analysis, Interval-valued Data and Kernel Methods. He is the author of over 50 technical papers in international journals and conferences.

Ullysses da N. Rosendo has a bachelor's degree in Statistics from the Federal University of Paraíba (Brazil) and works at Keek Inteligência Analítica consulting. His main research interests are Statistical Modeling, Machine Learning, Python and R.
