Interval joint robust regression method
Introduction
Progress in information technologies has provided tools to collect, store and manage big and complex data sets in many areas such as Biology, Meteorology and Telecommunications. Complex data arise in the case of unstructured and multi-source data (numerical, textual, image and social network data). Analysing these massive and complex data sets is still a challenge.
The aggregation, fusion and summarization of these big and complex data sets into a concise number of groups can be of interest for the sake of simplification or because the elements of interest are not the individual records. For example, in a big database of phone calls one may be interested in the set of calls made by a person rather than in a particular call.
The description of these groups of individuals needs to take into account their internal variability, and cannot be properly achieved by the usual single-valued variables (quantitative or qualitative). Symbolic Data Analysis (SDA) [3], [11] provides new variable types, such as interval-valued, set-valued and histogram-valued variables, that are capable of taking into account the internal variability inherent to the description of a group of individuals rather than a single individual.
Regression analysis provides useful tools to establish and understand the relationship between a dependent variable and a set of predictor variables. A wide class of regression models has been proposed for real-valued (single-valued) data in the last decades, representing an important research field with new methods in constant development. In this paper we focus on regression analysis of interval-valued data, where the variables take intervals as values and the objects are described by vectors of intervals. Interval-valued data account for the imprecision or uncertainty of measurements or for the internal variability present in the description of groups of individuals. Interval-valued data arise in situations such as the recording of temperature intervals at meteorological stations, daily stock price intervals, etc. In this paper, we are concerned with interval-valued data that represent internal variability.
Fuzzy-valued data are the preferred way in which interval data are managed when representing imprecision or uncertainty. Regression methods for fuzzy-valued variables have been developed mainly according to two approaches: linear and non-linear programming (see, for example, [5], [30], [31], [33], [47]) or the least squares method (see, for example, [6], [7], [9], [13], [14], [15], [16], [28]).
Concerning interval data representing the internal variability of groups of individuals, a number of regression methods have been proposed in the framework of SDA. Billard and Diday [2] treated the linear regression model for interval-valued variables as a least squares problem on the centers of the intervals. Other authors (e.g., [35], [43], [45], [46]) proposed linear regression methods for interval-valued variables fitted on both the center and the range of the intervals. In addition, some works (e.g., [20], [21], [22], [36]) considered constrained regression methods that guarantee that the bounds of the predicted interval of the dependent variable are not inverted. Regression methods that take into account a probabilistic support for the response variable were also considered [1], [34], [38]; these are able to provide inference techniques for the parameter estimates. Finally, Refs. [10], [18], [25], [27], [48] propose semi-parametric, non-parametric and quantile function regression methods.
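To make the center-and-range idea concrete, the following is a minimal sketch in Python with NumPy (the source reports an R implementation, so the language and all function names here are illustrative assumptions). It fits two independent ordinary least squares models, one on the interval centers and one on the interval ranges, with the range taken as upper bound minus lower bound and a single predictor. It is not the exact formulation of any of the cited methods.

```python
import numpy as np

def to_center_range(lower, upper):
    """Convert interval bounds to (center, range), with range = upper - lower."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return (lower + upper) / 2.0, upper - lower

def ols(x, y):
    """Ordinary least squares with an intercept; returns [intercept, slope]."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_center_range(x_lo, x_up, y_lo, y_up):
    """Fit one model on the centers and one on the ranges, independently."""
    xc, xr = to_center_range(x_lo, x_up)
    yc, yr = to_center_range(y_lo, y_up)
    return ols(xc, yc), ols(xr, yr)

def predict_interval(beta_c, beta_r, x_lo, x_up):
    """Rebuild the predicted interval from the predicted center and range."""
    xc, xr = to_center_range([x_lo], [x_up])
    yc = beta_c[0] + beta_c[1] * xc[0]
    yr = beta_r[0] + beta_r[1] * xr[0]
    return yc - yr / 2.0, yc + yr / 2.0
```

Nothing in this unconstrained sketch prevents a negative predicted range, in which case the predicted lower bound exceeds the upper bound. That inversion is the problem the constrained methods mentioned above are designed to avoid.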
Least squares regression methods can yield biased parameter estimates in the presence of outliers, i.e., observations that do not come from the same data-generating process as the rest of the data. Robust regression methods are designed not to be overly affected by outliers and leverage points [23], [29], [41], [42], [50]. Despite their importance, robust regression models for interval-valued data have received less attention and relatively few have been proposed. Domingues et al. [12] proposed an approach that considers a Gaussian distribution on the ranges and a Student-t distribution on the centers of the intervals. Fagundes et al. [17] consider two independent robust regressions on the centers and ranges of the intervals, in which Tukey’s bi-weight function penalizes the outliers. Later, Fagundes et al. [19] provided a quantile regression approach to interval-valued data.
Recently, Ref. [37] proposed the so-called iETKRR method, a robust regression method for interval-valued variables that considers exponential-type kernel functions. An iterative algorithm minimizes a suitable objective function that penalizes outliers in the centers and/or radii (or in the lower and/or upper boundaries) through the use of exponential-type kernel functions, in such a way that the weights assigned to outlying observations are as small as possible. This method performed better than (or as well as) the previously mentioned methods. Moreover, center (respectively, radius) outliers are penalized only in the center (respectively, only in the radius) regression. The same occurs for the lower (respectively, the upper) boundary.
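The weighting mechanism described above can be sketched as an iteratively reweighted least squares loop. The sketch below assumes a Gaussian kernel as the exponential-type kernel and a user-supplied width `gamma2`; the function names, the OLS starting point and the stopping rule are illustrative assumptions and are not taken from Ref. [37]. It handles one real-valued response (for example the centers), which mirrors the fact that iETKRR fits each component separately.

```python
import numpy as np

def gaussian_kernel_weights(residuals, gamma2):
    """Assumed exponential-type kernel: weights shrink towards 0 for large residuals."""
    return np.exp(-residuals ** 2 / (2.0 * gamma2))

def kernel_weighted_regression(x, y, gamma2, n_iter=50, tol=1e-8):
    """Iteratively reweighted least squares with kernel weights (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    # Start from the ordinary least squares solution.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    for _ in range(n_iter):
        w = gaussian_kernel_weights(y - X @ beta, gamma2)
        XtW = X.T * w  # scale each observation by its weight
        new_beta = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, gaussian_kernel_weights(y - X @ beta, gamma2)
```

In this sketch an observation whose residual is large relative to the width receives a weight close to zero and therefore has almost no influence on the next least squares step.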
Despite its usefulness, the approach of Ref. [37] is not able to take into account simultaneously the interrelations between the centers and radii (or the interrelations between lower and upper boundaries). Indeed, the approach of Ref. [37] handles the intervals by splitting them into centers and radii (or into lower and upper boundaries) to fit two independent regression models, i.e., the center (or the lower boundary) regression model takes into account only the information of the centers (or of the lower boundaries) and the radius (or upper boundary) regression model takes into account only the information of the radii (or of the upper boundaries).
The proposed method, hereafter named iJRR (interval joint robust regression), is able to take into account the interrelations between the centers and radii (or the interrelations between lower and upper boundaries). To the best of our knowledge, the proposed approach is one of the few regression methods (and certainly the first robust regression method) that take into account the full interval information in the fitted regression models.
The iJRR method is based on a suitable two-term objective function that jointly takes into consideration the information provided either by the center and the radius or by the lower and upper boundaries of the intervals. Consequently, the iJRR method fits two regression models, either on the center and the range or on the lower and upper boundaries of the intervals.
The main novelties of this paper are the following:
- A suitable objective function for interval-valued data that considers the full interval information provided by the center and the radius (or by the lower and upper boundaries) of the intervals;
- An iterative algorithm that provides parameter estimates for the center (or for the radius) regression model taking into account the full interval information (center plus radius). The same occurs in the lower (or in the upper) boundary regression model;
- The parameter estimates for the center and range regression models (or the lower and upper regression models) take into account the full interval information due to the objective function that is optimized;
- The objective function of the iJRR method allows an interval-valued outlier observation to be penalized in both regression equations (either center and range or lower and upper boundary models) through a weight computed with the use of exponential-type kernel functions;
- Interval observations with outliers on both center and range (or similarly, on both lower and upper boundaries) are penalized more heavily in the parameter estimation algorithm than observations with outliers only in the center or only in the range (or similarly, only in the lower boundary or only in the upper boundary);
- Two variants of the iJRR method are considered, aiming to provide more flexibility and robustness to the method for different types of outliers. In the first variant, the same width hyper-parameter is used to smooth the difference between the observed responses and the predictions. In the second variant, different width hyper-parameters are used for the same purpose;
- A new hyper-parameter estimator, based on the covariance definition for interval-valued variables proposed by Ref. [26], is considered in the new approach.
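The joint penalization listed above can be illustrated with a small sketch. It assumes, purely for illustration, that the weight of an interval observation is the product of two Gaussian kernels, one evaluated on the center residual and one on the range residual. The exact iJRR objective function and weight formula are not given in this excerpt, so `joint_weights`, `gamma2_c` and `gamma2_r` are hypothetical names, and the product form is an assumption rather than the authors' definition.

```python
import numpy as np

def joint_weights(res_center, res_range, gamma2_c, gamma2_r):
    """Hypothetical joint weight: product of a center kernel and a range kernel."""
    res_center = np.asarray(res_center, dtype=float)
    res_range = np.asarray(res_range, dtype=float)
    k_center = np.exp(-res_center ** 2 / (2.0 * gamma2_c))
    k_range = np.exp(-res_range ** 2 / (2.0 * gamma2_r))
    # A single weight per interval observation, shared by both regressions.
    return k_center * k_range

# Four observations: clean, center-only outlier, range-only outlier, both.
res_c = [0.0, 3.0, 0.0, 3.0]
res_r = [0.0, 0.0, 3.0, 3.0]
w = joint_weights(res_c, res_r, gamma2_c=1.0, gamma2_r=1.0)
```

Under this assumed form, the observation that is outlying in both components receives a smaller weight than the ones outlying in only one component, which is the qualitative behaviour described in the list. Passing equal widths loosely corresponds to the first variant and distinct widths to the second.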
The proposed iJRR method, as well as the previous robust methods of Refs. [12], [17], [19], [37], will be evaluated with synthetic data sets in terms of the bias and mean squared error (MSE) of the parameter estimates, taking into account X-space outliers, Y-space outliers, leverage points, different sample sizes and percentages of outliers in the sample, in a Monte Carlo simulation framework with replications. The performance of the proposed iJRR method will also be evaluated on real interval-valued data sets.
This work is organized as follows. Section 2 presents the iJRR method and the corresponding parameter estimation algorithm. Section 3 provides the Monte Carlo experiments that compare the iJRR method with the previous robust regression methods of Refs. [12], [17], [19], [37]. Section 4 shows applications with real interval-valued data sets. Section 5 provides the final remarks and conclusions.
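The evaluation criteria named above can be sketched as follows. The simulation design here (uniform predictor, unit-variance Gaussian noise, Y-space outliers created by a constant shift, ordinary least squares as the estimator, real-valued data standing in for interval centers) is an assumption made for illustration and is not the design used in the paper. Only the definitions of bias and MSE over replications are standard.

```python
import numpy as np

def simulate_once(rng, n, beta_true, outlier_frac, shift):
    """Generate one sample with Y-space outliers and return the OLS estimate."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = beta_true[0] + beta_true[1] * x + rng.normal(0.0, 1.0, size=n)
    n_out = int(round(outlier_frac * n))
    if n_out > 0:
        idx = rng.choice(n, size=n_out, replace=False)
        y[idx] += shift  # contaminate the response only
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

def bias_and_mse(estimates, beta_true):
    """Bias and MSE of each parameter across Monte Carlo replications."""
    errors = np.asarray(estimates, dtype=float) - np.asarray(beta_true, dtype=float)
    return errors.mean(axis=0), (errors ** 2).mean(axis=0)

def run_monte_carlo(n_rep, n, beta_true, outlier_frac, shift, seed=0):
    """Repeat the simulation n_rep times and summarize the estimates."""
    rng = np.random.default_rng(seed)
    estimates = [simulate_once(rng, n, beta_true, outlier_frac, shift)
                 for _ in range(n_rep)]
    return bias_and_mse(estimates, beta_true)
```

Replacing the least squares step in `simulate_once` with a robust estimator and varying `n` and `outlier_frac` gives the kind of comparison grid described in the text.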
The interval joint robust regression method
This section introduces the iJRR method considering two different variants and presents its parameter estimation algorithm as well as its time complexity and the convergence properties.
Monte Carlo experiments
In this section, the iJRR method will be compared with the previous robust regression methods for interval-valued data, namely iETKRR [37], IRR [17], QIR [19] and SSLR [12], with synthetic data sets in the framework of a Monte Carlo scheme. Prior to the comparison study, we will evaluate the influence of the different width hyper-parameter estimators in the two variants of the proposed method. The regression methods were implemented in R language [40] and all the experiments
Application to real interval-valued data sets
This section considers the application of the proposed method to real interval-valued data sets, aiming to highlight its usefulness in comparison with the robust methods SSLR [12], IRR [17], QIR [19] and iETKRR [37]. The non-robust CRM method [35] is also applied to these real data sets to highlight the need for robust methods when the data set is contaminated by outliers.
To assess the robustness of each method, the percentage of change of the parameter estimates when the outliers are
Concluding remarks
This paper proposed a new linear robust regression method for interval-valued data. The interval joint robust regression (iJRR) method takes into account the full interval information on the fitted regression models, i.e., iJRR is able to take into account either the interrelations between the centers and radii or the interrelations between lower and upper boundaries of the intervals. Besides, the iJRR method is resistant (robust) to the presence of interval outlier observations.
In the proposed
CRediT authorship contribution statement
Francisco de A.T. de Carvalho: Conceptualization, Methodology, Writing - review & editing. Eufrásio de A. Lima Neto: Conceptualization, Methodology, Writing - review & editing. Ullysses da N. Rosendo: Software, Data curation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors are grateful to the anonymous referees and the Associate Editor for their careful revision, valuable suggestions, and comments which improved this paper. The first author would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico – CNPq (311164/2020-0) for their financial support.
References (50)
- et al., A robust regression method based on exponential-type kernel functions, Neurocomputing (2017)
- Fuzzy least squares, Information Sciences (1988)
- et al., Off the beaten track: A new linear model for interval data, European Journal of Operational Research (2017)
- et al., A robust method for linear regression of symbolic interval data, Pattern Recognition Letters (2010)
- et al., Interval kernel regression, Neurocomputing (2014)
- et al., Constrained center and range joint model for interval-valued symbolic data regression, Computational Statistics and Data Analysis (2017)
- et al., A fuzzy inference system modeling approach for interval-valued symbolic data forecasting, Knowledge-Based Systems (2019)
- et al., Fuzzy linear regression models with least square errors, Applied Mathematics and Computation (2005)
- et al., Centre and range method for fitting a linear regression model to symbolic interval data, Computational Statistics and Data Analysis (2008)
- et al., Constrained linear regression models for symbolic interval-valued variable, Computational Statistics and Data Analysis (2010)
- An exponential-type kernel robust regression model for interval-valued variables, Information Sciences
- A weighted multivariate fuzzy c-means method in interval-valued scientific production data, Expert Systems with Applications
- Polygonal data analysis: A new framework in symbolic data analysis, Knowledge-Based Systems
- A parametrized approach for linear regression of interval data, Knowledge-Based Systems
- A resampling approach for interval-valued data regression, Statistical Analysis and Data Mining
- From the statistics of data to the statistics of knowledge: Symbolic data analysis, J. Amer. Statist. Assoc.
- A generalized fuzzy weighted least-squares regression, Fuzzy Sets and Systems
- Least absolute deviation estimator in fuzzy regression, Soft Computing
Francisco de A.T. de Carvalho received the Ph.D. degree in Computer Science in 1992 from Institut National de Recherche en Informatique et en Automatique (INRIA) and Université Paris-IX Dauphine, France. From 1992 to 1998, he was a lecturer at the Statistical Department at Universidade Federal de Pernambuco, Brazil. He joined the Center of Informatics at Universidade Federal de Pernambuco in 1999, where he is currently Full Professor. He has held visiting posts in several leading universities and research centers in Europe. With main research interests in symbolic data analysis, clustering analysis and machine learning, he has authored over 200 technical papers in international journals and conferences. He served as Coordinator (2005–2009) of the post-graduate program of computer science of the CIn/UFPE. He has been involved in program committees of many Brazilian and international conferences. He has also served as a reviewer for many international journals and conferences. He was a member of the council (2009–2013) of the International Association for Statistical Computing (IASC). He was a member of the council (2017–2020) of the Latin American Regional Section – LARS of the IASC. He is within the top 2% of scientists in the world in the field of Artificial Intelligence and Image Processing throughout his career and in the year 2019, according to a study by PLOS Biology (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000918).
Eufrásio de A. Lima Neto is an Associate Professor in the Department of Statistics and a faculty member of the Graduate Program in Computational and Mathematical Modelling at the Federal University of Paraíba. He has Bachelor’s and Master’s degrees in Statistics and a Ph.D. in Computer Science (Machine Learning). His main research interests are Statistical Modeling, Regression, Generalized Linear Models, Robust Regression, Clusterwise Regression, Machine Learning, Symbolic Data Analysis, Interval-valued Data and Kernel Methods. He is the author of over 50 technical papers in international journals and conferences.
Ullysses da N. Rosendo has a bachelor’s degree in Statistics from the Federal University of Paraíba (Brazil) and works at Keek Inteligência Analítica consulting. His main research interests are Statistical Modeling, Machine Learning, Python and R.