Extreme point bias compensation: A similarity method of functional clustering and its application to the stock market
Introduction
Cluster analysis is an unsupervised learning method that is based on the characteristics of datasets. In traditional clustering analysis, the sample data are usually treated as vectors, and these vectors are then used for clustering (Nazari et al., 2019, Ramon-Gonen and Gelbard, 2017). However, with the rapid development of information technology, large and complex datasets have emerged in many fields. Due to this large-scale data acquisition and its increasing complexity, the data have expanded beyond vectors (Blanquero, Carrizosa, Jimenez-Cordero, & Martin-Barragan, 2019). Such datasets include real-time monitoring data of numerical control machines in industry, magnetic resonance image data in medicine, climate data in meteorology, and share data of listed companies in finance (Henderson, 2006, Ramsay and Silverman, 2008). Given the complexity and volume of these data, data collection is typically a continuous process that yields functional data (Park and Ahn, 2017, Song et al., 2018, Zhang and Wang, 2018). If functional data are simply clustered in the conventional way, some of the dynamic information may not be accounted for, which reduces the accuracy of the results. Therefore, functional clustering analysis has attracted a lot of attention in recent years (Cao et al., 2020, Gong et al., 2019, McGarry, 2013).
Functional clustering methods can be broadly classified into four types: raw data methods, filtering methods, adaptive methods, and distance-based methods (Jacques & Preda, 2014). Of these, only adaptive methods rest on an assumed probability distribution; the other three rely on distance measures. Hence, functional clustering methods fall into two broad categories: probability distribution-based methods and distance-based methods.
Because a functional random variable has no probability density function, probability distribution-based methods cannot be applied directly (Ferraty et al., 2002, Leng and Muller, 2006). The distance-based functional clustering methods can be further subdivided into three groups: numerical distance-based methods, curve shape-based methods, and methods that combine numerical distance with the curve shape (Ferraty & Vieu, 2006).
While most of the earlier studies have relied on either numerical distances or curve shape similarity measures, methods that combine the two have also been developed, such as numerical distances combined with slope, kurtosis, skewness, and extreme points (Sharma, Shokeen, & Mathur, 2016). These methods can be regarded as an improvement of the functional clustering methods based on numerical distances. However, the shape parameters that are considered cannot fully reflect the shape of the curve along its domain (Antoniadis, Brossat, Cugliari, & Poggi, 2013). For instance, in the extreme point-based functional clustering methods, the distance between the extreme points of two curves is the basis for similarity measurement. This approach only accounts for the proximity of the two curves at their extreme points. Consequently, differences in curvature cannot be identified when the extreme points coincide.
In summary, numerical distance-based similarity measures only reflect the differences between curves at an absolute level, without taking their dynamics (curvature) into account. Although curve shape-based similarity measures overcome this shortcoming by considering certain points of the curves, the differences over the whole domain remain neglected (Gaffney and Smyth, 2004, Marc, 2012). In addition, similarity measures are often chosen in an arbitrary manner. Taking the curves of financial data as an example, long-term investors may prefer the numerical distance method to ascertain whether the functions (curves) differ along their domain. In contrast, short-term investors may prefer similarity measures that are based on curve shape and may group the curves with regard to the timing of fluctuations. Consequently, a unified measure is needed to avoid arbitrary choices (Ieva, Paganoni, Pigoli, & Vitelli, 2013). This motivates our proposal of a similarity measure that effectively combines numerical distance and curve shape for functional clustering analysis.
To address these issues, a new similarity measure based on extreme points bias compensation is proposed in this paper. This measure accounts both for the distance between the curves (functions), using a numerical distance defined over the whole domain of the function, and for differences in curvature, using the deviation of extreme points. Therefore, both dimensions (i.e., time and range of the values) are used to define the shape of curves. Because functional data are used, the coefficients of the representative functions can be exploited to measure the distance (in terms of morphology) between the functions following Ieva et al. (2013). Indeed, Ferraty and Vieu (2006) also proposed to exploit the coefficients of functions to measure the similarity, although their approach is based on the numerical distance between the coefficient vectors. The distances based on the extreme points (both in one and two dimensions) were previously discussed by Huang et al. (2016) and Ieva et al. (2013).
The novelty of this paper is that the numerical distance and curve shape are considered simultaneously in functional clustering. To this end, we integrate the morphology-based distance (Ieva et al., 2013) and the extreme-points-based approach (Huang et al., 2016; Ieva et al., 2013). The penalization due to the deviation of the extreme points is integrated into the measure of curvilinear distance. The proposed method can improve the accuracy and stability of the empirical models because it addresses both the morphology and the numerical distance of the given curves.
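One generic way to write a distance that adds an extreme-point penalty to a whole-domain numerical distance is given below. This is an illustrative assumption, not the formula proposed in the paper: the weight \lambda, the number of matched extreme points m, and the extreme-point times \tau_k^x and \tau_k^y are hypothetical symbols introduced only for this sketch.

```latex
% Illustrative form only; \lambda, m, \tau_k^x, \tau_k^y are assumed symbols.
d_{\lambda}(x, y)
  = \left( \int_{T} \bigl( x(t) - y(t) \bigr)^{2} \, dt \right)^{1/2}
  + \lambda \, \frac{1}{m} \sum_{k=1}^{m} \bigl\lvert \tau_{k}^{x} - \tau_{k}^{y} \bigr\rvert ,
  \qquad \lambda \ge 0 .
```

With \lambda = 0 the expression reduces to a purely numerical distance, while a larger \lambda gives more weight to the timing of the extreme points.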
The rest of this paper is organized as follows. Section 2 briefly reviews the related similarity measures. Section 3 proposes a new similarity measure for functional clustering and compares it with other similarity measures. Section 4 provides an illustrative example to demonstrate the application of the method. Section 5 concludes and discusses some possible directions for future research.
Section snippets
Exposition of the existing similarity measures
The basic principle of functional clustering based on numerical distances is to extend the idea of distance measurement that is used in traditional clustering analysis, where the similarity of data is still measured at an absolute level (Bouveyron & Jacques, 2011). Assuming that the data are realizations of a certain function, the information about this function can be used to calculate the distance measures. There are two ways to apply functional clustering based on numerical distances.
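As an illustration of a numerical distance between two curves, the following minimal sketch approximates an L2 distance on a common sampling grid. The function name l2_distance, the grid, and the synthetic sine curves are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def l2_distance(t, x, y):
    """Approximate the L2 distance between two curves sampled on a common grid t."""
    sq = (np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2
    # Trapezoidal rule, written out explicitly, integrates the squared difference.
    integral = np.sum((sq[1:] + sq[:-1]) * np.diff(t)) / 2.0
    return float(np.sqrt(integral))

# Hypothetical curves on [0, 1].
t = np.linspace(0.0, 1.0, 201)
x = np.sin(2 * np.pi * t)
y = x + 0.5                          # same shape, shifted level
z = np.sin(2 * np.pi * (t - 0.1))    # same level, shifted in time

d_xy = l2_distance(t, x, y)   # distance caused by a pure level shift
d_xz = l2_distance(t, x, z)   # distance caused by a pure time shift
print(d_xy, d_xz)
```

Such a measure compares the curves at an absolute level over the whole domain, but it does not by itself indicate whether a difference comes from the level or from the timing of the fluctuations.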
Functional clustering method based on deviation compensation of the extreme points
The core idea underlying similarity measures based on the curve shape is to assign a penalty for the deviation in time between extreme points. It is obvious that the closer the extreme points of the two curves are, the smaller the distance between them will be. However, the deviation distance (in terms of time) between extreme points cannot represent the overall distance between the curves because it can only measure local morphological differences. Thus, the existing methods that measure curve
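The time-deviation penalty described in this snippet can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' definition: extreme points are detected from sign changes of first differences, each extreme point of one curve is matched to the nearest extreme point of the other, and maxima and minima are not distinguished. The function names are hypothetical.

```python
import numpy as np

def extreme_times(t, x):
    """Return the times of interior local maxima and minima of a sampled curve."""
    d = np.diff(np.asarray(x, dtype=float))
    # A sign change between consecutive first differences marks a local extreme.
    idx = np.where(d[:-1] * d[1:] < 0)[0] + 1
    return np.asarray(t)[idx]

def extreme_time_deviation(t, x, y):
    """Mean absolute time gap from each extreme of x to the nearest extreme of y."""
    ex, ey = extreme_times(t, x), extreme_times(t, y)
    if len(ex) == 0 or len(ey) == 0:
        return 0.0  # assumption: no penalty when a curve has no interior extreme
    gaps = np.abs(ex[:, None] - ey[None, :]).min(axis=1)
    return float(gaps.mean())

# Hypothetical curves on [0, 1].
t = np.linspace(0.0, 1.0, 201)
x = np.sin(2 * np.pi * t)
shifted_in_time = np.sin(2 * np.pi * (t - 0.1))
shifted_in_level = x + 0.5

dev_time = extreme_time_deviation(t, x, shifted_in_time)
dev_level = extreme_time_deviation(t, x, shifted_in_level)
print(dev_time, dev_level)
```

In this example the penalty registers the time shift but is zero for the level shift, which illustrates why a deviation in extreme-point timing captures only local morphological differences and not the overall distance between curves.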
Empirical application
In the stock market, many investors may attach importance to the form of the price curve (Przekota et al., 2019, Raudys and Pabarskaitė, 2018). Investors engaged in financial arbitrage need to find at least two targets exhibiting higher correlation of the price curve. The accuracy and stability of arbitrage models can be improved by using cluster analysis (Aghabozorgi and Teh, 2014, Nanda et al., 2010). By applying cluster analysis, the price curves of investment objects are classified into
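To illustrate the search for highly correlated price curves mentioned above, the following sketch uses synthetic random-walk prices and the Pearson correlation. The data, the variable names, and the choice of correlation as the criterion are assumptions for this example only; they do not reproduce the paper's data or its similarity measure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 250

# Synthetic prices: stocks A and B share a common random-walk factor,
# while stock C follows an independent random walk.
common = np.cumsum(rng.normal(0.0, 1.0, n_days))
prices = np.vstack([
    100.0 + common + rng.normal(0.0, 0.1, n_days),       # stock A
    80.0 + common + rng.normal(0.0, 0.1, n_days),        # stock B
    50.0 + np.cumsum(rng.normal(0.0, 1.0, n_days)),      # stock C
])

# Pairwise correlation between price curves (rows are stocks).
corr = np.corrcoef(prices)
np.fill_diagonal(corr, -np.inf)   # ignore self-correlation

# Indices of the most correlated pair of distinct stocks.
i, j = np.unravel_index(np.argmax(corr), corr.shape)
pair = {int(i), int(j)}
print(pair, corr[i, j])
```

A correlation screen of this kind compares curves pairwise; a clustering step would instead group all curves at once from a full pairwise distance matrix.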
Conclusion
Recent studies in functional cluster analysis have emphasized similarity measures based on numerical distances and curve shapes. A new similarity measure based on extreme points bias compensation and numerical distance is proposed in this paper to unify the virtues of these two approaches. The new method was applied to a set of stock data for Chinese companies. The empirical analysis suggested that the similarity measure based on extreme points bias compensation can measure the numerical
CRediT authorship contribution statement
Lirong Sun: Conceptualization, Methodology, Writing - original draft. Kaili Wang: Conceptualization, Data curation, Methodology, Writing - original draft. Tomas Balezentis: Methodology, Writing - review & editing. Dalia Streimikiene: Writing - review & editing. Chonghui Zhang: Writing - original draft.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Social Science Foundation of China (No. 18BTJ037, No. 16ZDA053).
References (38)
- et al. Stock market co-movement assessment using a three-phase clustering method. Expert Systems with Applications (2014).
- et al. Functional-bandwidth kernel for support vector machine with functional data: An alternating optimization algorithm. European Journal of Operational Research (2019).
- et al. Automatic feature group combination selection method based on GA for the functional regions clustering in DBS. Computer Methods and Programs in Biomedicine (2020).
- A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert Systems with Applications (2011).
- et al. Time series k-means: A new k-means type smooth subspace clustering for time series data. Information Sciences (2016).
- et al. Funclust: A curves clustering method using functional random variables density approximation. Neurocomputing (2013).
- Discovery of functional protein groups by clustering community links and integration of ontological knowledge. Expert Systems with Applications (2013).
- et al. Comparison study of orthonormal representations of functional data in classification. Knowledge-Based Systems (2016).
- et al. A new distance with derivative information for functional k-means clustering algorithm. Information Sciences (2018).
- et al. Clustering Indian stock market data for portfolio management. Expert Systems with Applications (2010).