Extreme point bias compensation: A similarity method of functional clustering and its application to the stock market

https://doi.org/10.1016/j.eswa.2020.113949

Highlights

  • This paper presents a novel approach for functional clustering.

  • A numerical distance approach and a curve shape approach are combined.

  • The comparative analysis confirms superiority of the proposed method.

  • An illustrative example for stock data is provided.

Abstract

Functional clustering is based on functional similarity measures that are adapted to functional data. However, the existing functional similarity measures account either for the position (value) or temporal deviation (bias) of extreme points of the functional curves. This may lead to erroneous conclusions on the similarities of the curves. As a result, most functional clustering measures underperform in applications such as the analysis of stock market data. To address this methodological limitation, a new similarity measure that is based on extreme point bias compensation is proposed in this paper. By penalizing the curves with the temporal deviation of extreme points and rewarding the curves that are close to each other, the new similarity measure better reflects the shape of the curve. In addition, the proposed method overcomes the difficulty of unifying the dimensions of the horizontal and vertical axes (i.e., time and function value) when calculating the distance between two adjacent extreme points. Finally, an empirical example of stock return analysis verifies the validity of this new measure.

Introduction

Cluster analysis is an unsupervised learning method that is based on the characteristics of datasets. In traditional clustering analysis, the sample data are usually treated as vectors, which are then clustered (Nazari et al., 2019, Ramon-Gonen and Gelbard, 2017). However, with the rapid development of information technology, a large number of extensive datasets have emerged in many fields. Due to this large-scale data acquisition and its increasing complexity, the data have expanded beyond vectors (Blanquero, Carrizosa, Jimenez-Cordero, & Martin-Barragan, 2019). Such datasets include real-time monitoring data of numerical control machines in industry, magnetic resonance image data in medicine, climate data in meteorology, and share data of listed companies in finance (Henderson, 2006, Ramsay and Silverman, 2008). Given the complexity and amount of these data, data collection is effectively a continuous process that yields functional data (Park and Ahn, 2017, Song et al., 2018, Zhang and Wang, 2018). If functional data are simply clustered in the conventional way, some of the dynamic information may not be accounted for, which reduces the accuracy of the results. Therefore, functional clustering analysis has attracted a lot of attention in recent years (Cao et al., 2020, Gong et al., 2019, McGarry, 2013).

Functional clustering methods can be broadly classified into four types: raw data methods, filtering methods, adaptive methods, and distance-based methods (Jacques & Preda, 2014). However, apart from adaptive methods, which assume a probability distribution, all of these methods rely on distance measures. Hence, we can classify clustering methods into two broad categories: probability distribution-based methods and distance-based methods.

Because a functional random variable lacks a probability density function, the probability distribution-based methods cannot be applied directly (Ferraty et al., 2002, Leng and Muller, 2006). The distance-based functional clustering methods can be further subdivided into three groups: numerical distance-based methods, curve shape-based methods, and methods that combine numerical distance with the curve shape (Ferraty & Vieu, 2006).

While most of the earlier studies have relied on either numerical distances or curve shape similarity measures, methods that combine numerical distance and curve shape measures have also been developed, such as numerical distances combined with slope, kurtosis, skewness, and extreme points (Sharma, Shokeen, & Mathur, 2016). These methods can be regarded as an improvement of the functional clustering methods based on numerical distances. However, the shape parameters that are considered cannot fully reflect the shape of the curve along its domain (Antoniadis, Brossat, Cugliari, & Poggi, 2013). For instance, in the extreme point-based functional clustering methods, the distance of extreme points on the two curves is the basis for similarity measurement. Nevertheless, this approach only accounts for the proximity of two curves at the extreme points. In this case, it is impossible to identify the differences in the curvature if extreme points coincide.
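The limitation noted above can be illustrated with a minimal sketch. It is not the procedure of any cited paper: the function names `local_extrema` and `extreme_time_deviation`, the sign-change rule for detecting extrema, and the in-order pairing of extrema are all assumptions made for illustration. The two test curves share the timing of their extreme points but differ in shape everywhere else, so a purely temporal extreme-point measure cannot separate them.

```python
import math

def local_extrema(y):
    """Indices of interior local extrema: points where the discrete slope changes sign.
    Flat plateaus are ignored by this simple (assumed) rule."""
    return [i for i in range(1, len(y) - 1)
            if (y[i] - y[i - 1]) * (y[i + 1] - y[i]) < 0]

def extreme_time_deviation(f, g, t):
    """Sum of absolute time gaps between extrema of f and g, paired in order.
    The in-order pairing is a hypothetical choice for this sketch."""
    ef, eg = local_extrema(f), local_extrema(g)
    return sum(abs(t[i] - t[j]) for i, j in zip(ef, eg))

t = [i / 100 for i in range(101)]
f = [math.sin(2 * math.pi * x) for x in t]
g = [v ** 3 for v in f]  # same extreme-point timing, different curvature

deviation = extreme_time_deviation(f, g, t)          # 0.0: extrema coincide in time
largest_gap = max(abs(a - b) for a, b in zip(f, g))  # yet the curves clearly differ
```

Here `deviation` is zero although `largest_gap` is well above zero, which is the situation in which an extreme point-based measure alone fails to distinguish the curves.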

In conclusion, numerical distance-based similarity measures only reflect the differences of curves at an absolute level, without taking their dynamics (curvature) into account. Although curve shape-based similarity measures are able to overcome this shortcoming by considering certain points of the curves, the differences for the whole domain remain neglected (Gaffney and Smyth, 2004, Marc, 2012). In addition, similarity measures are often chosen in an arbitrary manner. Taking the curves of financial data as an example, long-term investors may prefer the numerical distance method to ascertain whether the functions (curves) differ along their domain. In contrast, short-term investors may prefer similarity measures that are based on curve shape and may group the curves with regard to the timing of fluctuation. Consequently, a unified measure is needed to avoid arbitrary choices (Ieva, Paganoni, Pigoli, & Vitelli, 2013). This motivates our proposal of a similarity measure that effectively combines numerical distance and curve shape for the functional clustering analysis.

To address these issues, a new similarity measure based on extreme point bias compensation is proposed in this paper. This measure accounts both for the distance between the curves (functions), based on the numerical distance defined over the whole domain of the function, and for differences in curvature, based on the deviation of extreme points. Therefore, both dimensions (i.e., time and range of the values) are used to define the shape of curves. Because functional data are used, the coefficients of the representative functions can be exploited to measure the distance (in terms of morphology) between the functions following Ieva et al. (2013). Ferraty and Vieu (2006) also proposed exploiting the coefficients of functions to measure similarity, although their approach is based on the numeric distance between the coefficient vectors. Distances based on extreme points (in both one and two dimensions) were previously discussed by Huang et al. (2016) and Ieva et al. (2013).
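The idea of measuring similarity through basis coefficients can be sketched as follows. This is an illustration under stated assumptions, not the construction used in the cited works: the orthonormal Fourier basis on [0, 1], the number of harmonics, and the helper names `trapz`, `fourier_coefficients`, and `coefficient_distance` are all chosen here for convenience. With an orthonormal basis, the Euclidean distance between coefficient vectors equals the L2 distance between the corresponding basis expansions.

```python
import math

def trapz(y, t):
    """Trapezoidal approximation of the integral of samples y over grid t."""
    return sum(0.5 * (y[i] + y[i + 1]) * (t[i + 1] - t[i]) for i in range(len(t) - 1))

def fourier_coefficients(y, t, n_harmonics=3):
    """Project a curve sampled on [0, 1] onto an orthonormal Fourier basis (assumed basis)."""
    coeffs = [trapz(y, t)]  # coefficient of the constant basis function 1
    for k in range(1, n_harmonics + 1):
        cos_k = [math.sqrt(2) * math.cos(2 * math.pi * k * x) for x in t]
        sin_k = [math.sqrt(2) * math.sin(2 * math.pi * k * x) for x in t]
        coeffs.append(trapz([a * b for a, b in zip(y, cos_k)], t))
        coeffs.append(trapz([a * b for a, b in zip(y, sin_k)], t))
    return coeffs

def coefficient_distance(c1, c2):
    """Euclidean distance between two coefficient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

t = [i / 200 for i in range(201)]
f = [math.sin(2 * math.pi * x) for x in t]
g = [v + 0.5 for v in f]  # same shape, shifted vertically by 0.5

d = coefficient_distance(fourier_coefficients(f, t), fourier_coefficients(g, t))
# d is approximately 0.5: only the constant coefficient differs
```

In this example the two curves differ only in level, so the whole distance is carried by the constant coefficient.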

The novelty of this paper is that the numerical distance and curve shape are considered simultaneously in functional clustering. To this end, we integrate the morphology-based distance (Ieva et al., 2013) and the extreme-points-based approach (Ieva et al., 2013, Huang et al., 2016). The penalization due to the deviation of the extreme points is integrated into the measure of curvilinear distance. The proposed method can improve the accuracy and stability of the empirical models because it addresses both the morphology and numeric distance of the given curves.
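To make the notion of penalizing a numerical distance by extreme-point deviation concrete, the following sketch may help. It is explicitly not the measure proposed in the paper, whose formula is not reproduced in this text: the multiplicative form, the weight `lam`, the mean time gap, and the function names are hypothetical choices made only to show how the two ingredients could interact.

```python
import math

def l2_distance(f, g, t):
    """Trapezoidal approximation of the L2 distance between two sampled curves."""
    sq = [(a - b) ** 2 for a, b in zip(f, g)]
    return math.sqrt(sum(0.5 * (sq[i] + sq[i + 1]) * (t[i + 1] - t[i])
                         for i in range(len(t) - 1)))

def extrema_times(y, t):
    """Times of interior local extrema (discrete slope changes sign)."""
    return [t[i] for i in range(1, len(y) - 1)
            if (y[i] - y[i - 1]) * (y[i + 1] - y[i]) < 0]

def penalized_distance(f, g, t, lam=1.0):
    """Hypothetical combination: L2 distance inflated by the mean time gap
    between extrema paired in order. Not the paper's formula."""
    pairs = list(zip(extrema_times(f, t), extrema_times(g, t)))
    gap = sum(abs(a - b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return l2_distance(f, g, t) * (1.0 + lam * gap)

t = [i / 100 for i in range(101)]
f = [math.sin(2 * math.pi * x) for x in t]
g = [v + 0.5 for v in f]                             # level shift, extrema aligned
h = [math.sin(2 * math.pi * (x - 0.1)) for x in t]   # extrema delayed by 0.1

aligned = penalized_distance(f, g, t)   # no penalty: equals the plain L2 distance
delayed = penalized_distance(f, h, t)   # penalty raises the plain L2 distance
```

With aligned extrema the penalty factor is one, so `aligned` equals the plain L2 distance; for the delayed curve the factor exceeds one. Note that `lam` implicitly carries units of inverse time in this toy form, which is the kind of dimensional issue the abstract says the proposed method is designed to avoid.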

The rest of this paper is organized as follows. Section 2 briefly reviews the related similarity measures. Section 3 proposes a new similarity measure for functional clustering and compares it with other similarity measures. Section 4 provides an illustrative example to demonstrate the application of the method. Section 5 concludes and discusses some possible directions for future research.

Section snippets

Exposition of the existing similarity measures

The basic principle of functional clustering based on numerical distances is to extend the idea of distance measurement that is used in traditional clustering analysis, where the similarity of data is still measured at an absolute level (Bouveyron & Jacques, 2011). Assuming that the data are realizations of a certain function, the information about this function can be used to calculate the distance measures. There are two ways to apply a functional clustering based on the numerical distances.
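As a minimal sketch of a numerical distance between curves, the L2 distance can be approximated from samples on a shared grid. The trapezoidal rule, the common grid, and the function name `l2_distance` are assumptions made here for illustration; the text above does not specify which numerical distance or which of the two approaches is used.

```python
import math

def l2_distance(f, g, t):
    """Approximate the L2 distance between two curves sampled on a shared grid t."""
    sq = [(a - b) ** 2 for a, b in zip(f, g)]
    # Trapezoidal rule for the integral of the squared difference.
    integral = sum(0.5 * (sq[i] + sq[i + 1]) * (t[i + 1] - t[i])
                   for i in range(len(t) - 1))
    return math.sqrt(integral)

t = [i / 100 for i in range(101)]
f = [math.sin(2 * math.pi * x) for x in t]
g = [v + 0.5 for v in f]                              # same shape, shifted in level
h = [math.sin(2 * math.pi * (x - 0.25)) for x in t]   # same level, shifted in time

d_level = l2_distance(f, g, t)  # approximately 0.5
d_time = l2_distance(f, h, t)   # approximately 1.0
```

The distance responds to both kinds of difference but reports each only as a single absolute number, without indicating whether the curves differ in level or in the timing of their fluctuations.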

Functional clustering method based on deviation compensation of the extreme points

The core idea underlying similarity measures based on the curve shape is to assign a penalty for the deviation in time between extreme points. It is obvious that the closer the extreme points of the two curves are, the smaller the distance between them will be. However, the deviation distance (in terms of time) between extreme points cannot represent the overall distance between the curves because it can only measure local morphological differences. Thus, the existing methods that measure curve

Empirical application

In the stock market, many investors may attach importance to the form of the price curve (Przekota et al., 2019, Raudys and Pabarskaitė, 2018). Investors engaged in financial arbitrage need to find at least two targets exhibiting higher correlation of the price curve. The accuracy and stability of arbitrage models can be improved by using cluster analysis (Aghabozorgi and Teh, 2014, Nanda et al., 2010). By applying cluster analysis, the price curves of investment objects are classified into

Conclusion

Recent studies in functional cluster analysis have emphasized similarity measures based on numerical distances and curve shapes. A new similarity measure based on extreme points bias compensation and numerical distance is proposed in this paper to unify the virtues of these two approaches. The new method was applied to a set of stock data for Chinese companies. The empirical analysis suggested that the similarity measure based on extreme points bias compensation can measure the numerical

CRediT authorship contribution statement

Lirong Sun: Conceptualization, Methodology, Writing - original draft. Kaili Wang: Conceptualization, Data curation, Methodology, Writing - original draft. Tomas Balezentis: Methodology, Writing - review & editing. Dalia Streimikiene: Writing - review & editing. Chonghui Zhang: Writing - original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Social Science Foundation of China (No. 18BTJ037, No. 16ZDA053).

References (38)

  • Ramon-Gonen, R. et al. Cluster evolution analysis: Identification and detection of similar clusters and migration patterns. Expert Systems with Applications (2017)

  • Zhang, X.K. et al. Optimal weighting schemes for longitudinal and functional data. Statistics & Probability Letters (2018)

  • Abraham, C. et al. Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics (2003)

  • Antoniadis, A. et al. Clustering functional data using wavelets. International Journal of Wavelets, Multiresolution and Information Processing (2013)

  • Bouveyron, C. et al. Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification (2011)

  • Ferraty, F. et al. Functional nonparametric model for time series: a fractal approach for dimension reduction. TEST: An Official Journal of the Spanish Society of Statistics and Operations Research (2002)

  • Ferraty, F. et al.

  • Gaffney, S. et al. Joint probabilistic curve clustering and alignment

  • Giacofci, M. et al. Wavelet-based clustering for mixed-effects functional models in high dimension. Biometrics (2013)