Representing financial time series based on data point importance
Introduction
Recently, the increasing use of temporal data, in particular time series data, has initiated various research and development attempts in the field of data and knowledge management (Last et al., 2001). A time series is a collection of observations made chronologically. Time series data are typically large in size, high in dimensionality and continuously updated. Moreover, a time series is always considered as a whole rather than as a set of individual numerical fields.
Research on time series data covers a variety of topics, for example, finding similar time series (Liao et al., 2004), querying time series databases (Rafiei and Mendelzon, 2000), segmentation (Wang and Willett, 2004; Feng et al., 2005), dimensionality reduction (Keogh et al., 2000; Keogh et al., 2001), clustering (Policker and Geva, 2000), classification (Wang and Willett, 2004) and forecasting (Pantazopoulos et al., 1998; Sfetsos and Siriopoulos, 2004; Sfetsos and Siriopoulos, 2005). These topics have been studied in considerable detail by both the database and pattern recognition communities for different domains of time series data (Keogh and Kasetty, 2002). While most research has concentrated on the above issues, the fundamental problem of how to represent a time series at multiple resolutions, which can also be regarded as information granulation (Bargiela and Pedrycz, 2003), has not yet been fully addressed. Representation is essential because time series data are hard to manipulate in their original structure. Therefore, defining a more effective and efficient time series representation scheme is of fundamental importance.
The time series data used in data and knowledge management are high dimensional, so their dimensionality must be reduced before they can be processed and analyzed. This is commonly done using approaches that focus on lower bounding the Euclidean distance. These approaches, however, smooth out salient points of the original time series, which is counterproductive for financial time series, as financial analysis often depends on the shape of the data and the salience of data points to identify technical patterns. For such purposes, it is important to reduce dimensionality while retaining the information associated with the salient points, which are regarded as the points that matter most to the shape of the time series.
Previous approaches to reducing dimensionality while retaining point information have included sampling. In this approach, a rate of m/x is used, where m is the length of time series P and x is the dimension after dimensionality reduction. Sampling has the drawback of distorting the shape of the sampled (compressed) time series if the sampling rate is too low. As already noted, most other time series dimensionality reduction approaches, such as principal component analysis (PCA) (Fukunaga, 1990), singular value decomposition (SVD) (Korn et al., 1997), discrete Fourier transform (DFT) (Agrawal et al., 1993; Rafiei and Mendelzon, 2000; Chu and Wong, 1999), discrete wavelet transform (DWT) (Popivanov and Miller, 2002; Kahveci and Singh, 2001; Chan and Fu, 1999), piecewise aggregate approximation (PAA) (Keogh et al., 2000; Yi and Faloutsos, 2000) and adaptive piecewise constant approximation (APCA) (Keogh et al., 2001), focus on lower bounding the Euclidean distance. However, because such approaches often lose important data points, they may fail to retain the general shape of the time series after compression (Fig. 1).
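The shape distortion caused by sampling can be illustrated with a small sketch. The Python snippet below is illustrative only: the function name `uniform_sample`, the use of an integer step m // x, and the toy price list are assumptions made for this example and do not come from the sampling method as originally published.

```python
def uniform_sample(series, x):
    """Reduce a series of length m to x points by keeping every (m // x)-th value.

    Assumption: the sampling rate m/x is rounded down to an integer step.
    """
    m = len(series)
    if not 0 < x <= m:
        raise ValueError("x must be between 1 and the series length")
    step = m // x
    return series[::step][:x]


# Toy series (hypothetical data) with a single salient peak at index 4.
prices = [10, 11, 10, 12, 30, 12, 11, 10, 11, 10, 9, 10]

# m = 12 and x = 4 give a step of 3, so indices 0, 3, 6 and 9 are kept.
sampled = uniform_sample(prices, 4)
print(sampled)
print("peak retained:", max(prices) in sampled)
```

Because the peak does not fall on a sampled index, it disappears from the reduced series, which is the kind of shape distortion described above.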
A time series is a sequence of data points, and the amplitude of each data point influences the shape of the time series to a different extent. That is, each data point has its own importance to the time series. One data point may contribute to the overall shape of the time series, while another may have little influence or may even be discarded. For example, frequently appearing technical patterns are typically characterized by a few salient points. A head-and-shoulders pattern, for instance, consists of a head point, two shoulder points and a pair of neck points. These points are perceptually important in the human visual identification process and are therefore more important than the other data points in the time series. A data point identified through such an importance calculation is called a perceptually important point (PIP). PIP identification was first introduced by Chung et al. (2001) and used for pattern matching of technical (analysis) patterns in financial applications. The idea was later found to be similar to a technique proposed about 30 years ago by Douglas and Peucker (1973) for reducing the number of points required to represent a line (see also Hershberger and Snoeyink, 1992). We also found independent works by Perng et al. (2000), Pratt and Fink (2002) and Fink and Pratt (2003) that build on similar ideas. However, none of these techniques proposes a data structure to organize and store the identified salient points.
In this paper, we propose a time series representation framework based on the concept of data point importance. The challenges are how to recognize these salient points and how to design a data structure for them that facilitates incremental updating and multi-resolution retrieval and supports dimensionality reduction. The proposed framework can reduce the time series dimension to different levels of detail based on the importance of the data points, while the original accuracy can be maintained and salient points are not distorted. We then propose a tree data structure that stores the data points of the time series, and we introduce methods for efficiently processing accumulated new data points, for maintaining the data structure incrementally to avoid expensive recomputation, and for accessing the tree to retrieve the data points according to their importance.
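The idea of retrieving a series at different levels of detail can be sketched as follows. This is a minimal illustration under stated assumptions, not the tree structure proposed in this paper: the importance ranking `order` is taken as given (a hypothetical, hand-written ranking), the ranked points are kept in a flat list, and the names `reconstruct` and `max_error` as well as the choice of linear interpolation between retained points are assumptions of this example.

```python
def reconstruct(series, importance_order, k):
    """Rebuild a series from its k most important points by linear interpolation.

    importance_order lists point indices from most to least important
    (assumed input). The two end points must be among the k retained points.
    """
    if k < 2 or k > len(importance_order):
        raise ValueError("k must be between 2 and the number of ranked points")
    kept = sorted(importance_order[:k])  # back to chronological order
    if kept[0] != 0 or kept[-1] != len(series) - 1:
        raise ValueError("the two end points must be among the retained points")
    approx = []
    for left, right in zip(kept, kept[1:]):
        for i in range(left, right):
            t = (i - left) / (right - left)
            approx.append(series[left] + t * (series[right] - series[left]))
    approx.append(series[kept[-1]])
    return approx


def max_error(series, approx):
    """Largest absolute difference between the original and the approximation."""
    return max(abs(a - b) for a, b in zip(series, approx))


series = [1.0, 2.0, 1.5, 4.0, 1.0, 2.5, 2.0]  # toy data
order = [0, 6, 3, 4, 2, 5, 1]  # hypothetical ranking, most important first

coarse = reconstruct(series, order, 3)  # low level of detail
full = reconstruct(series, order, 7)    # every point retained

print(max_error(series, coarse), max_error(series, full))
```

In this sketch, retained points are reproduced exactly at every level of detail, so a salient point such as the peak at index 3 is not smoothed out, and retaining all points returns the original series.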
The remainder of this paper is organized as follows. Section 2 describes the concept of data point importance and three methods for evaluating the data point importance. Section 3 describes the proposed time series representation framework, the proposed Specialized Binary Tree (SB-Tree) algorithm, and how the SB-Tree is used to create, update, retrieve and reduce the dimension of time series. In Section 4, we analyze the results of the experiments, and a mobile application of the proposed representation is demonstrated in Section 5. Section 6 offers our conclusion.
Section snippets
Defining and evaluating data point importance
In this section, we describe the concept of data point importance based on identifying the perceptually important points (PIPs). Then, we introduce three methods for evaluating the importance of the PIPs in a time series: Euclidean distance (PIP-ED), perpendicular distance (PIP-PD) and vertical distance (PIP-VD). A simple example is given at the end of this section to illustrate the PIP identification process using the different data point importance evaluation methods.
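The three evaluation methods and the identification process can be sketched in Python as below. The exact formulas are not given in this excerpt, so the sketch assumes the usual geometric reading of the three names: for a candidate point p3 lying between two adjacent PIPs p1 and p2, PIP-ED is taken as the sum of the Euclidean distances from p3 to p1 and p2, PIP-PD as the perpendicular distance from p3 to the line through p1 and p2, and PIP-VD as the vertical distance from p3 to that line. The identification loop (start from the two end points and repeatedly add the point with the largest distance) is likewise an assumption, and all function names are hypothetical.

```python
import math


def euclidean_distance(p1, p2, p3):
    """PIP-ED (assumed form): sum of distances from p3 to both adjacent PIPs."""
    return (math.hypot(p3[0] - p1[0], p3[1] - p1[1])
            + math.hypot(p3[0] - p2[0], p3[1] - p2[1]))


def perpendicular_distance(p1, p2, p3):
    """PIP-PD (assumed form): perpendicular distance from p3 to line p1-p2."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    numerator = abs((y2 - y1) * x3 - (x2 - x1) * y3 + x2 * y1 - y2 * x1)
    return numerator / math.hypot(x2 - x1, y2 - y1)


def vertical_distance(p1, p2, p3):
    """PIP-VD (assumed form): vertical gap between p3 and line p1-p2."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    y_on_line = y1 + (y2 - y1) * (x3 - x1) / (x2 - x1)
    return abs(y_on_line - y3)


def identify_pips(series, n_pips, distance=vertical_distance):
    """Return point indices in the order in which they are identified as PIPs."""
    if len(series) < 2:
        raise ValueError("at least two data points are required")
    pips = [0, len(series) - 1]  # the two end points come first
    target = min(n_pips, len(series))
    while len(pips) < target:
        ordered = sorted(pips)
        best_index, best_score = None, -1.0
        # Scan every non-PIP point between each pair of adjacent PIPs.
        for left, right in zip(ordered, ordered[1:]):
            p1, p2 = (left, series[left]), (right, series[right])
            for i in range(left + 1, right):
                score = distance(p1, p2, (i, series[i]))
                if score > best_score:
                    best_index, best_score = i, score
        pips.append(best_index)
    return pips


series = [1.0, 2.0, 1.5, 4.0, 1.0, 2.5, 2.0]  # toy data
print(identify_pips(series, 4, vertical_distance))
```

Passing `euclidean_distance` or `perpendicular_distance` as the `distance` argument switches the evaluation method without changing the identification loop, which makes it easy to compare the orderings the three methods produce on the same series.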
Tree representation for dimensionality reduction
Managing financial time series data at multiple resolutions requires a suitable time series representation data structure. In Section 3.1, we therefore describe a tree structure for representing financial time series that is based on determining the data point importance in the time series. Instead of storing the time series data according to time or transforming it into other domains (e.g. the frequency domain), data points of a time series are stored
Experimental results
In this section, we evaluate the performance of the data point importance evaluation methods (PIP-ED, PIP-PD and PIP-VD), the proposed point-by-point updating method of the SB-Tree, and the two dimensionality reduction methods (the tree pruning method and the error threshold method). The experiments were implemented in the C programming language and performed on a Sun computer (Sun Solaris Ultra5 with two 200 MHz UltraSPARC CPUs and 256 MB of memory).
Mobile application
Unlike systems running on a fixed network (Saha et al., 2001), mobile devices operating in a wireless environment suffer from limited resources (Pham et al., 2001). Their small display screens make it challenging to show a complete time series chart clearly. Network bandwidth is also limited, and sometimes expensive when cellular technology is used. Storage and computation capacity of mobile devices are also much inferior to their fixed
Conclusions
This paper has presented a financial time series representation based on a tree structure organized according to the importance of the data points. The process of perceptually important point (PIP) identification, which evaluates the importance of a data point, has been illustrated. Three data point importance evaluation methods, PIP-ED, PIP-PD and PIP-VD, are proposed. Experiments show that PIP-VD is the preferable method for evaluating data point importance in most cases in the financial domain. Then, a
References (30)
- Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence databases. In: Proceedings of the...
- Bargiela and Pedrycz, 2003. Recursive information granulation: aggregation and interpretation issues. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.
- Chan, K.P., Fu, A.C., 1999. Efficient time series matching by wavelets. In: Proceedings of the 15th International...
- Chu, K.K.W., Wong, M.H., 1999. Fast time-series searching with scaling and shifting. In: Proceedings of the 18th ACM...
- Chung, F.L., Fu, T.C., Luk, R., Ng, V., 2001. Flexible time series pattern matching based on perceptually important...
- Douglas and Peucker, 1973. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Canadian Cartographer.
- Feng et al., 2005. A method for segmentation of switching dynamic modes in time series. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.
- Fink and Pratt, 2003. Indexing of compressed time series. In: Data Mining in Time Series Databases.
- Fu, T.C., Chung, F.L., Tang, P.Y., Luk, R., Ng, C.M., 2005. Incremental stock time series data delivery and...
- Fukunaga, 1990. Introduction to Statistical Pattern Recognition.
- Keogh et al. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems.