Representing financial time series based on data point importance

doi:10.1016/j.engappai.2007.04.009

Engineering Applications of Artificial Intelligence

Volume 21, Issue 2, March 2008, Pages 277-300

https://doi.org/10.1016/j.engappai.2007.04.009

Abstract

Recently, the increasing use of time series data has initiated various research and development efforts in the field of data and knowledge management. Time series data is characterized by large data size, high dimensionality and continuous updating. Moreover, time series data is always considered as a whole rather than as individual numerical fields. A large share of time series data comes from the stock market, and stock time series have characteristics that distinguish them from other time series. Moreover, dimensionality reduction is an essential step before many time series analysis and mining tasks. For these reasons, research is prompted to augment existing technologies and build new representations to manage financial time series data. In this paper, a financial time series is represented according to the importance of its data points. Based on the concept of data point importance, a tree data structure that supports incremental updating is proposed to represent the time series, and an access method for retrieving the time series data points from the tree in order of importance is introduced. This technique can present the time series at different levels of detail and facilitates multi-resolution dimensionality reduction of the time series data. Different data point importance evaluation methods, a new updating method and two dimensionality reduction approaches are proposed and evaluated in a series of experiments. Finally, the application of the proposed representation in a mobile environment is demonstrated.

Introduction

Recently, the increasing use of temporal data, in particular time series data, has initiated various research and development efforts in the field of data and knowledge management (Last et al., 2001). A time series is a collection of observations made chronologically. Time series data is characterized by large data size, high dimensionality and continuous updating. Moreover, time series data is always considered as a whole rather than as individual numerical fields.

Research on time series data covers a variety of topics, for example finding similar time series (Liao et al., 2004), querying time series databases (Rafiei and Mendelzon, 2000), segmentation (Wang and Willett, 2004; Feng et al., 2005), dimensionality reduction (Keogh et al., 2000; Keogh et al., 2001), clustering (Policker and Geva, 2000), classification (Wang and Willett, 2004) and forecasting (Pantazopoulos et al., 1998; Sfetsos and Siriopoulos, 2004; Sfetsos and Siriopoulos, 2005). These topics have been studied in considerable detail by both the database and pattern recognition communities for different domains of time series data (Keogh and Kasetty, 2002). While most of the research communities have concentrated on the above issues, the fundamental problem of how to represent a time series in multi-resolution, which is also considered as information granulation (Bargiela and Pedrycz, 2003), has not yet been fully addressed. Representing a time series is essential because time series data is hard to manipulate in its original structure. Therefore, defining a more effective and efficient time series representation scheme is of fundamental importance.

The time series data used in data and knowledge management is high dimensional, and before it can be processed and analyzed, this dimensionality must be reduced, commonly using approaches that focus on lower bounding the Euclidean distance. These approaches, however, smooth out salient points of the original time series, which is counterproductive when applied to financial time series data, as financial analysis often depends on the shape of the data and the salience of data points to identify technical patterns. For these purposes, it is important to reduce dimensionality while retaining the information associated with these points; the salient points are considered important to the shape of the time series.

Previous approaches to reducing dimensionality while retaining point information have included sampling. In this approach, a rate of m/x is used, where m is the length of time series P and x is the dimension after dimensionality reduction. Sampling, however, has the drawback of distorting the shape of the sampled/compressed time series if the sampling rate is too low. As already noted, most other time series dimensionality reduction approaches, such as principal component analysis (PCA) (Fukunaga, 1990), singular value decomposition (SVD) (Korn et al., 1997), discrete Fourier transform (DFT) (Agrawal et al., 1993; Rafiei and Mendelzon, 2000; Chu and Wong, 1999), discrete wavelet transform (DWT) (Popivanov and Miller, 2002; Kahveci and Singh, 2001; Chan and Fu, 1999), piecewise aggregate approximation (PAA) (Keogh et al., 2000; Yi and Faloutsos, 2000) and adaptive piecewise constant approximation (APCA) (Keogh et al., 2001), focus on lower bounding the Euclidean distance. However, because such approaches often lose important data points, they may fail to retain the general shape of the time series after compression (Fig. 1).
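The loss of salient points described above can be illustrated with a small sketch. It assumes a toy series with a single spike and uses two hypothetical helper functions, `sample` (keep every (m // x)-th point) and `paa` (mean of x equal-width frames); neither is taken from the paper, and the numbers are invented for illustration only.

```python
def sample(series, x):
    """Uniform sampling: keep about x points by taking every (m // x)-th value."""
    m = len(series)
    step = max(1, m // x)
    return series[::step][:x]


def paa(series, x):
    """Piecewise aggregate approximation: mean of x equal-width frames (x <= m)."""
    m = len(series)
    # Integer frame boundaries so that the frames cover the whole series.
    bounds = [i * m // x for i in range(x + 1)]
    return [sum(series[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


# Toy series (assumed data): flat around 10 with one salient spike of 30.
series = [10, 10, 11, 10, 10, 30, 10, 10, 11, 10, 10, 10]

sampled = sample(series, 4)   # the spike at index 5 is skipped entirely
averaged = paa(series, 4)     # the spike is averaged into its frame

print(sampled)
print(averaged)
```

In this toy case sampling drops the spike altogether, while PAA keeps a trace of it but flattens it well below its original value of 30. Either way, the point that defines the shape of the series is no longer available after reduction.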

A time series is constructed from a sequence of data points, and the amplitude of each data point influences the shape of the time series to a different extent. That is, each data point has its own importance to the time series. One data point may contribute to the overall shape of the time series, while another may have little influence or may even be discarded. For example, frequently appearing technical time series patterns are typically characterized by a few salient points. A head-and-shoulders pattern, for instance, consists of a head point, two shoulder points and a pair of neck points. These points are perceptually important in the human visual identification process and are therefore more important than other data points in the time series. A data point whose importance is calculated in this way is called a perceptually important point (PIP). The identification of PIPs was first introduced by Chung et al. (2001) and used for pattern matching of technical (analysis) patterns in financial applications. The idea was later found to be similar to a technique proposed about 30 years ago by Douglas and Peucker (1973) for reducing the number of points required to represent a line (see also Hershberger and Snoeyink, 1992). We also found independent works by Perng et al. (2000), Pratt and Fink (2002) and Fink and Pratt (2003) that build on similar ideas. However, none of these techniques proposes a data structure to organize and store the identified salient points.

In this paper, we propose a time series representation framework based on the concept of data point importance. The challenges include how to recognize these salient points and how to design a data structure to represent them that facilitates incremental updating, multi-resolution retrieval and dimensionality reduction. The proposed framework can reduce the time series dimension to different levels of detail based on the importance of the data points. At the same time, the original accuracy can be maintained and salient points are not distorted. A tree data structure that stores the data points of the time series is proposed. We then introduce efficient handling of accumulated new data points, which maintains the data structure incrementally to avoid expensive recomputation, and an access method on this tree that retrieves the time series data points according to their importance.
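The idea of retrieving points in order of importance to obtain different levels of detail can be sketched as follows. This is only an illustration under stated assumptions: a flat Python list stands in for the tree, the `importance_order` list is a hypothetical precomputed ranking (endpoints first), and the helper names `retrieve`, `reconstruct` and `max_error` are not from the paper.

```python
def retrieve(series, importance_order, k):
    """Return the k most important points as (index, value) pairs, sorted by time."""
    return sorted((i, series[i]) for i in importance_order[:k])


def reconstruct(points):
    """Rebuild a full-length series by linear interpolation between retained points.

    Assumes the retained points include the first and last index of the series.
    """
    rebuilt = []
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        for x in range(x1, x2):
            rebuilt.append(y1 + (y2 - y1) * (x - x1) / (x2 - x1))
    rebuilt.append(points[-1][1])
    return rebuilt


def max_error(series, rebuilt):
    """Largest vertical deviation between the original and the rebuilt series."""
    return max(abs(a - b) for a, b in zip(series, rebuilt))


# Toy data (assumed): a short series and a hypothetical importance ranking of its indices.
series = [1, 2, 8, 3, 2, 7, 1]
importance_order = [0, 6, 2, 5, 4, 1, 3]

for k in (2, 3, 7):
    points = retrieve(series, importance_order, k)
    print(k, points, max_error(series, reconstruct(points)))
```

With this toy data the two-point view has a maximum error of 7.0, the three-point view 4.25, and the seven-point view 0.0, so the full series is recovered once all points are retrieved. Retained points always keep their original values, which is the sense in which salient points are not distorted. The maximum error need not shrink at every single step, so a practical reduction would pick k against an explicit criterion.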

The remainder of this paper is organized as follows. Section 2 describes the concept of data point importance and three methods for evaluating the data point importance. Section 3 describes the proposed time series representation framework, the proposed Specialized Binary Tree (SB-Tree) algorithm, and how the SB-Tree is used to create, update, retrieve and reduce the dimension of time series. In Section 4, we analyze the results of the experiments, and the mobile application of the proposed representation is demonstrated in Section 5. Section 6 offers our conclusion.

Section snippets

Defining and evaluating data point importance

In this section, we describe the concept of data point importance based on identifying the perceptually important points (PIPs). Then, we introduce three methods for evaluating the importance of the PIPs in a time series: Euclidean distance (PIP-ED), perpendicular distance (PIP-PD) and vertical distance (PIP-VD). A simple example is given at the end of this section to illustrate the PIP identification process using the different data point importance evaluation methods.
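A minimal sketch of PIP identification with the three measures is given below. The snippet above does not state the formulas, so the definitions used here are assumptions: PIP-ED as the sum of Euclidean distances to the two adjacent PIPs, PIP-PD as the perpendicular distance to the line joining them, and PIP-VD as the vertical distance to that line. The process starts from the first and last points and repeatedly selects the remaining point with the largest distance. All function names are hypothetical, and the quadratic scan is chosen for readability rather than speed.

```python
import bisect
import math


def euclidean_distance(p, left, right):
    """Assumed PIP-ED: sum of Euclidean distances from p to both adjacent PIPs."""
    return math.dist(p, left) + math.dist(p, right)


def perpendicular_distance(p, left, right):
    """Assumed PIP-PD: perpendicular distance from p to the line through the adjacent PIPs."""
    (x, y), (x1, y1), (x2, y2) = p, left, right
    numerator = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
    return numerator / math.hypot(x2 - x1, y2 - y1)


def vertical_distance(p, left, right):
    """Assumed PIP-VD: vertical gap between p and the line through the adjacent PIPs."""
    (x, y), (x1, y1), (x2, y2) = p, left, right
    y_on_line = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    return abs(y - y_on_line)


def identify_pips(series, distance=vertical_distance):
    """Return all indices of series (length >= 2) in order of decreasing importance."""
    n = len(series)
    order = [0, n - 1]        # the two endpoints are the first PIPs
    selected = [0, n - 1]     # the same indices, kept sorted by time
    while len(order) < n:
        best_idx, best_dist = None, -1.0
        # Scan every unselected point between each pair of adjacent PIPs.
        for left, right in zip(selected, selected[1:]):
            for i in range(left + 1, right):
                d = distance((i, series[i]), (left, series[left]), (right, series[right]))
                if d > best_dist:
                    best_idx, best_dist = i, d
        order.append(best_idx)
        bisect.insort(selected, best_idx)
    return order


# Toy series (assumed data) with two peaks.
series = [1, 2, 8, 3, 2, 7, 1]
order_vd = identify_pips(series, vertical_distance)
order_pd = identify_pips(series, perpendicular_distance)
order_ed = identify_pips(series, euclidean_distance)
print(order_vd[:4])  # first four PIPs under the vertical-distance measure: [0, 6, 2, 5]
```

For this toy series, all three measures select the taller peak at index 2 first after the endpoints. The measures can rank later points differently, which is why the paper evaluates them separately.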

Tree representation for dimensionality reduction

The management of financial time series data in multi-resolution requires the definition of a suitable time series representation data structure. In Section 3.1, we therefore describe a tree structure for representing financial time series that is based on determining the data point importance in the time series. Instead of storing the time series data according to time or transforming it into other domains (e.g. the frequency domain), data points of a time series are stored...

Experimental results

In this section, we evaluate the performance of the data point importance evaluation methods (PIP-ED, PIP-PD and PIP-VD), the proposed point-by-point updating method of the SB-Tree, and the dimensionality reduction methods (the tree pruning method and the error threshold method). The experiments were implemented in the C programming language and performed on a Sun computer (Sun Solaris Ultra5 with two 200 MHz UltraSPARC CPUs and 256 MB memory).

Mobile application

Unlike systems running on a fixed network (Saha et al., 2001), mobile devices operating in a wireless environment suffer from limited resources (Pham et al., 2001). Mobile devices have limited display screen size, which makes it challenging to show a complete time series chart clearly. The network bandwidth of a mobile device is also limited, and sometimes expensive when using cellular technology. The storage and computation capacity of mobile devices are also much inferior to their fixed...

Conclusions

This paper has presented a financial time series representation based on a tree structure organized according to the importance of the data points. The process of perceptually important point identification, which evaluates the importance of a data point, has been illustrated. Three data point importance evaluation methods, PIP-ED, PIP-PD and PIP-VD, are proposed. Experiments show that PIP-VD is the preferable method for evaluating data point importance in most cases in the financial domain. Then, a...

References (30)

Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence databases. In: Proceedings of the...
Bargiela, A., et al., 2003. Recursive information granulation: aggregation and interpretation issues. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.
Chan, K.P., Fu, A.C., 1999. Efficient time series matching by wavelets. In: Proceedings of the 15th International...
Chu, K.K.W., Wong, M.H., 1999. Fast time-series searching with scaling and shifting. In: Proceedings of the 18th ACM...
Chung, F.L., Fu, T.C., Luk, R., Ng, V., 2001. Flexible time series pattern matching based on perceptually important...
Douglas, D., et al., 1973. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Canadian Cartographer.
Feng, L., et al., 2005. A method for segmentation of switching dynamic modes in time series. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.
Fink, E., et al., 2003. Indexing of compressed time series. Data Mining in Time Series Databases.
Fu, T.C., Chung, F.L., Tang, P.Y., Luk, R., Ng, C.M., 2005. Incremental stock time series data delivery and...
Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition.
Hershberger, J., Snoeyink, J., 1992. Speeding up the Douglas–Peucker line-simplification algorithm. In: Proceedings of...
Kahveci, T., Singh, A., 2001. Variable length queries for time series data. In: Proceedings of the 17th International...
Keogh, E., et al., 2000. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems.
Keogh, E., Chakrabarti, K., Mehrotra, S., Pazzani, M., 2001. Locally adaptive dimensionality reduction for indexing...
Keogh, E., Kasetty, S., 2002. On the need for time series data mining benchmarks: a survey and empirical demonstration....

