Establishing relationships among patterns in stock market data

doi:10.1016/j.datak.2008.10.001

Data & Knowledge Engineering

Volume 68, Issue 3, March 2009, Pages 318-337

Abstract

Similarities among subsequences are typically regarded as categorical features of sequential data. We introduce an algorithm for capturing the relationships among similar, contiguous subsequences. Two time series are considered to be similar during a time interval if every contiguous subsequence of a predefined length satisfies the given similarity criterion. Our algorithm identifies patterns based on the similarity among sequences, captures the sequence–subsequence relationships among patterns in the form of a directed acyclic graph (DAG), and determines pattern conglomerates that allow the application of additional meta-analyses and mining algorithms. For example, our pattern conglomerates can be used to analyze time information that is lost in categorical representations. We apply our algorithm to stock market data as well as several other time series data sets and show the richness of our pattern conglomerates through qualitative and quantitative evaluations. An exemplary meta-analysis determines timing patterns representing relations between time series intervals and demonstrates the merit of pattern relationships as an extension of time series pattern mining.

Introduction

Time series data are ubiquitous in fields as diverse as economics, science, and industry; hence, it is not surprising that there has been a strong interest in applying data mining techniques to time series data. Time series can be very long, and users are often interested in similarities that extend over a comparatively short time interval, which suggests the use of sliding-window techniques. An approach that is based on sliding windows starts with all possible fixed-length, contiguous subsequences of the time series under consideration. Note that the term “subsequence” has multiple meanings in the literature. We use subsequence in the sense of a contiguous section of a sequence, which is sometimes also called a “substring”. To address the properties of time series data, special similarity measures have been devised that are defined over variable-length subsequences and that make other generalizations [45], [7], [6], [12]. With well-established similarity measures in place, researchers have pursued the pattern mining, clustering, and classification tasks that are common in data mining.
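The starting point of a sliding-window approach can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation; the function name `sliding_windows` and the sample values are hypothetical.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], w: int) -> List[List[float]]:
    """Return every contiguous subsequence of length w, ordered by start position."""
    if w <= 0:
        raise ValueError("window length must be positive")
    # A series of length n yields n - w + 1 windows (none if w > n).
    return [list(series[i:i + w]) for i in range(len(series) - w + 1)]


# Hypothetical closing prices, used only for illustration.
prices = [10.0, 11.0, 12.0, 11.5, 11.0]
windows = sliding_windows(prices, 3)
```

For a series of length n and window length w, this enumeration produces n − w + 1 windows, which is why sliding-window methods scale with the length of the series rather than with the number of possible variable-length subsequences.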

The richness of temporal data is not, however, captured by modified similarity measures alone. For sequential data, there are strong reasons to revise even the concept of pattern mining itself: conventionally, pattern mining is seen as returning isolated, frequent occurrences in the data. Although relationships among patterns have been used extensively as a basis for pruning through closure properties [1], these set–subset relationships do not normally contribute much to the expressiveness of the result when time series are considered. In comparison to record data, time series data inherently provide an additional dimension (time) for each data item. The time dimension can be utilized not only for mining patterns but also for capturing the relationships among patterns. In our interpretation, a revised concept of pattern mining should include the interrelations among patterns.

For example, knowing that a group of stock series shares a pattern over a long period of time, while other stock series show a related pattern over a much shorter interval, can provide valuable insights into the price developments of stocks. The relationships among patterns carry important information in their own right. It is our goal to capture the similarities among stock market time series such that their sequence–subsequence relationships are preserved. We identify patterns representing collections of contiguous subsequences that share the same shape for a particular time interval. Patterns are defined on the basis of contiguous sections of normalized sliding windows that show pairwise similarities among sequences. The relationships among sliding-window patterns are represented using a directed acyclic graph (DAG) that is constructed based on the overlap between patterns. Leaf nodes within the DAG denote entire sequences, internal nodes represent patterns, and the sequence–subsequence relationships among patterns are represented by the edges. In a directed graph, an internal node, in contrast to a leaf node, has at least one directed edge to another node.

The information contained within the DAG, as well as timing information, is represented using a pattern conglomerate notation that constitutes a new level of abstraction. The pattern conglomerate concept is designed to allow meta-analyses. In the context of this paper, a meta-analysis is an analysis applied to the results of another analysis, i.e., our pattern conglomerates (the result of the first analysis) can be used as input to a second analysis (the meta-analysis). A pattern conglomerate incorporates the structure of the DAG and the order of clustered sequences, as well as the extent of the subsequences considered during the execution of our algorithm (Section 3.3).

Panel (a) of Fig. 1 depicts an example of four time series that show a total of three characteristic shapes. The sliding-window pattern that is signified by × is shared by all four sequences. Sequences A and B show a longer pattern that extends as far as the section with a □. Time series C and D have a different extended pattern comprised of × and ○. The corresponding DAG representation is shown in panel (b) of Fig. 1. Each time series is represented by a leaf node, and all three patterns are represented as internal nodes. The root node, ×, connects to the two other internal nodes, which represent the longer patterns. Note that the DAG is different from the similarity-based representations that are common in hierarchical clustering, where degrees of similarity are used to group sequences. In our case, the length of the overlap determines the position in the DAG, and similarity is defined through a single window-based threshold. Accordingly, the × node is created based on the overlap between patterns A/B (□×) and C/D (×○) rather than on the degree of similarity between the sequences. Panel (c) of Fig. 1 depicts the abstraction of the DAG in the form of a pattern conglomerate. The structure of the DAG is represented using parentheses, and the beginning and ending of regions of similarity between pairs of sequences are indicated by braces with subscripts.
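The DAG described for Fig. 1 can be written down as a small adjacency structure. The sketch below is illustrative only: the dictionary layout and the helper names `leaves_below` and `is_acyclic` are assumptions, and it does not reproduce the paper's pattern conglomerate notation. Edges point from each internal node to the nodes it connects to, so leaf nodes have no outgoing edges.

```python
from typing import Dict, List, Set

# Internal nodes are patterns; leaf nodes are the four time series A to D.
dag: Dict[str, List[str]] = {
    "×": ["□×", "×○"],   # root: the pattern shared by all four sequences
    "□×": ["A", "B"],    # longer pattern shared by A and B
    "×○": ["C", "D"],    # longer pattern shared by C and D
    "A": [], "B": [], "C": [], "D": [],
}


def leaves_below(graph: Dict[str, List[str]], node: str) -> Set[str]:
    """Collect the leaf nodes (entire sequences) reachable from a node."""
    children = graph[node]
    if not children:
        return {node}
    found: Set[str] = set()
    for child in children:
        found |= leaves_below(graph, child)
    return found


def is_acyclic(graph: Dict[str, List[str]]) -> bool:
    """Check acyclicity with Kahn's algorithm (repeatedly remove in-degree-0 nodes)."""
    indegree = {node: 0 for node in graph}
    for children in graph.values():
        for child in children:
            indegree[child] += 1
    ready = [node for node, degree in indegree.items() if degree == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in graph[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited == len(graph)
```

Calling `leaves_below(dag, "×")` returns all four sequences, while `leaves_below(dag, "□×")` returns only A and B, which mirrors the statement that × is shared by every series and □× only by the first two.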

We demonstrate the usefulness of our pattern conglomerates by determining timing patterns of the form begins earlier, ends later, and is longer between time series of the same pattern conglomerate. An example of a timing pattern in Fig. 1a is that A and B begin earlier than C and D. We apply our algorithm to 460 stock market time series of the S&P 500 index as well as to four additional time series data sets (Section 5.1). The additional data sets serve to highlight the applicability of our approach to different time series data sets (Section 5.5) and to provide a more comprehensive performance analysis (Section 5.7).
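The three timing patterns reduce to comparisons of interval endpoints. The following is a minimal sketch under the assumption that each subsequence is described by inclusive start and end positions; the `Interval` type, the function name, and the numeric positions are hypothetical and not taken from the paper.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    start: int  # position of the first time point of the subsequence
    end: int    # position of the last time point (inclusive)


def timing_relations(first: Interval, second: Interval) -> List[str]:
    """List which of the three timing patterns hold for first relative to second."""
    relations = []
    if first.start < second.start:
        relations.append("begins earlier")
    if first.end > second.end:
        relations.append("ends later")
    if (first.end - first.start) > (second.end - second.start):
        relations.append("is longer")
    return relations


# Hypothetical positions: the pattern in series A starts before the one in C,
# and both end at the same time point.
interval_a = Interval(start=10, end=60)
interval_c = Interval(start=25, end=60)
relations_a_vs_c = timing_relations(interval_a, interval_c)
```

With these positions, A begins earlier and is longer than C, but does not end later, so the relations are not redundant with one another.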

The stock price of a company is influenced by a wealth of internal and external factors. An internal factor may be the perceived potential of the company to be successful in the future (e.g., competent management or ability to generate profit), and an external factor could be the future expectations of a market in which the company operates. There have been several studies addressing the influence of external factors such as news on stock market behavior [48], [26], [49], [5], [30]. We do not restrict our analysis by the assumption that there is a single external factor, such as a news report, affecting stock prices. It is our objective to observe the effects of combinations of external influences that have an impact on the stock prices of two or more companies. Note that we do not attempt to identify the nature of any factors but rather observe their effects. We assume that stocks of two companies may show a similar shape when major influences or economic pressures on these companies are similar. For example, if the future expectations of a particular market (e.g., e-commerce) are very positive (or negative), then the stock time series of companies that operate in this market are likely to show a very similar shape. The application of our algorithm to stock time series results in a DAG representation, e.g., Fig. 1b, where contiguous subsequences of stocks that exhibit a similar shape for some time interval are grouped together. Based on the above interpretation, the companies that issue these stocks are under the pressure of similar factors for that particular interval. Our exemplary meta-analysis focuses on the onset and progression of factors affecting two companies. Temporal relationships of interest include the observation that the impact of factors on some companies begins earlier, ends later, or lasts longer than on others.

Traditionally, work on stock market data has focused on predictive modeling [4], [11] and the study of anomalies [41]. In recent years, data mining approaches have increasingly gained importance [19], [25], [40], [31], [15], despite negative connotations of the term “data mining”, which is sometimes interpreted as being “synonymous with data dredging and fishing” [21]. Predictive tasks are still in the foreground of data mining [25] and machine learning [43], [46], [27] technique development. Applications have been introduced to address the technical challenges of monitoring and mining time-critical financial data in conjunction with mobile computing devices [23].

The use of standard clustering algorithms for grouping fixed-length, contiguous subsequences of time series [14] has been shown to be a challenging problem [24]. Although the observed problems are not insurmountable [16], [10], this paper avoids them by only comparing windows at a fixed time point and only considering windows that have matches that are statistically significant. Relationships between different time series have been studied in [2], [50], [36], [9]. These techniques are based on the interpretation of a sequence as an ordered list of events and are usually discussed under the term sequential pattern mining. Sequential pattern mining addresses the identification of frequent, but not necessarily contiguous, subsequences [2], [36].
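Comparing windows only at a fixed time point can be illustrated as follows. This is a sketch under stated assumptions, not the paper's procedure: z-normalization, Euclidean distance, and a plain distance threshold are stand-ins chosen for illustration, the statistical significance test mentioned above is omitted, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def z_normalize(window: Sequence[float]) -> List[float]:
    """Shift a window to zero mean and scale it to unit standard deviation."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0.0:
        return [0.0] * len(window)  # constant window: no shape information
    return [(x - mean) / std for x in window]


def similar_intervals(a: Sequence[float], b: Sequence[float],
                      w: int, threshold: float) -> List[Tuple[int, int]]:
    """Return maximal runs (first_start, last_start) of window start positions
    at which the normalized windows of a and b, taken at the same time point,
    lie within the distance threshold."""
    n = min(len(a), len(b))
    runs: List[Tuple[int, int]] = []
    run_start = None
    for i in range(n - w + 1):
        wa = z_normalize(a[i:i + w])
        wb = z_normalize(b[i:i + w])
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(wa, wb)))
        if dist <= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        runs.append((run_start, n - w))
    return runs
```

Because window i of one series is only ever compared with window i of the other, two series count as similar over an interval exactly when every window inside that interval passes the test, which matches the interval-based notion of similarity used in the abstract.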

Analysis of time series is also discussed in the area of stream mining [18]. Typically, the application of mining algorithms to continuous data streams has real-time constraints and requires one-pass searches or fast responses [28], [47]. Accordingly, stream mining approaches are limited by the available computational resources and the frequency of newly arriving data.

Similar techniques to the ones discussed in this work have been applied to categorical gene sequences [17]. Fundamental ideas can often be applied both to sequences of categorical values such as gene sequences and to time series data. An example is dynamic time warping for time series data which corresponds to the Needleman–Wunsch alignment algorithm for categorical sequences. Differences in normalization, similarity measures and evaluation of thresholds require substantial new algorithm development. The focus of Dorr and Denton [17] is on the identification of motifs in biological sequences and it is shown that the identified motifs are useful for assigning functional annotations to protein sequences. In contrast, this paper addresses the sequence–subsequence relationships among stock market time series, and the usefulness of abstracting these relationships to pattern conglomerates is shown through timing patterns. Since time series are based on real numbers (Definition 1) and protein sequences are composed of categorical values, time series data must be processed differently than protein sequences. Several algorithms have been proposed for discovering motifs in general, and for addressing specific aspects of the discovery problem in particular. Some algorithms focus on the discovery of motifs with a particular length [8], [13], and others address the problem of identifying motifs satisfying certain composition criteria [51], [52]. The maximization of the number of sequences associated with a motif is the focus of Gouzy et al. [20] as well as Sonnhammer and Kahn [42], and the discovery of motifs that cannot be extended without reducing the number of supporting sequences is addressed by the algorithms TEIRESIAS [39] and Gemoda [22].

This paper explicitly focuses on the identification of relationships among patterns and their abstraction to pattern conglomerates. Pattern conglomerates are derived through a clustering-like algorithm that establishes a DAG representing the relationships among patterns. Algorithms have been proposed that focus on the decomposition of clusterings [34] or that utilize a DAG for representing relationships among entire time series [38]. In contrast to Queen et al. [38], it is our objective to represent the relationships among subsequences of time series. Villafane et al. [44] focus on containment relationships among time series subsequences and use a DAG for representing these relationships. Our approach also captures the containment relationships among time series subsequences and additionally considers the overlap between subsequences that do not satisfy the containment criterion. Our timing patterns (begins earlier, ends later, and is longer) are a subset of Allen’s as well as Freksa’s relations between intervals [31], [33]. We address the identification and abstraction of relationships among patterns and show the usefulness of our pattern conglomerates by deriving timing patterns.

In Section 2, fundamental definitions related to time series are introduced. Our approach is discussed in Section 3, and Section 4 provides an exemplary meta-analysis that utilizes our pattern conglomerates. The experimental evaluation in Section 5 provides several examples (Section 5.2), results for particular sectors (Section 5.3), identified timing patterns (Section 5.4), and a discussion addressing additional data sets (Section 5.5), as well as a significance analysis (Section 5.6) and a performance analysis (Section 5.7).

Time series

Often sequences are formed by collecting attributes of the same type at different points in time. If those data are real-valued, which is the case for stock prices, we commonly talk of time series data. For this paper we will limit our discussion to real-valued time series.

Definition 1

A time series T = t₁, …, t_n is a sequence of real numbers, corresponding to values of an observed quantity, collected at regular time intervals.

Definition 2

A subsequence of a time series T = t₁, …, t_n with length w′ is a contiguous sequence T′ = t…

Clustering

Our algorithm represents sequence–subsequence relationships as edges in a DAG. Each leaf node within the DAG denotes a time series. An internal node represents a set of subsequences that show a direct or indirect mutual similarity. We refer to the internal nodes as sliding-window patterns.
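Grouping items by direct or indirect mutual similarity amounts to finding connected components of a similarity graph. The following union-find sketch illustrates that idea only; the function name and inputs are hypothetical, and any further conditions that the paper places on sliding-window patterns are not modeled here.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def connected_groups(items: Iterable[Hashable],
                     similar_pairs: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group items that are linked by a chain of pairwise similarities."""
    parent: Dict[Hashable, Hashable] = {item: item for item in items}

    def find(x: Hashable) -> Hashable:
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups: Dict[Hashable, Set[Hashable]] = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


# Hypothetical window-level similarities: A~B and B~C, so A and C are
# indirectly similar through B; D and E match nothing.
groups = connected_groups(["A", "B", "C", "D", "E"], [("A", "B"), ("B", "C")])
```

In this example A and C end up in the same group even though they were never compared directly, which is the sense of "indirect" similarity used above.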

Definition 5

A sliding-window pattern is given by a set of sequence sections of uniform length, all of which are nodes in a connected graph of sliding-window alignments (see Definition 4). All alignments are required to

Example meta-analysis: timing patterns

We show the usefulness of our pattern conglomerates by utilizing them in a meta-analysis. Timing patterns are determined that describe relationships between time series represented in the same pattern conglomerate. We determine whether a subsequence of one time series begins earlier, ends later, or is longer than a subsequence of another time series. Since a time series can be involved in multiple pattern conglomerates, relationships between subsequences of several time series can be observed

Data and parameter choices

We use daily historical data for stocks of the S&P 500 index from http://kumo.swcp.com/stocks/. The obtained data set includes stock prices for an entire year from 02/01/2006 to 01/31/2007. The analysis of the stocks is done using the closing values of each day, and excludes all those stocks that have not been a member of the S&P 500 index for the entire year. However, our approach does not need to be limited to time series of a particular length. Overall, the data set consists of 460 different

Conclusions

We introduce an algorithm for representing the sequence–subsequence relationships among patterns based on subsequence similarities. The relationships between similar, contiguous subsequences are based on their overlap and result in a directed acyclic graph (DAG). Our DAG representation is abstracted to pattern conglomerates, which in turn are evaluated by examining the differences between the beginning and ending positions of similar subsequences. We apply our approach to stock market time

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. IDM-0415190.

Dietmar Dorr is a Ph.D. student in Computer Science at the North Dakota State University (NDSU). He received his M.S. in Software Engineering at the University of St. Thomas, St. Paul, MN, USA. Recently, Dietmar joined the Research & Development team at Thomson Reuters. His research interests include data mining, information retrieval, and natural language processing.

References (52)

S. de Amo et al., First-order temporal pattern mining with regular expression constraints, Data and Knowledge Engineering (2007)
J. Gouzy et al., Whole genome protein domain analysis using a new method for domain clustering, Computers & Chemistry (1999)
H. Teoh et al., Fuzzy time series model based on probabilistic approach and rough set rule induction for empirical research in stock markets, Data and Knowledge Engineering (2008)
R. Agrawal et al., Fast algorithms for mining association rules in large databases
R. Agrawal et al., Mining sequential patterns
S.F. Altschul et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research (1997)
E.M. Azoff, Neural Network Time Series Forecasting of Financial Markets (1994)
B.S. Bernanke et al., What explains the stock market’s reaction to Federal Reserve policy?, The Journal of Finance (2005)
D.J. Berndt et al., Using dynamic time warping to find patterns in time series
D.J. Berndt et al., Finding Patterns in Time Series: A Dynamic Programming Approach, in: Advances in Knowledge Discovery and Data Mining (1996)
J. Buhler et al., Finding motifs using random projections
G. Chen et al., Sequential pattern mining in multiple streams
J. Chen, Making clustering in delay-vector space meaningful, Knowledge and Information Systems (2007)
S.H. Chen et al., Computational Intelligence in Economics and Finance (Advanced Information Processing) (2006)
Y. Chen et al., SpADe: On shape-based pattern detection in streaming time series
B. Chiu et al., Probabilistic discovery of time series motifs
G. Das et al., Rule discovery from time series
A. Denton, Kernel-density-based clustering of time series subsequences using a continuous random-walk noise model
D.H. Dorr, A.M. Denton, Clustering sequences by overlap, International Journal of Data Mining and Bioinformatics, in...
M.M. Gaber et al., Mining data streams: a review, ACM SIGMOD Record (2005)
M. Gavrilov et al., Mining the stock market: which measure is best?
D. Hand, Data mining: statistics and more?, The American Statistician (1998)
K.L. Jensen et al., A generic motif discovery algorithm for sequential data, Bioinformatics (2006)
H. Kargupta et al., MobiMine: monitoring the stock market from a PDA, ACM SIGKDD Explorations Newsletter (2002)
E. Keogh et al., Clustering of time-series subsequences is meaningless: implications for previous and future research
B. Kovalerchuk et al., Data Mining in Finance: Advances in Relational and Hybrid Methods (2000)

Anne Denton is Assistant Professor in the Computer Science Department at North Dakota State University (NDSU). She received her Ph.D. in Physics from the University of Mainz, Germany, in 1996, and a M.S. in Computer Science from NDSU in 2003. Her research interests center on data mining of diverse data, including time series, sequence-, graph-, vector- and item data. She serves on the editorial board of the Biomed Central journal Source Code for Biology and Medicine.
