Establishing relationships among patterns in stock market data

https://doi.org/10.1016/j.datak.2008.10.001

Abstract

Similarities among subsequences are typically regarded as categorical features of sequential data. We introduce an algorithm for capturing the relationships among similar, contiguous subsequences. Two time series are considered to be similar during a time interval if every contiguous subsequence of a predefined length satisfies the given similarity criterion. Our algorithm identifies patterns based on the similarity among sequences, captures the sequence–subsequence relationships among patterns in the form of a directed acyclic graph (DAG), and determines pattern conglomerates that allow the application of additional meta-analyses and mining algorithms. For example, our pattern conglomerates can be used to analyze time information that is lost in categorical representations. We apply our algorithm to stock market data as well as several other time series data sets and show the richness of our pattern conglomerates through qualitative and quantitative evaluations. An exemplary meta-analysis determines timing patterns representing relations between time series intervals and demonstrates the merit of pattern relationships as an extension of time series pattern mining.

Introduction

Time series data are ubiquitous in fields as diverse as economics, science, and industry; hence, it is not surprising that there has been strong interest in applying data mining techniques to time series data. Time series can be very long, and users are often interested in similarities that extend over a comparatively short time interval, which suggests the use of sliding-window techniques. An approach based on sliding windows starts with all possible fixed-length, contiguous subsequences of the time series under consideration. Note that the term “subsequence” has multiple meanings in the literature. We use subsequence in the sense of a contiguous section of a sequence, which is sometimes also called a “substring”. To address the properties of time series data, special similarity measures have been devised that are defined over variable-length subsequences and incorporate other generalizations [45], [7], [6], [12]. With well-established similarity measures in place, researchers have pursued the pattern mining, clustering, and classification tasks that are common in data mining.
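The starting point of a sliding-window approach can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the function name sliding_windows and the parameter w are assumptions introduced here for illustration.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], w: int) -> List[List[float]]:
    """Return every contiguous subsequence of length w, ordered by start index.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if w <= 0:
        raise ValueError("window length must be positive")
    # A series of length n has n - w + 1 windows; none if it is shorter than w.
    return [list(series[i:i + w]) for i in range(len(series) - w + 1)]


windows = sliding_windows([1.0, 2.0, 4.0, 3.0, 5.0], w=3)
for start, window in enumerate(windows):
    print(start, window)
```

A series of length n yields n - w + 1 windows, so the number of candidate subsequences grows linearly with the length of the series for a fixed window length.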

The richness of temporal data is, however, not captured by modified similarity measures alone. For sequential data, there are strong reasons to revise even the concept of pattern mining itself: conventionally, pattern mining is seen as returning isolated, frequent occurrences in the data. Although relationships among patterns have been used extensively as a basis for pruning through closure properties [1], these set–subset relationships do not normally contribute much to the expressiveness of the result when time series are considered. In comparison to record data, time series data inherently provide an additional dimension (time) for each data item. The time dimension can be utilized not only for mining patterns but also for capturing the relationships among patterns. In our interpretation, a revised concept of pattern mining should include the interrelations among patterns.

For example, knowing that a group of stock series shares a pattern over a long period of time, while other stock series show a related pattern over a much shorter interval, can provide valuable insights into the price developments of stocks. The relationships among patterns have important information content by themselves. It is our goal to capture the similarities among stock market time series such that their sequence–subsequence relationships are preserved. We identify patterns representing collections of contiguous subsequences that share the same shape for a particular time interval. Patterns are defined on the basis of contiguous sections of normalized sliding windows that show pairwise similarities among sequences. The relationships among sliding-window patterns are represented using a directed acyclic graph (DAG) that is constructed based on the overlap between patterns. Leaf nodes within the DAG denote entire sequences, internal nodes represent patterns, and the sequence–subsequence relationships among patterns are represented by the edges. In a directed graph, an internal node, in contrast to a leaf node, has at least one directed edge to another node. The information contained within the DAG, as well as timing information, is represented using a pattern conglomerate notation that constitutes a new level of abstraction. The pattern conglomerate concept is designed to allow meta-analyses. In the context of this paper, a meta-analysis is an analysis applied to the results of another analysis, i.e., our pattern conglomerates (result of the first analysis) can be used as input to another, second analysis (meta-analysis). A pattern conglomerate incorporates the structure of the DAG and the order of clustered sequences, as well as the extent of the subsequences considered during the execution of our algorithm (Section 3.3). Panel (a) of Fig. 1 depicts an example of four time series that show a total of three characteristic shapes. The sliding-window pattern that is signified by × is shared by all four sequences. Sequences A and B show a longer pattern that extends as far as the section with a □. Time series C and D have a different extended pattern comprised of × and ○. The corresponding DAG representation is shown in panel (b) of Fig. 1. Each time series is represented by a leaf node, and all three patterns are represented as internal nodes. The root node, ×, connects to the two other internal nodes, which represent the longer patterns. Note that the DAG is different from similarity-based representations that are common in hierarchical clustering, where degrees of similarity are used to group sequences. In our case, the length of overlap determines the position in the DAG, and similarity is defined through a single window-based threshold. Accordingly, the × node is created based on the overlap between patterns A/B (□×) and C/D (×○) rather than the degree of similarity between the sequences. The third panel (c) of Fig. 1 depicts the abstraction of the DAG in the form of a pattern conglomerate. The structure of the DAG is represented using parentheses, and the beginning and ending of regions of similarity between pairs of sequences are indicated by braces with subscripts.
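The Fig. 1 example can be written down as a small adjacency structure. The sketch below is an illustration under stated assumptions: the node labels "x", "sq-x" and "x-o" stand in for ×, □× and ×○, the helper names are hypothetical, and the parenthesized string only mirrors the nesting of the DAG. It omits the braces and subscripts that carry timing information, so it is not the paper's pattern conglomerate notation.

```python
from typing import Dict, List

# Edges run from a shared pattern to the longer patterns or whole sequences
# that contain it, following the textual description of Fig. 1.
# Node labels are ASCII stand-ins chosen for this illustration.
dag: Dict[str, List[str]] = {
    "x": ["sq-x", "x-o"],   # root: pattern shared by all four sequences
    "sq-x": ["A", "B"],     # longer pattern shared by A and B
    "x-o": ["C", "D"],      # longer pattern shared by C and D
    "A": [], "B": [], "C": [], "D": [],  # leaf nodes: entire sequences
}


def leaves(node: str) -> List[str]:
    """Collect the sequences (leaf nodes) reachable from a pattern node."""
    children = dag[node]
    if not children:
        return [node]
    found: List[str] = []
    for child in children:
        for leaf in leaves(child):
            if leaf not in found:  # a DAG may reach a leaf by several paths
                found.append(leaf)
    return found


def parenthesize(node: str) -> str:
    """Render the nesting below a node with parentheses (structure only)."""
    children = dag[node]
    if not children:
        return node
    return "(" + " ".join(parenthesize(child) for child in children) + ")"


print(leaves("x"))
print(parenthesize("x"))
```

Because position in the graph is determined by overlap rather than by a degree of similarity, the structure stays the same however close the sequences are, as long as the single window-based threshold is met.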

We demonstrate the usefulness of our pattern conglomerates by determining timing patterns of the form begins earlier, ends later, and is longer between time series of the same pattern conglomerate. An example of a timing pattern in Fig. 1a is that A and B begin earlier than C and D. We apply our algorithm to 460 stock market time series of the S&P 500 index as well as to four additional time series data sets (Section 5.1). The additional data sets serve to highlight the applicability of our approach to different time series data sets (Section 5.5) and to provide a more comprehensive performance analysis (Section 5.7).
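The three timing patterns reduce to simple comparisons once each time series has an interval within a conglomerate. The following sketch assumes that an interval is an inclusive (start, end) pair of time indices; that representation, the function name timing_patterns, and the sample values are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed representation: inclusive (start, end)


def timing_patterns(a: Interval, b: Interval) -> List[str]:
    """List the timing patterns that hold for interval a relative to b."""
    relations: List[str] = []
    if a[0] < b[0]:
        relations.append("begins earlier")
    if a[1] > b[1]:
        relations.append("ends later")
    if (a[1] - a[0]) > (b[1] - b[0]):
        relations.append("is longer")
    return relations


# Hypothetical intervals for two series in the same conglomerate.
series_a = (10, 40)
series_c = (15, 35)
print(timing_patterns(series_a, series_c))
print(timing_patterns(series_c, series_a))
```

The relations are not independent: an interval that begins earlier and ends later is necessarily longer, whereas "is longer" can also hold on its own.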

The stock price of a company is influenced by a wealth of internal and external factors. An internal factor may be the perceived potential of the company to be successful in the future (e.g., competent management or ability to generate profit), and an external factor could be the future expectations of a market in which the company operates. There have been several studies addressing the influence of external factors such as news on stock market behavior [48], [26], [49], [5], [30]. We do not restrict our analysis by the assumption that there is a single external factor, such as a news report, affecting stock prices. It is our objective to observe the effects of combinations of external influences that have an impact on the stock prices of two or more companies. Note that we do not attempt to identify the nature of any factors but rather observe their effects. We assume that stocks of two companies may show a similar shape when major influences or economic pressures on these companies are similar. For example, if the future expectations of a particular market (e.g., e-commerce) are very positive (or negative), then the stock time series of companies that operate in this market are likely to show a very similar shape. The application of our algorithm to stock time series results in a DAG representation, e.g., Fig. 1b, where contiguous subsequences of stocks that exhibit a similar shape for some time interval are grouped together. Based on the above interpretation, the companies that issue these stocks are under the pressure of similar factors for that particular interval. Our exemplary meta-analysis focuses on the onset and progression of factors affecting two companies. Temporal relationships of interest include the observation that the impact of factors on some companies begins earlier, ends later, or lasts longer than on others.

Traditionally, work on stock market data has focused on predictive modeling [4], [11] and study of anomalies [41]. In recent years, data mining approaches have increasingly gained importance [19], [25], [40], [31], [15], despite negative connotations of the term “data mining”, which is sometimes interpreted as being “synonymous with data dredging and fishing” [21]. Predictive tasks are still in the foreground of data mining [25] and machine learning [43], [46], [27] technique development. Applications have been introduced to address the technical challenges of monitoring and mining time-critical financial data in conjunction with mobile computing devices [23].

The utilization of standard clustering algorithms for grouping fixed-length, contiguous subsequences of time series [14] has been shown to be a challenging problem [24]. Although the observed problems are not insurmountable [16], [10], this paper avoids them by only comparing windows at a fixed time point and only considering windows whose matches are statistically significant. Relationships between different time series have been studied in [2], [50], [36], [9]. These techniques are based on the interpretation of a sequence as an ordered list of events and are usually discussed under the term sequential pattern mining. Sequential pattern mining addresses the identification of frequent, but not necessarily contiguous, subsequences [2], [36].
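Comparing windows only at the same time point can be illustrated as follows. This is a sketch under several assumptions that are not taken from the paper: windows are z-normalized, similarity is a Euclidean distance at or below a fixed threshold, and no significance test is applied. The function names and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def z_normalize(window: Sequence[float]) -> List[float]:
    """Shift to zero mean and scale to unit variance (constant windows map to zeros)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0.0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


def similar_runs(s: Sequence[float], t: Sequence[float],
                 w: int, threshold: float) -> List[Tuple[int, int]]:
    """Return maximal runs (first_start, last_start) of window positions at which
    the normalized windows of s and t, taken at the same position, are similar."""
    n = min(len(s), len(t))
    runs: List[Tuple[int, int]] = []
    run_start = None
    for i in range(n - w + 1):
        a = z_normalize(s[i:i + w])
        b = z_normalize(t[i:i + w])
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        if dist <= threshold:          # assumed similarity criterion
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        runs.append((run_start, n - w))
    return runs


s = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0]
t = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
print(similar_runs(s, t, w=3, threshold=0.1))
```

Each run corresponds to a time interval during which every window of the predefined length satisfies the criterion, which matches the notion of interval similarity stated in the abstract.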

Analysis of time series is also discussed in the area of stream mining [18]. Typically, the application of mining algorithms to continuous data streams has real-time constraints and requires one-pass searches or fast responses [28], [47]. Accordingly, stream mining approaches are limited by the available computational resources and the frequency of newly arriving data.

Techniques similar to those discussed in this work have been applied to categorical gene sequences [17]. Fundamental ideas can often be applied both to sequences of categorical values, such as gene sequences, and to time series data. An example is dynamic time warping for time series data, which corresponds to the Needleman–Wunsch alignment algorithm for categorical sequences. Differences in normalization, similarity measures, and evaluation of thresholds nevertheless require substantial new algorithm development. The focus of Dorr and Denton [17] is on the identification of motifs in biological sequences, and it is shown that the identified motifs are useful for assigning functional annotations to protein sequences. In contrast, this paper addresses the sequence–subsequence relationships among stock market time series, and the usefulness of abstracting these relationships to pattern conglomerates is shown through timing patterns. Since time series are based on real numbers (Definition 1) and protein sequences are composed of categorical values, time series data must be processed differently than protein sequences. Several algorithms have been proposed for discovering motifs in general, and for addressing specific aspects of the discovery problem in particular. Some algorithms focus on the discovery of motifs with a particular length [8], [13], and others address the problem of identifying motifs satisfying certain composition criteria [51], [52]. The maximization of the number of sequences associated with a motif is the focus of Gouzy et al. [20] as well as Sonnhammer and Kahn [42], and the discovery of motifs that cannot be extended without reducing the number of supporting sequences is addressed by the algorithms TEIRESIAS [39] and Gemoda [22].

This paper explicitly focuses on the identification of relationships among patterns and their abstraction to pattern conglomerates. Pattern conglomerates are derived through a clustering-like algorithm that establishes a DAG representing the relationships among patterns. Algorithms have been proposed that focus on the decomposition of clusterings [34] or that utilize a DAG for representing relationships among entire time series [38]. In contrast to Queen et al. [38], it is our objective to represent the relationships among subsequences of time series. Villafane et al. [44] focus on containment relationships among time series subsequences and use a DAG for representing these relationships. Our approach also captures the containment relationships among time series subsequences and additionally considers the overlap between subsequences that do not satisfy the containment criterion. Our timing patterns (begins earlier, ends later, and is longer) are a subset of Allen’s as well as Freksa’s relations between intervals [31], [33]. We address the identification and abstraction of relationships among patterns and show the usefulness of our pattern conglomerates by deriving timing patterns.

In Section 2, fundamental definitions related to time series are introduced. Our approach is discussed in Section 3, and Section 4 provides an exemplary meta-analysis that utilizes our pattern conglomerates. The experimental evaluation in Section 5 provides several examples (Section 5.2), results for particular sectors (Section 5.3), identified timing patterns (Section 5.4), and a discussion addressing additional data sets (Section 5.5), as well as a significance analysis (Section 5.6) and a performance analysis (Section 5.7).

Time series

Often sequences are formed by collecting attributes of the same type at different points in time. If those data are real-valued, which is the case for stock prices, we commonly talk of time series data. For this paper we will limit our discussion to real-valued time series.

Definition 1

A time series T = t_1, …, t_n is a sequence of real numbers, corresponding to values of an observed quantity, collected at regular time intervals.

Definition 2

A subsequence of a time series T = t_1, …, t_n with length w′ is a contiguous sequence T=t

Clustering

Our algorithm represents sequence–subsequence relationships as edges in a DAG. Each leaf node within the DAG denotes a time series. An internal node represents a set of subsequences that show a direct or indirect mutual similarity. We refer to the internal nodes as sliding-window patterns.
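"Direct or indirect mutual similarity" can be read as membership in the same connected component of a graph whose edges are pairwise similarities. The sketch below illustrates that reading with a small union-find structure. It is an assumption-laden illustration, not the authors' clustering procedure: the function name connected_groups and the example pairs are hypothetical, and the sections are identified only by labels.

```python
from typing import Dict, Iterable, List, Set, Tuple


def connected_groups(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group labels that are linked directly or through a chain of pairs."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups: Dict[str, Set[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Hypothetical pairwise similarities between sequence sections.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
groups = connected_groups(pairs)
print(sorted(sorted(g) for g in groups))
```

In this example A and C end up in the same group although they are never compared directly, which is the indirect case mentioned in the text.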

Definition 5

A sliding-window pattern is given by a set of sequence sections of uniform length, all of which are nodes in a connected graph of sliding-window alignments (see Definition 4). All alignments are required to

Example meta-analysis: timing patterns

We show the usefulness of our pattern conglomerates by utilizing them in a meta-analysis. Timing patterns are determined that describe relationships between time series represented in the same pattern conglomerate. We determine whether a subsequence of one time series begins earlier, ends later, or is longer than a subsequence of another time series. Since a time series can be involved in multiple pattern conglomerates, relationships between subsequences of several time series can be observed

Data and parameter choices

We use daily historical data for stocks of the S&P 500 index from http://kumo.swcp.com/stocks/. The obtained data set includes stock prices for an entire year from 02/01/2006 to 01/31/2007. The analysis of the stocks is done using the closing values of each day, and excludes all those stocks that have not been a member of the S&P 500 index for the entire year. However, our approach does not need to be limited to time series of a particular length. Overall, the data set consists of 460 different

Conclusions

We introduce an algorithm for representing the sequence–subsequence relationships among patterns based on subsequence similarities. The relationships between similar, contiguous subsequences are based on their overlap and result in a directed acyclic graph (DAG). Our DAG representation is abstracted to pattern conglomerates, which in turn are evaluated by examining the differences between the beginning and ending positions of similar subsequences. We apply our approach to stock market time

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. IDM-0415190.

Dietmar Dorr is a Ph.D. student in Computer Science at the North Dakota State University (NDSU). He received his M.S. in Software Engineering at the University of St. Thomas, St. Paul, MN, USA. Recently, Dietmar joined the Research & Development team at Thomson Reuters. His research interests include data mining, information retrieval, and natural language processing.

References (52)

  • J. Buhler et al., Finding motifs using random projections
  • G. Chen et al., Sequential pattern mining in multiple streams
  • J. Chen, Making clustering in delay-vector space meaningful, Knowledge and Information Systems (2007)
  • S.H. Chen et al., Computational Intelligence in Economics and Finance (Advanced Information Processing) (2006)
  • Y. Chen et al., SpADe: On shape-based pattern detection in streaming time series
  • B. Chiu et al., Probabilistic discovery of time series motifs
  • G. Das et al., Rule discovery from time series
  • A. Denton, Kernel-density-based clustering of time series subsequences using a continuous random-walk noise model
  • D.H. Dorr, A.M. Denton, Clustering sequences by overlap, International Journal of Data Mining and Bioinformatics, in...
  • M.M. Gaber et al., Mining data streams: a review, ACM SIGMOD Record (2005)
  • M. Gavrilov et al., Mining the stock market: which measure is best?
  • D. Hand, Data mining: statistics and more?, The American Statistician (1998)
  • K.L. Jensen et al., A generic motif discovery algorithm for sequential data, Bioinformatics (2006)
  • H. Kargupta et al., MobiMine: monitoring the stock market from a PDA, ACM SIGKDD Explorations Newsletter (2002)
  • E. Keogh et al., Clustering of time-series subsequences is meaningless: implications for previous and future research
  • B. Kovalerchuk et al., Data Mining in Finance: Advances in Relational and Hybrid Methods (2000)

Anne Denton is Assistant Professor in the Computer Science Department at North Dakota State University (NDSU). She received her Ph.D. in Physics from the University of Mainz, Germany, in 1996, and an M.S. in Computer Science from NDSU in 2003. Her research interests center on data mining of diverse data, including time series, sequence-, graph-, vector- and item data. She serves on the editorial board of the Biomed Central journal Source Code for Biology and Medicine.
