A two-stage methodology for sequence classification based on sequential pattern mining and optimization

doi:10.1016/j.datak.2008.05.007

Data & Knowledge Engineering

Volume 66, Issue 3, September 2008, Pages 467-487

https://doi.org/10.1016/j.datak.2008.05.007

Abstract

We present a two-stage methodology for sequence classification that employs sequential pattern mining and optimization. In the first stage, a sequence classification model is defined, based on a set of sequential patterns, and two sets of weights are introduced, one for the patterns and one for the classes. In the second stage, an optimization technique is employed to estimate the weight values and achieve optimal classification accuracy. The methodology is evaluated extensively by varying the number of sequences, the number of patterns and the number of classes, and it is compared with similar sequence classification approaches.

Introduction

Sequence classification is an important problem that arises in many real-world applications, such as protein function prediction, text classification and speech recognition [16]. Sequential data are sequences of ordered “events” representing a situation, where each event might be described by a set of predicates. Examples of sequential data include text, biosequences (DNA, proteins), web-usage data, multiplayer games and plan-execution traces. Classification is the procedure in which, given a collection of training records, each containing a set of attributes and a class, a model is found that maps the attributes of each record to its class. Subsequently, this model can be used to provide predictions for new records. Accordingly, given a sequence constructed from letters drawn from a finite alphabet (e.g. the 20-letter alphabet of amino acids in protein classification, or a vocabulary of English words in text classification), a sequence classifier assigns to it a class label, typically drawn from a finite set of mutually exclusive class labels. Data mining and machine learning algorithms offer a number of effective approaches to designing sequence classifiers when a training set of labeled sequences is available [18].

A sequential pattern is a sequence of itemsets that frequently occur in a specific order. An itemset is a non-empty subset of elements, called items, drawn from a set called the alphabet. An itemset thus represents a set of items that occur together. Sequential pattern mining is the procedure that discovers the sequential patterns present in a database of sequences, and it is widely used in a variety of domains, ranging from text to protein and DNA sequences. The problem was first introduced by Agrawal and Srikant [2]; its goal is to discover all frequent sequences of itemsets in a dataset.
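As a minimal illustration of these definitions, the Python sketch below checks whether a pattern (a sequence of itemsets) occurs in order inside a data sequence and computes its support over a small database. The function names `contains` and `support`, the toy database and the use of relative support are assumptions made for illustration; they are not taken from the algorithm of Agrawal and Srikant [2].

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def contains(sequence: Sequence[Itemset], pattern: Sequence[Itemset]) -> bool:
    """Return True if every itemset of `pattern` is a subset of a distinct
    itemset of `sequence`, with the order of the pattern preserved."""
    pos = 0
    for itemset in pattern:
        # Greedily advance to the earliest itemset that covers this one.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def support(database: List[Sequence[Itemset]], pattern: Sequence[Itemset]) -> float:
    """Fraction of database sequences that contain the pattern."""
    if not database:
        return 0.0
    return sum(contains(s, pattern) for s in database) / len(database)


# Hypothetical toy database over the alphabet {a, b, c}.
database = [
    [frozenset("a"), frozenset("bc"), frozenset("a")],
    [frozenset("a"), frozenset("c")],
    [frozenset("b"), frozenset("a")],
]
pattern = [frozenset("a"), frozenset("c")]
pattern_support = support(database, pattern)
```

A pattern would be reported as frequent when its support reaches a user-chosen minimum support threshold; the threshold itself is a parameter of the mining algorithm and is not fixed by this sketch.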

The problem of sequence classification has been addressed in the literature in many ways; the earliest approaches employed finite automata and entropy-based approaches [29]. Several methodologies have also been proposed which use either hidden Markov models [35], [40] or support vector machines [9], [27]. In addition, several sequence classification methods have been proposed as applications in specific domains, such as protein classification [11], [24], [30], [36], text classification [20], [22], speech [35] and handwriting [19] recognition. A different category of techniques treats the problem of sequence classification as a feature mining problem [25], [26], [38], i.e. they mine features from a set of training sequences and then use these features as input to a standard classification algorithm. The FeatureMine algorithm uses these features with the naïve Bayes and Winnow algorithms [25], [26]. The Classify By Sequences (CBS) algorithm uses a simple scoring function [38]. Tseng and Lee [38] proposed two different approaches, CBS_ALL and CBS_CLASS. Experimental results showed that CBS_CLASS outperforms CBS_ALL [38].

Recently, data mining techniques such as association rule mining, sequential pattern mining, clustering and classification have emerged in various research topics [1], [2], [6], [34], [41]. However, most of the existing data mining methods are designed to solve a specific problem. On the other hand, a few compound methods integrate two or more types of data mining techniques to solve complex problems. These compound methods can effectively utilize the advantages of each individual mining technique to improve the overall performance in data mining tasks. For example, the Classification Based on Associations (CBA) [28] method provides higher accuracy than traditional classification methods such as C4.5 [34]. Hence, integrating different types of data mining methods into a new methodology is a promising direction for solving complex data mining problems.

In this work, we propose a novel two-stage methodology for the generation of sequence classification models. In the first stage, a sequence classification model based on sequential patterns is created. This first stage is similar to the CBS_CLASS algorithm, which also builds a sequence classification model from the extracted sequential patterns. The innovation of the proposed methodology is the introduction of weights, which are applied to the sequential patterns and to the classes, and their tuning through optimization during the second stage, which extends the previously reported CBS_CLASS algorithm. The methodology can be considered a compound data mining method that uses sequential pattern mining for sequence classification. The input to our methodology is a set of labeled training sequences, and the output is a function mapping a new, unknown sequence to a class. The classification of an unknown sequence is realized automatically. The methodology employs a sequential pattern mining algorithm, a scoring function that uses the sequential patterns for classification, and an optimization technique that automatically assigns weights to the sequential patterns and to the classes in order to improve the classification accuracy. The proposed methodology is evaluated using both artificial and real data. Artificial data are employed in order to present a working example of the proposed methodology, while real data correspond to two biological problems of high importance: protein fold recognition and class prediction.

The pattern weights that are assigned to the sequential patterns after the optimization stage identify the relative significance of each pattern. The motivation for the use of class weights is that sequential patterns extracted from sequences do not describe all classes equally well; some classes are over-described by the sequential patterns, while for others the description is rather poor. Thus, the class weights are introduced to compensate for this imbalance, and an optimization technique is subsequently used to automatically calculate optimal values for them. To our knowledge, there is no other work in the literature which assigns weights to the extracted sequential patterns and to the classes for sequence classification. Our results indicate that this integration leads to high classification accuracy, superior to previously reported sequence classification methods. Furthermore, the weights assigned to the patterns can provide experts with additional knowledge on the domain of application, through the identification of the most important patterns. Finally, the proposed methodology is generic and can incorporate different algorithms/approaches in any of its stages.
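How the two sets of weights might enter a classifier can be sketched as follows. The scoring form used here (a class weight multiplying the weighted sum of the scores of the patterns matched in the sequence) is an assumed, simplified form for illustration, not the scoring function defined in this work. The names `is_subsequence`, `classify`, `pattern_scores` and `wc`, as well as all numeric values, are hypothetical; sequences are simplified to strings of single items.

```python
from typing import List, Sequence, Tuple


def is_subsequence(pattern: str, sequence: str) -> bool:
    """True if the items of `pattern` appear in `sequence` in the same order
    (not necessarily contiguously)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match.
    return all(item in remaining for item in pattern)


def classify(
    sequence: str,
    patterns: Sequence[str],
    pattern_scores: Sequence[Sequence[float]],  # pattern_scores[i][j]: pattern i, class j
    wp: Sequence[float],  # one weight per pattern
    wc: Sequence[float],  # one weight per class
) -> Tuple[int, List[float]]:
    """Return the index of the best-scoring class and all class scores."""
    matched = [i for i, p in enumerate(patterns) if is_subsequence(p, sequence)]
    totals = [
        wc[j] * sum(wp[i] * pattern_scores[i][j] for i in matched)
        for j in range(len(wc))
    ]
    best_class = max(range(len(wc)), key=totals.__getitem__)
    return best_class, totals


# Hypothetical patterns and per-class pattern scores for two classes.
patterns = ["ab", "bc"]
pattern_scores = [[1.0, 0.0], [0.5, 0.5]]

# With all weights equal to 1 the first class wins; raising the weight of the
# second, poorly described class changes the decision.
unweighted = classify("abc", patterns, pattern_scores, [1.0, 1.0], [1.0, 1.0])
reweighted = classify("abc", patterns, pattern_scores, [1.0, 1.0], [1.0, 4.0])
```

In this toy setting the class weight acts as a per-class gain, which is one way to read the stated motivation of compensating for classes that the extracted patterns describe poorly.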

Section snippets

Methods

The list of symbols employed in this work and their explanation are summarized in Table 1. The proposed methodology includes two stages (Fig. 1). In stage 1, a sequence classification methodology is defined, based on sequential patterns. For the realization of stage 1, a dataset D = {S_i, c_i}, i = 1, … , l_S, where S_i is a sequence and c_i is its class, with l_c different classes (c_i ∈ {1, … , l_c}) and l_S is the number of sequences in the dataset (∣D∣ = l_S), a vector of sequential pattern weights wp and a vector

Implementation

In order to apply the above-described methodology and automatically create a sequence classification model, the following elements need to be defined: (i) the SPM algorithm for the extraction of sequential patterns, (ii) the scoring function for the calculation of the values of all PSM^j matrices and (iii) the optimization elements, such as the objective function of the optimization procedure, the optimization algorithm and the optimization approaches for the calculation of the optimal weights.
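The optimization elements listed in (iii) can be illustrated with a deliberately simple sketch: the objective function is taken to be the training accuracy, and a coordinate-wise search over a small grid of candidate values tunes one weight per class. This is an assumption for illustration only, not the objective function or optimization algorithm used in this work; `accuracy`, `tune_class_weights`, the grid and the toy scores are all hypothetical.

```python
from typing import List, Sequence, Tuple


def accuracy(
    base_scores: Sequence[Sequence[float]],  # base_scores[n][j]: sequence n, class j
    labels: Sequence[int],
    wc: Sequence[float],
) -> float:
    """Objective: fraction of training sequences classified correctly when
    each unweighted class score is multiplied by its class weight."""
    correct = 0
    for scores, label in zip(base_scores, labels):
        weighted = [w * s for w, s in zip(wc, scores)]
        predicted = max(range(len(wc)), key=weighted.__getitem__)
        correct += predicted == label
    return correct / len(labels)


def tune_class_weights(
    base_scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    grid: Sequence[float] = (0.5, 1.0, 2.0, 4.0),
    sweeps: int = 3,
) -> Tuple[List[float], float]:
    """Coordinate-wise grid search: change one class weight at a time and
    keep a candidate only if it strictly improves the training accuracy."""
    n_classes = len(base_scores[0])
    wc = [1.0] * n_classes
    best = accuracy(base_scores, labels, wc)
    for _ in range(sweeps):
        for j in range(n_classes):
            for candidate in grid:
                trial = wc[:j] + [candidate] + wc[j + 1:]
                trial_accuracy = accuracy(base_scores, labels, trial)
                if trial_accuracy > best:
                    wc, best = trial, trial_accuracy
    return wc, best


# Hypothetical unweighted class scores for four training sequences, two classes.
base_scores = [[3.0, 1.0], [2.0, 1.5], [2.0, 1.2], [1.0, 0.9]]
labels = [0, 1, 1, 0]

baseline = accuracy(base_scores, labels, [1.0, 1.0])
tuned_weights, tuned_accuracy = tune_class_weights(base_scores, labels)
```

Training accuracy is piecewise constant in the weights, so a derivative-free search is used in this sketch; the same loop could be extended to pattern weights by treating each pattern weight as one more coordinate.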

Dataset

The artificial dataset is based on the alphabet I = {a, b, c}, by generating sequences of 4 items, which belong to three classes (l_c = 3), thus D = {S_i, c_i} with S_i being the sequence and c_i the corresponding class. Twenty-four sequences were created, eight from each class. From those, six sequences from each class are used to create the sequence classification model and the remaining are used for testing. Thus, the training dataset D_train consists of 18 sequences (∣D_train∣ = 18) and the test dataset D

Application to real data

The proposed methodology is evaluated using a biological sequence dataset. Results of the proposed methodology are presented without the use of stage 2 [14], [15] and with the use of stage 2, by applying all five different optimization approaches (App. 2–App. 6).

Discussion

We presented a novel methodology for the automated generation of sequence classification models, that can be applied in any (discrete) sequential domain. Initially, sequential patterns are extracted from a set of (training) sequences. The scores for each sequential pattern and each class are computed. In addition, optimal weights for each pattern and for each class are calculated using an optimization technique. The obtained optimal pattern and class weights along with the extracted sequential

Conclusions

A two-stage methodology for sequence classification has been presented along with an extensive evaluation. The methodology achieves high accuracy on the sequence classification problem, comparable to or better than previously reported works. The optimization stage significantly improves the results, and the optimal calculated parameters can provide significant knowledge to the experts of the domain of application. Future work will focus on the use of methods for sequential

Themis P. Exarchos was born in Ioannina, Greece, in 1980. He received the Diploma Degree in Computer Engineering and Informatics from the University of Patras, in 2003. He is currently working toward the Ph.D. degree in Medical Physics at the University of Ioannina. His research interests include data mining, decision support systems in healthcare, biomedical applications and bioinformatics.

References (42)

F. Bonchi et al., Extending the state-of-the-art of constraint-based pattern discovery, Data and Knowledge Engineering (2007)

T.-Z. Chen et al., Mining frequent tree-like patterns in large databases, Data and Knowledge Engineering (2007)

G.A. Evangelakis et al., MERLIN – A portable system for multidimensional optimization, Computer Physics Communications (1987)

T.P. Exarchos et al., Mining sequential patterns for protein fold recognition, Journal of Biomedical Informatics (2008)

J. Hu et al., Writer independent on-line handwriting recognition using an HMM approach, Pattern Recognition (2000)

H.-C. Kum et al., Benchmarking the effectiveness of sequential pattern mining methods, Data and Knowledge Engineering (2007)

C. Lampros et al., Sequence-based protein structure prediction using a reduced state-space hidden Markov model, Computers in Biology and Medicine (2007)

S. Mehta et al., ConsDiff: an algorithm for the detection of conserved differences between protein sequences, Data and Knowledge Engineering (2005)

A.G. Murzin et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures, Journal of Molecular Biology (1995)

D.G. Papageorgiou et al., MERLIN-3.1.1. A new version of the Merlin optimization environment, Computer Physics Communications (2004)

R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International...

R. Agrawal, R. Srikant, Mining sequential patterns, in: Proceedings of the 11th International Conference on Data...

S. Amo et al., First-order temporal pattern mining with regular expression constraints, Data and Knowledge Engineering (2007)

J. Ayres, J. Gehrke, T. Yiu, J. Flannick, Sequential pattern mining using bitmaps, in: Proceedings of the Eighth ACM...

S.D. Bay et al., Detecting group differences: mining contrast sets, Data Mining and Knowledge Discovery (2001)

R.J. Bayardo Jr., Brute-force mining of high-confidence classification rules, in: Proceedings of the Third...

H.M. Berman et al., The Protein Data Bank, Nucleic Acids Research (2000)

S. Chakrabartty, G. Cauwenberghs, Forward decoding kernel machines: a hybrid HMM/SVM approach to sequence recognition,...

C. Ding et al., Multi-class protein fold recognition using support vector machines and neural networks, Bioinformatics (2001)

G. Dong, J. Li, Efficient mining of emerging patterns: discovering trends and differences, in: Proceedings of the...

T.P. Exarchos, C. Papaloukas, C. Lampros, D.I. Fotiadis, Protein classification using sequential pattern mining, in:...

Markos G. Tsipouras was born in Athens, Greece, in 1977. He received the diploma degree and the M.Sc. in computer science from the University of Ioannina, Greece, in 1999 and 2002, respectively. He holds a Ph.D. degree in the Automated Diagnosis of Cardiovascular Diseases, from the Department of Computer Science at the University of Ioannina. His research interests include biomedical engineering, decision support and medical expert systems and biomedical applications.

Costas Papaloukas was born in Ioannina, Greece, in 1974. He received the diploma degree in computer science and the Ph.D. degree in biomedical technology from the University of Ioannina, Ioannina, Greece, in 1997 and 2001, respectively. He is an Assistant Professor of Bioinformatics with the Department of Biological Applications and Technology, University of Ioannina. His research interests include biomedical engineering and bioinformatics.

Dimitrios I. Fotiadis was born in Ioannina, Greece, in 1961. He received the Diploma degree in chemical engineering from National Technical University of Athens, Greece, and the Ph.D. degree in chemical engineering from the University of Minnesota, Twin Cities. Since 1995, he has been with the Department of Computer Science, University of Ioannina, Greece, where he currently is an Associate Professor. He is the director of the Unit of Medical Technology and Intelligent Information Systems. His research interests include biomedical technology, biomechanics, scientific computing, and intelligent information systems.
