Abstract
Group comparison is a fundamental task in many scientific endeavours and is also the basis of any classifier. Contrast sets and emerging patterns capture differences between groups of categorical data. Comparing groups of sequence data is a relevant task in many applications. We define Emerging Sequences (ESs) as subsequences that are frequent in the sequences of one group and less frequent in the sequences of another, and that therefore distinguish, or contrast, sequences of different classes. Distinguishing sequence classes poses two challenges: extracting ESs efficiently is not trivial, and only exact matches of sequences are considered. We address these problems with a suffix tree-based framework and a similar matching mechanism. We then propose a classifier based on Emerging Sequences. Evaluated against two learning algorithms, one based on frequent subsequences and one on exact matching subsequences, our model outperforms the baseline approaches by up to 20% in prediction accuracy in experiments on two datasets.
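To make the definition of an Emerging Sequence concrete, the following is a minimal sketch, not the suffix tree-based method of the paper. It assumes that a subsequence preserves order but may contain gaps, that support is the fraction of sequences in a group containing the pattern, and that "frequent" and "less frequent" are decided by two hypothetical thresholds, `min_support` and `max_support`. All function names and the toy data are illustrative only.

```python
from typing import Sequence


def is_subsequence(pattern: str, sequence: str) -> bool:
    """Return True if the symbols of pattern occur in sequence in order.

    Assumption: gaps between matched symbols are allowed.
    """
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator past the first match,
    # so later symbols must be found strictly after earlier ones.
    return all(symbol in remaining for symbol in pattern)


def support(pattern: str, group: Sequence[str]) -> float:
    """Fraction of sequences in the group that contain the pattern."""
    if not group:
        return 0.0
    return sum(is_subsequence(pattern, s) for s in group) / len(group)


def is_emerging(
    pattern: str,
    target: Sequence[str],
    other: Sequence[str],
    min_support: float = 0.5,   # hypothetical threshold for "frequent"
    max_support: float = 0.25,  # hypothetical threshold for "less frequent"
) -> bool:
    """Check whether pattern is frequent in target and infrequent in other."""
    return (
        support(pattern, target) >= min_support
        and support(pattern, other) <= max_support
    )


# Toy sequence groups (invented for illustration, not from the paper).
group_a = ["ACGT", "ACCT", "AGCT", "TTGA"]
group_b = ["TGCA", "GGTA", "CATG", "ACTT"]

print(support("ACT", group_a))               # 0.75
print(support("ACT", group_b))               # 0.25
print(is_emerging("ACT", group_a, group_b))  # True
```

With these thresholds, "ACT" contrasts the first group against the second but not the reverse. This brute-force check scans every sequence for every candidate, which illustrates why efficient extraction is the first challenge named above.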
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Deng, K., Zaïane, O.R. (2009). Contrasting Sequence Groups by Emerging Sequences. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science, vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_29
DOI: https://doi.org/10.1007/978-3-642-04747-3_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04746-6
Online ISBN: 978-3-642-04747-3
eBook Packages: Computer Science