Abstract
Distinguishing sequential pattern (DSP) mining has been widely employed in applications such as building classifiers and comparing or analyzing protein families. However, in previous studies on DSP mining, the gap constraints are very rigid: they are predetermined, identical for all discovered patterns, and identical at all positions within each pattern. This paper considers a more flexible way to handle gap constraints, allowing the gap constraints between different pairs of adjacent elements in a pattern to differ and allowing different patterns to use different gap constraints. We call the associated patterns DSPs with flexible gap constraints. After discussing the importance of specifying and determining gap constraints flexibly in DSP mining, we present GepDSP, a heuristic mining method based on Gene Expression Programming, for mining DSPs with flexible gap constraints. Our empirical study on real-world data sets demonstrates that GepDSP is effective and efficient, and that DSPs with flexible gap constraints are more effective in capturing discriminating sequential patterns.
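To make the notion of flexible gap constraints concrete, the following is a minimal sketch, not the GepDSP method itself. It assumes a gap is the number of elements skipped between two adjacent pattern elements, that each adjacent pair carries its own inclusive `(min_gap, max_gap)` bound, and that a pattern is scored by the difference between its support in a positive and a negative sequence set. The function names, the scoring choice, and the toy sequences are all illustrative assumptions.

```python
from typing import Sequence, Tuple

Gap = Tuple[int, int]  # (min_gap, max_gap), both inclusive; hypothetical type


def matches(seq: str, pattern: str, gaps: Sequence[Gap]) -> bool:
    """True if `pattern` occurs in `seq` such that gaps[k] bounds the number
    of skipped elements between pattern[k] and pattern[k + 1]."""
    if not pattern or len(gaps) != len(pattern) - 1:
        raise ValueError("need a non-empty pattern and one gap per adjacent pair")
    # Positions in seq where the pattern prefix matched so far can end.
    ends = {i for i, c in enumerate(seq) if c == pattern[0]}
    for k in range(1, len(pattern)):
        lo, hi = gaps[k - 1]
        nxt = set()
        for i in ends:
            # Candidate positions skip between lo and hi elements after i.
            last = min(i + 1 + hi, len(seq) - 1)
            for j in range(i + 1 + lo, last + 1):
                if seq[j] == pattern[k]:
                    nxt.add(j)
        if not nxt:
            return False
        ends = nxt
    return bool(ends)


def support(dataset: Sequence[str], pattern: str, gaps: Sequence[Gap]) -> float:
    """Fraction of sequences in `dataset` that match the pattern."""
    return sum(matches(s, pattern, gaps) for s in dataset) / len(dataset)


def contrast(pos: Sequence[str], neg: Sequence[str],
             pattern: str, gaps: Sequence[Gap]) -> float:
    """Support difference between the positive and negative sets
    (an assumed contrast measure, chosen for illustration)."""
    return support(pos, pattern, gaps) - support(neg, pattern, gaps)


# Toy data, invented for illustration only.
POS = ["ACGT", "AXCGT", "ACXXG"]
NEG = ["AXXCG", "CGA", "ACXXXG"]

flexible = contrast(POS, NEG, "ACG", [(0, 1), (0, 2)])  # per-position gaps
rigid_narrow = contrast(POS, NEG, "ACG", [(0, 1), (0, 1)])  # one gap everywhere
rigid_wide = contrast(POS, NEG, "ACG", [(0, 2), (0, 2)])
print(flexible, rigid_narrow, rigid_wide)
```

On this toy data the per-position setting separates the two sets completely (contrast 1.0), whereas either uniform setting reaches only about 0.67: the narrow bound misses one positive sequence and the wide bound admits one negative sequence.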
This work was supported in part by NSFC 61572332 and the China Postdoctoral Science Foundation 2014M552371.
Notes
1. By default, \(\mathcal {G}_{min}\) and \(\mathcal {G}_{max}\) can be set to 0 and the maximal sequence length of the dataset under consideration, respectively.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Gao, C., Duan, L., Dong, G., Zhang, H., Yang, H., Tang, C. (2016). Mining Top-k Distinguishing Sequential Patterns with Flexible Gap Constraints. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu, D. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9658. Springer, Cham. https://doi.org/10.1007/978-3-319-39937-9_7
DOI: https://doi.org/10.1007/978-3-319-39937-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39936-2
Online ISBN: 978-3-319-39937-9
eBook Packages: Computer Science (R0)