Abstract
The gram-based vector space model has been extensively applied to categorical sequence clustering. However, there is a general lack of efficient methods for determining the length of grams and for identifying the redundant and non-significant grams involved in the model. In this paper, a variable-length gram model is proposed, in contrast to previous studies, which mainly focus on fixed-length grams of sequences. The variable-length grams are obtained using a two-stage pruning method that selects the non-redundant and significant subsequences from prefix trees created from fixed-length grams with an initially large length. A robust partitioning algorithm is then defined for categorical sequence clustering on the normalized representation model, using the variable-length grams collected from the pruned trees. Experimental results on real-world sequence sets from various domains are given to demonstrate the performance of the proposed methods.
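To make the pipeline described in the abstract more concrete, the following is a minimal sketch in Python and not the authors' implementation. The abstract does not state the pruning criteria, so both stages here are stand-in assumptions: a frequency threshold (`min_count`) plays the role of the significance test, and a gram is treated as redundant when a one-symbol extension of it occurs exactly as often. The function names, the use of a `Counter` in place of an explicit prefix tree, and the relative-frequency normalization are likewise hypothetical choices made for illustration.

```python
from collections import Counter


def build_prefix_counts(sequence, max_len):
    """Count every prefix of each window of length max_len in the sequence.

    Inserting each fixed-length gram into a prefix tree and counting node
    visits is equivalent to counting all prefixes of each window; a Counter
    keyed by the prefix string stands in for the tree. Windows at the tail
    of the sequence are shorter than max_len and are counted as they are.
    """
    counts = Counter()
    for start in range(len(sequence)):
        window = sequence[start:start + max_len]
        for end in range(1, len(window) + 1):
            counts[window[:end]] += 1
    return counts


def prune(counts, min_count):
    """Two-stage pruning with assumed (not source-given) criteria."""
    # Stage 1 (assumed significance test): keep grams seen often enough.
    frequent = {g: c for g, c in counts.items() if c >= min_count}
    # Stage 2 (assumed redundancy test): drop a gram when one of its
    # one-symbol extensions survives stage 1 with exactly the same count.
    covered = {(g[:-1], c) for g, c in frequent.items() if len(g) > 1}
    return {g: c for g, c in frequent.items() if (g, c) not in covered}


def normalize(weights):
    """Scale gram counts to relative frequencies (one possible normalization)."""
    total = sum(weights.values())
    return {g: c / total for g, c in weights.items()} if total else {}


if __name__ == "__main__":
    grams = prune(build_prefix_counts("abab", max_len=2), min_count=2)
    print(normalize(grams))
```

For the toy sequence `abab` with `max_len=2` and `min_count=2`, the gram `ba` fails the frequency threshold and `a` is dropped because `ab` occurs equally often, leaving `ab` and `b` as the variable-length grams. The resulting normalized vectors could then be passed to any partitioning algorithm; the robust partitioning algorithm mentioned in the abstract is not reproduced here.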
Notes
Available at http://www.ncbi.nlm.nih.gov.
Available at http://pbil.univ-lyon1.fr.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant no. 61672157, and the Innovative Research Team of Probability and Statistics: Theory and Application (IRTL1704). The authors would like to thank Guiren Lian for his constructive comments that contributed to improving the paper.
Cite this article
Yuan, L., Wang, W. & Chen, L. Two-stage pruning method for gram-based categorical sequence clustering. Int. J. Mach. Learn. & Cyber. 10, 631–640 (2019). https://doi.org/10.1007/s13042-017-0744-y
DOI: https://doi.org/10.1007/s13042-017-0744-y