Two-stage pruning method for gram-based categorical sequence clustering

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

The gram-based vector space model has been extensively applied to categorical sequence clustering. However, efficient methods are generally lacking for determining the length of grams and for identifying the redundant and non-significant grams involved in the model. In this paper, a variable-length gram model is proposed, in contrast to previous studies, which focused mainly on fixed-length grams of sequences. The variable-length grams are obtained using a two-stage pruning method that selects the non-redundant and significant subsequences from prefix trees, which are created from fixed-length grams with an initially large length. A robust partitioning algorithm is then defined for categorical sequence clustering on the normalized representation model, using the variable-length grams collected from the pruned trees. Experimental results on real-world sequence sets from various domains are given to demonstrate the performance of the proposed methods.
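
The pipeline outlined in the abstract (fixed-length grams, prefix trees, two pruning stages, variable-length grams) can be illustrated with a minimal Python sketch. The abstract does not state the actual pruning criteria, so both are assumptions made here for illustration only: a gram is treated as redundant when one of its one-symbol extensions occurs equally often, and as non-significant when its count falls below a threshold. The function names, the `max_len` and `min_count` parameters, and the use of a flat counter in place of an explicit tree are likewise hypothetical and are not the authors' method.

```python
from collections import Counter


def prefix_counts(sequences, max_len):
    """Count every prefix of every fixed-length gram (length max_len).

    A Counter keyed by prefix plays the role of a prefix tree whose
    nodes store occurrence counts.
    """
    counts = Counter()
    for seq in sequences:
        # Only full windows of length max_len are used as grams.
        for start in range(len(seq) - max_len + 1):
            gram = seq[start:start + max_len]
            for k in range(1, max_len + 1):
                counts[gram[:k]] += 1
    return counts


def prune_redundant(counts):
    """Stage 1 (assumed criterion): drop a gram if a one-symbol
    extension has the same count, since the longer gram then carries
    the same occurrence information."""
    redundant = {
        g[:-1] for g, c in counts.items()
        if len(g) > 1 and counts[g[:-1]] == c
    }
    return {g: c for g, c in counts.items() if g not in redundant}


def prune_insignificant(counts, min_count):
    """Stage 2 (assumed criterion): drop grams rarer than min_count."""
    return {g: c for g, c in counts.items() if c >= min_count}


def variable_length_grams(sequences, max_len=3, min_count=2):
    """Run both pruning stages and return the surviving grams."""
    counts = prefix_counts(sequences, max_len)
    return prune_insignificant(prune_redundant(counts), min_count)


if __name__ == "__main__":
    # Toy categorical sequences over the alphabet {a, b, c, d}.
    print(variable_length_grams(["abcabc", "abcd"]))
    # → {'abc': 3, 'bc': 2}
```

In this toy run the surviving grams have different lengths (`abc` and `bc`), which is the sense in which the resulting model is variable-length; each sequence could then be represented by its normalized counts of these grams before clustering.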

Notes

  1. Available at http://www.ncbi.nlm.nih.gov.

  2. Available at http://pbil.univ-lyon1.fr.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant no. 61672157, and the Innovative Research Team of Probability and Statistics: Theory and Application (IRTL1704). The authors would like to thank Guiren Lian for his constructive comments that contributed to improving the paper.

Corresponding author

Correspondence to Lifei Chen.

Cite this article

Yuan, L., Wang, W. & Chen, L. Two-stage pruning method for gram-based categorical sequence clustering. Int. J. Mach. Learn. & Cyber. 10, 631–640 (2019). https://doi.org/10.1007/s13042-017-0744-y
