Two-stage pruning method for gram-based categorical sequence clustering

Yuan, Liang; Wang, Wenjian; Chen, Lifei

doi:10.1007/s13042-017-0744-y

Original Article
Published: 15 November 2017

Volume 10, pages 631–640, (2019)
International Journal of Machine Learning and Cybernetics

Liang Yuan¹,
Wenjian Wang² &
Lifei Chen³

Abstract

The gram-based vector space model has been applied extensively to categorical sequence clustering. However, efficient methods are generally lacking for determining the length of the grams and for identifying the redundant and non-significant grams in the model. This paper proposes a variable-length gram model, in contrast to previous studies, which focus mainly on fixed-length grams of sequences. The variable-length grams are obtained with a two-stage pruning method that selects the irredundant and significant subsequences from prefix trees built from fixed-length grams of an initially large length. A robust partitioning algorithm is then defined for clustering categorical sequences on the normalized representation model, using the variable-length grams collected from the pruned trees. Experimental results on real-world sequence sets from various domains demonstrate the performance of the proposed methods.
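
To make the pipeline in the abstract more concrete, the following is a minimal Python sketch, not the authors' algorithm. The abstract does not state the pruning criteria or the normalization, so everything specific here is an assumption made for illustration: the function names (`prefix_counts`, `prune`, `normalized_vector`), the `min_count` threshold used as the significance test, the "same count as a one-symbol extension" rule used as the redundancy test, and relative-frequency normalization. The prefix tree is represented implicitly as a table of prefix counts.

```python
from collections import Counter


def prefix_counts(sequences, max_len):
    """Count every prefix of every gram of length max_len.

    This is the information a prefix tree built from the fixed-length
    grams would hold at its nodes (grams are shorter near sequence ends).
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for n in range(1, min(max_len, len(seq) - i) + 1):
                counts[seq[i:i + n]] += 1
    return counts


def prune(counts, min_count):
    """Two illustrative pruning stages (both rules are assumptions)."""
    # Stage 1: keep only grams that occur at least min_count times.
    significant = {g: c for g, c in counts.items() if c >= min_count}
    # Stage 2: a gram is treated as redundant when a surviving one-symbol
    # extension has the same count, i.e. the gram never occurs without it.
    redundant = set()
    for gram, count in significant.items():
        parent = gram[:-1]
        if len(gram) > 1 and significant.get(parent) == count:
            redundant.add(parent)
    return sorted(g for g in significant if g not in redundant)


def normalized_vector(seq, grams):
    """Relative frequencies of the selected grams in one sequence."""
    raw = [sum(seq.startswith(g, i) for i in range(len(seq))) for g in grams]
    total = sum(raw)
    return [x / total if total else 0.0 for x in raw]


sequences = ["abab", "abab"]
grams = prune(prefix_counts(sequences, max_len=2), min_count=3)
print(grams)                             # ['ab', 'b']
print(normalized_vector("abab", grams))  # [0.5, 0.5]
```

In this toy run, "ba" is dropped as too rare and "a" is dropped because it never occurs without a following "b", leaving grams of different lengths. The resulting vectors could then be passed to any partitioning clusterer; the robust partitioning algorithm defined in the paper is not reproduced here.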

Notes

Available at http://www.ncbi.nlm.nih.gov.
Available at http://pbil.univ-lyon1.fr.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant no. 61672157, and the Innovative Research Team of Probability and Statistics: Theory and Application (IRTL1704). The authors would like to thank Guiren Lian for his constructive comments that contributed to improving the paper.

Author information

Authors and Affiliations

Network Operation Maintenance Center, University of Electronic Science and Technology of China, Chengdu, China
Liang Yuan
Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, China
Wenjian Wang
Digit Fujian Internet-of-Thing Laboratory of Environment Monitoring, Fujian Normal University, Fuzhou, China
Lifei Chen

Corresponding author

Correspondence to Lifei Chen.

About this article

Cite this article

Yuan, L., Wang, W. & Chen, L. Two-stage pruning method for gram-based categorical sequence clustering. Int. J. Mach. Learn. & Cyber. 10, 631–640 (2019). https://doi.org/10.1007/s13042-017-0744-y

Received: 10 February 2017
Accepted: 07 November 2017
Published: 15 November 2017
Issue Date: 02 April 2019
DOI: https://doi.org/10.1007/s13042-017-0744-y
