Abstract
Given a taxonomy of events and a dataset of sequences of these events, we study the problem of finding efficient and effective ways to produce a compact representation of the sequences. We model sequences with Markov models whose states correspond to nodes in the provided taxonomy, and each state represents the events in the subtree under the corresponding node. By lumping observed events to states that correspond to internal nodes in the taxonomy, we allow more compact models that are easier to understand and visualize, at the expense of a decrease in the data likelihood. We formally define and characterize our problem, and we propose a scalable search method for finding a good trade-off between two conflicting goals: maximizing the data likelihood, and minimizing the model complexity. We implement these ideas in Taxomo, a taxonomy-driven modeler, which we apply in two different domains, query-log mining and mining of moving-object trajectories. The empirical evaluation confirms the feasibility and usefulness of our approach.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Bicego M, Dovier A, Murino V (2001) Designing the minimal structure of hidden Markov model by bisimulation. Energy Minimization Methods Comput Vis Pattern Recognit 2001:75–90
Bicego M, Murino V, Figueiredo M (2003) Similarity-based clustering of sequences using hidden Markov models. Mach Learn Data Min Pattern Recognit 2003:95–104
Borges J, Levene M (2004) A dynamic clustering-based Markov model for web usage mining. arxiv:cs/0406032
Brinkhoff T (2003) Generating traffic data. IEEE Data Eng Bull 26(2):19–25
Cakmak A, Özsoyoglu G (2008) Taxonomy-superimposed graph mining. In: Proceedings of 11th international conference on Extending Database Technology (EDBT)
Cao H, Jiang D, Pei J, He Q, Liao Z, Chen E, Li H (2008) Context-aware query suggestion by mining click-through and session data. In: Proceeding of the 14th ACM SIGKDD international conference on Knowledge Discovery and Data Mining (KDD)
Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience
Felzenszwalb PF, Huttenlocher DP, Kleinberg JM (2004) Fast algorithms for large-state-space hmms with applications to web usage analysis. In: Advances in Neural Information Processing Systems (NIPS)
Girolami M, Kaban A (2003) Simplicial mixtures of Markov chains: distributed modelling of dynamic user profiles. In: Advances in Neural Information Processing Systems (NIPS)
Guralnik V, Karypis G (2001) A scalable algorithm for clustering protein sequences. In: BIOKDD
Kemeny J, Snell JL (1959) Finite Markov chains. Springer-Verlag
Law MH, Kwok JT (2000) Rival penalized competitive learning for model-based sequence clustering. Pattern Recognition, International Conference on 2
Lee HK, Kim JH (1999) An hmm-based threshold model approach for gesture recognition. IEEE Trans Pattern Anal Mach Intell 21(10): 961–973
Lee JG, Han J, Li X, Gonzalez H (2008) raClass: trajectory classification using hierarchical region-based and trajectory-based clustering. In: Proceedings of the 34th international conference on Very Large Databases (VLDB)
Lee JG, Han J, Whang KY (2007) Trajectory clustering: a partition-and-group framework. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data (SIGMOD)
Li X, Han J, Lee JG, Gonzalez H (2007) Traffic density-based discovery of hot routes in road networks. In: Proceedings of the 10th international Symposium on Advances in Spatial and Temporal Databases (SSTD)
Manavoglu E, Pavlov D, Giles CL (2003) Probabilistic user behavior models. In: Proceedings of 3rd IEEE International Conference on Data Mining (ICDM)
Manning AM, Brass A, Goble CA, Keane JA (1997) Clustering techniques in biological sequence analysis. In: PKDD
Meyer CD (1989) Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems. SIAM Rev 31(2): 240–272
Nanni M, Pedreschi D (2006) Time-focused clustering of trajectories of moving objects. J Intell Inf Syst 27(3): 267–289
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2)
Simon H, Ando J (1961) Aggregation of variables in dynamic systems. Econometrica 29: 111–138
Srikant R, Agrawal R (1995) Mining generalized association rules. In Proceedings of 21th international conference on Very Large Data Bases (VLDB)
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of 5th international conference on Extending Database Technology (EDBT)
Smyth P (1997) Clustering sequences with hidden Markov models. In: Advances in neural information processing systems, vol 9, pp 648–654
Stolcke A, Omohundro SM (1994) Best-first model merging for hidden Markov model induction
Tijms H (1986) Stochastic modelling and analysis: a computational approach. Wiley, New York
Wang J, Zhang Y, Zhou L, Karypis G, Aggarwal CC (2007) Discriminating subsequence discovery for sequence clustering. In: SDM
Welch LR (2003) Hidden Markov models and the baum-welch algorithm. IEEE Inf Theory Soc Newsl 53(4)
White LB, Mahony R, Brushe GD (2000) Lumpable hidden Markov models - model reduction and reduced complexity filtering. IEEE Trans Automat Contr 45(12)
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editors: Aleksander Kołcz, Wray Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe, Taylor.
Rights and permissions
About this article
Cite this article
Bonchi, F., Castillo, C., Donato, D. et al. Taxonomy-driven lumping for sequence mining. Data Min Knowl Disc 19, 227–244 (2009). https://doi.org/10.1007/s10618-009-0141-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-009-0141-6