Abstract
We address the sequence classification problem using a probabilistic model based on hidden Markov models (HMMs). In contrast to commonly used likelihood-based learning methods, such as the joint or conditional maximum likelihood estimator, we introduce a discriminative learning algorithm that focuses on class margin maximization. Our approach has two main advantages: (i) As an extension of support vector machines (SVMs) to sequential, non-Euclidean data, it inherits the benefits of margin-based classifiers, such as provable generalization error bounds. (ii) Unlike many algorithms based on non-parametric estimation of similarity measures that enforce weak constraints on the data domain, it uses the HMM’s latent Markov structure to regularize the model in the high-dimensional sequence space. In an extensive set of evaluations on time-series sequence data of the kind that frequently appears in data mining and computer vision, the proposed method shows significant improvements in classification performance.
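To make the notion of a class margin concrete, the following is a minimal sketch under stated assumptions, not the authors' algorithm. It assumes one discrete-emission HMM per class, scores a sequence by class log-likelihood plus log prior, and takes the margin to be the score of the true class minus the best competing score. The abstract does not spell out this definition, and the function names and toy parameters below are hypothetical. The sketch only evaluates the margin; it does not perform the margin-maximizing training the abstract describes.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete-emission HMM.

    pi: (S,)   initial state probabilities
    A:  (S, S) transitions, A[i, j] = P(s_t = j | s_{t-1} = i)
    B:  (S, V) emissions,   B[i, k] = P(x_t = k | s_t = i)
    """
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()                      # scaling constant avoids underflow
        log_p += np.log(c)
        alpha = (alpha / c) @ A * B[:, obs[t]]
    return log_p + np.log(alpha.sum())

def class_scores(obs, models, log_priors):
    """Score of each class: log P(obs | class) + log P(class)."""
    return np.array([log_likelihood(obs, *m) + lp
                     for m, lp in zip(models, log_priors)])

def margin(obs, label, models, log_priors):
    """Assumed margin: true-class score minus the best competing class score."""
    s = class_scores(obs, models, log_priors)
    return s[label] - np.max(np.delete(s, label))

# Hypothetical toy setup: two classes sharing dynamics, differing in emissions.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B0 = np.array([[0.9, 0.1], [0.6, 0.4]])   # class 0 favours symbol 0
B1 = np.array([[0.1, 0.9], [0.4, 0.6]])   # class 1 favours symbol 1
models = [(pi, A, B0), (pi, A, B1)]
log_priors = np.log([0.5, 0.5])
seq = [0, 0, 0, 0]

predicted = int(np.argmax(class_scores(seq, models, log_priors)))
seq_margin = margin(seq, 0, models, log_priors)
```

A positive margin means the sequence is classified correctly with some slack. A margin-based learner would adjust the HMM parameters to enlarge this quantity over the training set, whereas maximum likelihood fits each class model without reference to the competing classes.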
Additional information
Responsible editor: Charles Elkan.
Cite this article
Kim, M., Pavlovic, V. Sequence classification via large margin hidden Markov models. Data Min Knowl Disc 23, 322–344 (2011). https://doi.org/10.1007/s10618-010-0206-6
DOI: https://doi.org/10.1007/s10618-010-0206-6