Combination of sequential class distributions from multiple channels using Markov fusion networks

Glodek, Michael; Schels, Martin; Schwenker, Friedhelm; Palm, Günther

doi:10.1007/s12193-014-0149-0

Original Paper
Published: 08 March 2014

Volume 8, pages 257–272, (2014)
Journal on Multimodal User Interfaces


Abstract

The recognition of patterns in real-time scenarios has become an important trend in the field of multi-modal user interfaces in human-computer interaction. Cognitive technical systems aim to improve human-computer interaction by recognizing the situative context, e.g. through activity recognition (Ahad et al. in IEEE, 1896–1901, 2008), or by estimating the affective state (Zeng et al., IEEE Trans Pattern Anal Mach Intell 31(1):39–58, 2009) of the human dialogue partner. Classifier systems developed for such applications must operate on multiple modalities and must integrate the available decisions over long time periods. We address this topic by introducing the Markov fusion network (MFN), a novel classifier combination approach for integrating multi-class and multi-modal decisions continuously over time. The MFN combines results while meeting real-time requirements, weighting the decisions of the modalities dynamically, and handling sensor failures. The proposed MFN has been evaluated in two empirical studies: the recognition of objects involved in human activities and the recognition of emotions, in both of which it performed well. Furthermore, we show how the MFN can be applied in a variety of architectures and describe the options for configuring the model to meet the demands of a specific problem.


Notes

The Kinect™ camera is an input device developed by Microsoft®. http://www.xbox.com/en-US/Kinect (14/01/2014).
The F\(_1\) measure is defined by \(F_1 = 2 \frac{P\cdot R}{P + R}\) where \(P\) is the precision and \(R\) the recall.
A comprehensive description and the data can be found at http://sspnet.eu/avec2011/ (14/01/2014).
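
The F\(_1\) definition given in the notes above can be sketched in a few lines of Python. This is only an illustration of the formula, not code from the article: the function name `f1_score` is hypothetical, and returning 0.0 when both precision and recall are zero is an assumed convention that the article does not specify.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return F1 = 2 * P * R / (P + R), the harmonic mean of P and R.

    Assumption (not from the article): when P + R == 0 the measure is
    undefined, so 0.0 is returned by convention.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative values only; these are not results from the article.
    print(f1_score(1.0, 0.5))
```

Because F\(_1\) is a harmonic mean, it is pulled towards the smaller of the two values, so a classifier cannot score well by maximizing precision or recall alone.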

References

Ahad MAR, Tan J, Kim H, Ishikawa S (2008) Human activity recognition: various paradigms. In: Proceedings of the international conference on control, automation and systems (ICCAS). IEEE, pp 1896–1901. doi:10.1109/ICCAS.2008.4694407
Bicego M, Murino V, Figueiredo M (2003) Similarity-based clustering of sequences using hidden Markov models. In: Proceedings of the international conference on machine learning and data mining (MLDM), Lecture Notes in Computer Science (LNCS), vol 2734. Springer, Berlin, pp 95–104. doi:10.1007/3-540-45065-3-8
Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
Brand M, Oliver N, Pentland A (1997) Coupled hidden Markov models for complex action recognition. In: Proceedings of the international IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 994–999. doi:10.1109/CVPR.1997.609450
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140. doi:10.1007/BF00058655
Breiman L (2001) Random forests. Mach Learn 45(1):5–32. doi:10.1023/A:1010933404324
Buss M, Beetz M, Wollherr D (2007) CoTeSys—cognition for technical systems. In: Proceedings of the COE workshop on human adaptive mechatronics (HAM)
Castellano G, Leite I, Pereira A, Martinho C, Paiva A, McOwan PW (2010) Affect recognition for interactive companions: challenges and design in real world scenarios. J Multimodal User Interfaces 3(1–2):89–98. doi:10.1007/s12193-009-0033-5
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
Diebel J, Thrun S (2006) An application of Markov random fields to range sensing. In: Proceedings of advances in neural information processing systems (NIPS), vol 18. MIT Press, Cambridge, pp 291–298
Dietrich C, Palm G, Riede K, Schwenker F (2004) Classification of bioacoustic time series based on the combination of global and local decisions. Pattern Recognit 37(12):2293–2305. doi:10.1016/j.patcog.2004.04.004
Dietrich CR (2004) Temporal sensorfusion for the classification of bioacoustic time series. Ph.D. thesis, Institute of Neural Information Processing, University of Ulm, Ulm, Germany
Douglas-Cowie E, Campbell N, Cowie R, Roach P (2003) Emotional speech: towards a new generation of databases. Speech Commun 40(1–2):33–60. doi:10.1016/S0167-6393(02)00070-5
Ekman P (1992) An argument for basic emotions. Cognit Emot 6(3–4):169–200
Fontaine J, Scherer K, Roesch E, Ellsworth P (2007) The world of emotions is not two-dimensional. Psychol Sci 18(12):1050
Freeman W, Roth M (1995) Orientation histograms for hand gesture recognition. Tech. Rep. TR94-03, Mitsubishi Electric Research Laboratories. Originally published at the International Workshop on Automatic Face and Gesture Recognition
Glodek M, Bigalke L, Schels M, Schwenker F (2011) Incorporating uncertainty in a layered HMM architecture for human activity recognition. In: Proceedings of the joint workshop on human gesture and behavior understanding (J-HGBU). ACM, pp 33–34. doi:10.1145/2072572.2072584
Glodek M, Reuter S, Schels M, Dietmayer K, Schwenker F (2013) Kalman filter based classifier fusion for affective state recognition. In: Zhou ZH, Roli F, Kittler J (eds) Multiple classifier systems (MCS), Lecture Notes in Computer Science (LNCS), vol 7872. Springer, Berlin, pp 85–94. doi:10.1007/978-3-642-38067-9_8
Glodek M, Schels M, Palm G, Schwenker F (2012) Multiple classifier combination using reject options and Markov fusion networks. In: Proceedings of the international ACM conference on multimodal interaction (ICMI). ACM, pp 465–472. doi:10.1145/2388676.2388778
Glodek M, Scherer S, Schwenker F (2011) Conditioned hidden Markov model fusion for multimodal classification. In: Proceedings of the annual conference of the international speech communication association (Interspeech). ISCA, pp 2269–2272
Glodek M, Schwenker F, Palm G (2012) Detecting actions by integrating sequential symbolic and sub-symbolic information in human activity recognition. In: Perner P (ed) Proceedings of the international conference on machine learning and data mining (MLDM), Lecture Notes in Computer Science (LNCS), vol 7376. Springer, Berlin, pp 394–404. doi:10.1007/978-3-642-31537-4_31
Glodek M, Trentin E, Schwenker F, Palm G (2013) Hidden Markov models with graph densities for action recognition. In: Proceedings of the international joint conference on neural networks (IJCNN). IEEE, pp 964–969
Glodek M, Tschechne S, Layher G, Schels M, Brosch T, Scherer S, Kächele M, Schmidt M, Neumann H, Palm G, Schwenker F (2011) Multiple classifier systems for the classification of audio-visual emotional states. In: D’Mello S, Graesser A, Schuller B, Martin JC (eds) Affective computing and intelligent interaction, Lecture Notes in Computer Science (LNCS), vol 6975. Springer, Berlin, pp 359–368. doi:10.1007/978-3-642-24571-8_47
Huang X, Acero A, Hon H (2001) Spoken language processing: a guide to theory, algorithm, and system development. Prentice Hall
Kim M, Kumar S, Pavlovic V, Rowley H (2008) Face tracking and recognition with visual constraints in real-world videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 1–8. doi:10.1109/CVPR.2008.4587572
Kittler J, Hatef M, Duin RP, Matas J (1998) On combining classifiers. IEEE Trans Pattern Anal Mach Intell 20(3):226–239. doi:10.1109/34.667881
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. The MIT Press, Cambridge
Krell G, Glodek M, Panning A, Siegert I, Michaelis B, Wendemuth A, Schwenker F (2012) Fusion of fragmentary classifier decisions for affective state recognition. In: Schwenker F, Scherer S, Morency LP (eds) Multimodal pattern recognition of social signals in human-computer-interaction, Lecture Notes in Computer Science (LNCS), vol 7742. Springer, Berlin, pp 116–130. doi:10.1007/978-3-642-37081-6_13
Kuncheva LI (2004) Combining pattern classifiers: methods and algorithms. Wiley, New York. doi:10.1002/0471660264
Littlewort G, Whitehill J, Wu T, Fasel I, Frank M, Movellan J, Bartlett M (2011) The computer expression recognition toolbox (CERT). In: Proceedings of the international conference IEEE on automatic face gesture recognition and workshops (FG). IEEE, pp 298–305. doi:10.1109/FG.2011.5771414
McKeown G, Valstar M, Cowie R, Pantic M (2010) The SEMAINE corpus of emotionally coloured character interactions. In: Proceedings of the international conference on multimedia and expo (ICME). IEEE, pp 1079–1084. doi:10.1109/ICME.2010.5583006
Meng H, Bianchi-Berthouze N (2011) Naturalistic affective expression classification by a multi-stage approach based on hidden Markov models. In: D’Mello S, Graesser A, Schuller B, Martin JC (eds) Proceedings of the international conference on affective computing and intelligent interaction (ACII), Lecture Notes in Computer Science (LNCS), vol 6975. Springer, pp 378–387. doi:10.1007/978-3-642-24571-8_49
Oliver N, Garg A, Horvitz E (2004) Layered representations for learning and inferring office activity from multiple sensory channels. Comput Vis Image Underst 96(2):163–180. doi:10.1016/j.cviu.2004.02.004. (Special issue: Event Detection in video)
Palm G, Glodek M (2013) Towards emotion recognition in human computer interaction. In: Esposito A, Squartini S, Palm G (eds) Neural nets and surroundings, smart innovation, systems and technologies, vol 19. Springer, pp 323–336. doi:10.1007/978-3-642-35467-0_32
Pan H, Levinson S, Huang T, Liang ZP (2004) A fused hidden Markov model with application to bimodal speech processing. IEEE Trans Signal Process 52(3):573–581. doi:10.1109/TSP.2003.822353
Platt J (2000) Probabilistic outputs for SV machines, chap. 5. Neural Information Processing Series. MIT Press, Cambridge, pp 61–74
Ramirez GA, Baltrušaitis T, Morency LP (2011) Modeling latent discriminative dynamic of multi-dimensional affective signals. In: D’Mello S, Graesser A, Schuller B, Martin JC (eds) Proceedings of the international conference on affective computing and intelligent interaction (ACII), Lecture Notes in Computer Science (LNCS), vol 6975. Springer, pp 396–406. doi:10.1007/978-3-642-24571-8_51
Schels M, Glodek M, Meudt S, Scherer S, Schmidt M, Layher G, Tschechne S, Brosch T, Hrabal D, Walter S, Palm G, Neumann H, Traue H, Schwenker F (2013) Multi-modal classifier-fusion for the recognition of emotions. In: Coverbal synchrony in Human-Machine Interaction. CRC Press, pp 73–97
Schels M, Glodek M, Meudt S, Schmidt M, Hrabal D, Böck R, Walter S, Schwenker F (2012) Multi-modal classifier-fusion for the classification of emotional states in WOZ scenarios. In: Ji YG (ed) Advances in affective and pleasurable design, vol 22 in Advances in Human Factors and Ergonomics Series. CRC Press, pp 644–653. doi:10.1201/b12525-78
Schels M, Kächele M, Glodek M, Hrabal D, Walter S, Schwenker F (2013) Using unlabeled data to improve classification of emotional states in human computer interaction. J Multimodal User Interfaces 1–12. doi:10.1007/s12193-013-0133-0 (Special Issue: From Multimodal Analysis to Real-Time Interactions with Virtual Agents)
Schels M, Kächele M, Hrabal D, Walter S, Traue H, Schwenker F (2012) Classification of emotional states in a Woz scenario exploiting labeled and unlabeled bio-physiological data. In: Schwenker F, Trentin E (eds) Proceedings of the international conference on partially supervised learning (PSL), Lecture Notes in Computer Science (LNCS), vol 7081. Springer, pp 138–147. doi:10.1007/978-3-642-28258-4_15
Schels M, Scherer S, Glodek M, Kestler H, Palm G, Schwenker F (2013) On the discovery of events in EEG data utilizing information fusion. Comput Stat 28(1):5–18. doi:10.1007/s00180-011-0292-y
Scherer S, Glodek M, Layher G, Schels M, Schmidt M, Brosch T, Tschechne S, Schwenker F, Neumann H, Palm G (2012) A generic framework for the inference of user states in human computer interaction: How patterns of low level behavioral cues support complex user states in HCI. J Multimodal User Interfaces 6(3–4):117–141. doi:10.1007/s12193-012-0093-9
Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the international conference on pattern recognition (ICPR), vol 3. IEEE, pp 32–36
Schuller B, Seppi D, Batliner A, Maier A, Steidl S (2007) Towards more reality in the recognition of emotional speech. In: Proceedings of the international IEEE conference on acoustics, speech and signal processing (ICASSP), vol 4. IEEE, pp 941–944. doi:10.1109/ICASSP.2007.367226
Schuller B, Valstar M, Eyben F, McKeown G, Cowie R, Pantic M (2011) AVEC 2011—the first international audio visual emotion challenges. In: D’Mello S, Graesser A, Schuller B, Martin JC (eds) Proceedings of the international conference on affective computing and intelligent interaction (ACII), Lecture Notes in Computer Science (LNCS), vol 6975. Springer, pp 415–424. doi:10.1007/978-3-642-24571-8_53 (Part II)
Schwenker F, Dietrich CR, Thiel C, Palm G (2006) Learning of decision fusion mappings for pattern recognition. J Artif Intell Mach Learn 17–21 (Special issue: Multiple Classifier Systems)
Swain M, Ballard D (1991) Color indexing. Int J Comput Vis 7(1):11–32
Szczot M, Löhlein O, Palm G (2012) Dempster-Shafer fusion of context sources for pedestrian recognition. In: Denoeux T, Masson MH (eds) Belief functions: theory and applications, advances in intelligent and soft computing, vol 164. Springer, pp 319–326
Thiel C (2010) Multiple classifier systems incorporating uncertainty. Verlag Dr. Hut
Vinciarelli A, Pantic M, Bourlard H, Pentland A (2008) Social signal processing: State-of-the-art and future perspectives of an emerging domain. In: Proceedings of the international ACM conference on multimedia (MM). ACM, pp 1061–1070. doi:10.1145/1459359.1459573
Vlasenko B, Schuller B, Wendemuth A, Rigoll G (2007) Frame vs. turn-level: emotion recognition from speech considering static and dynamic processing. In: Paiva AC, Prada R, Picard RW (eds) Proceedings of the international conference on affective computing and intelligent interaction (ACII), Lecture Notes in Computer Science (LNCS), vol 4738. Springer, pp 139–147. doi:10.1007/978-3-540-74889-2_13
Wahlster W (2003) SmartKom: symmetric multimodality in an adaptive and reusable dialogue shell. In: Krahl R, Günther D (eds) Proceedings of the status conference “Human Computer Interaction”. DLR, pp 47–62
Wendemuth A, Biundo S (2012) A companion technology for cognitive technical systems. In: Esposito A, Esposito AM, Vinciarelli A, Hoffmann R, Müller VC (eds) Cognitive behavioural systems, Lecture Notes in Computer Science (LNCS), vol 7403. Springer, pp 89–103. doi:10.1007/978-3-642-34584-5_7
Wöllmer M, Metallinou A, Eyben F, Schuller B, Narayanan S (2010) Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling. In: Proceedings of the annual conference of the international speech communication association (ISCA), interspeech, pp 2362–2365
Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 31(1):39–58. doi:10.1109/TPAMI.2008.52
Zhu X (2005) Semi-supervised learning literature survey. Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison

Acknowledgments

This paper is based on work done within the Transregional Collaborative Research Centre SFB/TRR 62 Companion-Technology for Cognitive Technical Systems funded by the German Research Foundation (DFG).

Author information

Authors and Affiliations

Institute of Neural Information Processing, Ulm, Germany
Michael Glodek, Martin Schels, Friedhelm Schwenker & Günther Palm

Corresponding author

Correspondence to Michael Glodek.

Appendix

This section contains the gradient descent algorithm used to determine the MFN estimate (Algorithm 1) and the parameters used in the second experiment (Table 7).

Table 7 Emotion recognition parameter setting of the MFN

About this article

Cite this article

Glodek, M., Schels, M., Schwenker, F. et al. Combination of sequential class distributions from multiple channels using Markov fusion networks. J Multimodal User Interfaces 8, 257–272 (2014). https://doi.org/10.1007/s12193-014-0149-0

Received: 27 March 2013
Accepted: 06 February 2014
Published: 08 March 2014
Issue Date: September 2014
DOI: https://doi.org/10.1007/s12193-014-0149-0
