
Cluster-based approach to discriminate the user’s state whether a user is embarrassed or thinking to an answer to a prompt

Published in: Journal on Multimodal User Interfaces

Abstract

Spoken dialog systems are employed in various devices to help users operate them. An advantage of a spoken dialog system is that the user can make input utterances freely, but users sometimes find it difficult to speak to the system. The system should estimate the state of a user who encounters a problem when starting a dialog and then give appropriate help before the user abandons the dialog. Based on this assumption, our research aims to construct a system that responds to a user who does not reply to the system. In this paper, we propose a method of discriminating the user’s state based on vector quantization of non-verbal information such as prosodic features, facial feature points, and gaze. The experimental results showed that the proposed method outperforms conventional approaches, achieving a discrimination ratio of 72.0%. We then examined sequential discrimination for responding to the user with appropriate timing. The results indicate that the discrimination ratio reached the level obtained at the end of the session at around 6.0 s.
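To make the abstract's pipeline concrete, the following is a minimal sketch of vector quantization of frame-level features into a codeword histogram that a classifier could then discriminate. The abstract does not specify the clustering algorithm or feature dimensions, so a plain k-means codebook and synthetic 4-dimensional "frames" (hypothetical stand-ins for prosodic, facial-point, and gaze features) are assumptions here, not the authors' actual implementation.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: cluster the rows of X into k codewords."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct random samples
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each frame to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def vq_histogram(frames, centers):
    """Quantize frames to their nearest codewords; return a normalized histogram."""
    dists = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Toy demo: two synthetic "user states" with different feature distributions.
rng = np.random.default_rng(1)
state_a = rng.normal(0.0, 1.0, size=(200, 4))  # hypothetical "embarrassed" frames
state_b = rng.normal(3.0, 1.0, size=(200, 4))  # hypothetical "thinking" frames
codebook = kmeans(np.vstack([state_a, state_b]), k=8)
h_a = vq_histogram(state_a, codebook)
h_b = vq_histogram(state_b, codebook)
print(h_a.round(2), h_b.round(2))
```

In this framing, each session yields one histogram over codewords, and any standard classifier (e.g. SVM or nearest-centroid) could be trained on those histograms; the sequential-discrimination experiment would simply recompute the histogram from the frames observed so far at each time step.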




Acknowledgements

Funding was provided by Grant-in-Aid for JSPS Research Fellow (Grant No. 263989) and Grants-in-Aid for Scientific Research (Grant No. JP15H02720).

Author information


Corresponding author

Correspondence to Yuya Chiba.


About this article


Cite this article

Chiba, Y., Nose, T. & Ito, A. Cluster-based approach to discriminate the user’s state whether a user is embarrassed or thinking to an answer to a prompt. J Multimodal User Interfaces 11, 185–196 (2017). https://doi.org/10.1007/s12193-017-0238-y

