Speech and language processing for assessing child–adult interaction based on diarization and location

Hansen, John H. L.; Najafian, Maryam; Lileikyte, Rasa; Irvin, Dwight; Rous, Beth

doi:10.1007/s10772-019-09590-0

Published: 05 June 2019

Volume 22, pages 697–709, (2019)

International Journal of Speech Technology

John H. L. Hansen ORCID: orcid.org/0000-0003-1382-9929¹,
Maryam Najafian¹,
Rasa Lileikyte¹,
Dwight Irvin²,³ &
Beth Rous³

Abstract

Understanding and assessing child verbal communication patterns is critical to facilitating effective language development. Speaker diarization is typically used to explore children’s verbal engagement, and knowing which activity areas stimulate verbal communication can help promote more efficient language development. In this study, we present a two-stage child vocal engagement prediction system that consists of (1) a near real-time, noise-robust system that measures the duration of child-to-adult and child-to-child conversations and tracks the number of conversational turns, and (2) a novel child location tracking strategy that determines in which activity areas a child spends the most and the least time. The proposed child–adult turn-taking solution relies exclusively on vocal cues observed during the interaction between a child and other children and/or classroom teachers. A threshold-optimized speech activity detector that uses a linear combination of voicing measures (TO-COMBO-SAD) provides effective speech/non-speech segment detection prior to conversation assessment. TO-COMBO-SAD reduces classification error rates for adult–child audio by 21.34% and 27.3% compared to a baseline i-Vector diarization system and a standard Bayesian Information Criterion diarization system, respectively. In addition, this study presents a unique adult–child location tracking system that helps determine the quantity of child–adult communication in specific activity areas, and which activities stimulate verbal engagement in a child–adult education space. We observe that the proposed location tracking solution offers unique opportunities to assess speech and language interaction for children and to quantify the location context that contributes to improved verbal communication.
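
The two quantities named in the abstract, conversational turn counts and time spent per activity area, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the segment format, the `max_gap` pause threshold, the rectangular area layout, and the function names `count_turns` and `time_per_area` are all assumptions made for illustration only.

```python
from collections import defaultdict


def count_turns(segments, max_gap=2.0):
    """Count speaker changes and total talk time per speaker label.

    segments: list of (start_s, end_s, label) tuples, e.g. hypothetical
    diarization output. A turn is counted when the label changes and the
    pause between the two segments is at most max_gap seconds (assumed).
    """
    ordered = sorted(segments)
    turns = 0
    talk_time = defaultdict(float)
    for i, (start, end, label) in enumerate(ordered):
        talk_time[label] += end - start
        if i > 0:
            _, prev_end, prev_label = ordered[i - 1]
            if label != prev_label and start - prev_end <= max_gap:
                turns += 1
    return turns, dict(talk_time)


def time_per_area(samples, areas):
    """Accumulate dwell time per activity area.

    samples: list of (t_s, x, y) position samples sorted by time.
    areas: dict mapping area name to (xmin, ymin, xmax, ymax); the
    rectangular layout is an assumption for this sketch.
    Each interval is credited to the area containing its starting sample.
    """
    dwell = defaultdict(float)
    for (t0, x, y), (t1, _, _) in zip(samples, samples[1:]):
        for name, (xmin, ymin, xmax, ymax) in areas.items():
            if xmin <= x < xmax and ymin <= y < ymax:
                dwell[name] += t1 - t0
                break
    return dict(dwell)


# Made-up example data, not taken from the study.
segments = [
    (0.0, 1.5, "child"),
    (1.8, 3.0, "adult"),
    (3.2, 4.0, "child"),
    (10.0, 11.0, "adult"),  # long pause before this segment: no turn counted
]
samples = [(0, 1.0, 1.0), (5, 1.5, 1.0), (10, 4.0, 1.0), (20, 4.5, 1.0)]
areas = {"books": (0, 0, 3, 3), "blocks": (3, 0, 6, 3)}

turns, talk_time = count_turns(segments)
dwell = time_per_area(samples, areas)
```

Combining the two outputs by timestamp would give turn counts per activity area, which is the kind of location-conditioned measure the abstract describes; how the study itself aligns the audio and location streams is not stated here.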

Notes

http://www.lenafoundation.org.

Acknowledgements

The authors wish to express their sincere thanks to the University of Kentucky for the joint collaboration efforts on this study. In particular, we wish to thank Ying Luo for collecting and organizing the child database used in this study.

Author information

Authors and Affiliations

Center for Robust Speech Systems, University of Texas at Dallas, 2601 N. Floyd Road, EC33, Richardson, TX, 75080-1407, USA
John H. L. Hansen, Maryam Najafian & Rasa Lileikyte
Life Span Institute, University of Kansas, Kansas City, KS, USA
Dwight Irvin
College of Education, University of Kentucky, Lexington, KY, USA
Dwight Irvin & Beth Rous

Corresponding author

Correspondence to John H. L. Hansen.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hansen, J.H.L., Najafian, M., Lileikyte, R. et al. Speech and language processing for assessing child–adult interaction based on diarization and location. Int J Speech Technol 22, 697–709 (2019). https://doi.org/10.1007/s10772-019-09590-0

Received: 09 August 2018
Accepted: 09 January 2019
Published: 05 June 2019
Issue Date: September 2019
DOI: https://doi.org/10.1007/s10772-019-09590-0
