Abstract
Speech recognition is especially challenging in student learning environments, which are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system by generating a complex audio dataset with significant cross-talk and background noise that approximates real-life classroom recordings, and we then test the proposed system on real-life recordings.
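The geometric step above reduces to measuring speaker-to-microphone distances once speaker positions have been estimated from video. A minimal sketch of that computation and of the average-percentage-error metric, assuming speakers and microphone are already expressed in a common 3D coordinate frame (the function names and the exact error metric are illustrative, not the authors'):

```python
import numpy as np

def speaker_distances(speaker_xyz, mic_xyz):
    """Euclidean distance (in meters) from each speaker to the microphone."""
    speakers = np.asarray(speaker_xyz, dtype=float)  # shape (n_speakers, 3)
    mic = np.asarray(mic_xyz, dtype=float)           # shape (3,)
    return np.linalg.norm(speakers - mic, axis=1)

def mean_distance_error_pct(estimated, ground_truth):
    """Average percentage error between estimated and measured distances."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(ground_truth, dtype=float)
    return float(np.mean(np.abs(est - ref) / ref) * 100.0)
```

Distances computed this way could then parameterize a room-acoustics simulator (e.g. pyroomacoustics, cited below) to place virtual sources and microphones when synthesizing cross-talk.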
In terms of the distance of the speakers from the microphone, our interactive video analysis system achieved an average error rate of 10.83%, compared to 33.12% for a baseline approach. Our proposed system achieved an accuracy of 27.92%, which is 1.5% higher than Google Speech-to-Text on the same dataset. On 9 important keywords, our approach achieved an average sensitivity of 38% compared to 24% for Google Speech-to-Text, while both methods maintained high average specificity (90% and 92%, respectively).
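Sensitivity and specificity here are per-keyword detection metrics: sensitivity is the fraction of segments containing a keyword in which it is recognized, and specificity the fraction of keyword-absent segments correctly left unmatched. A small self-contained sketch of how such keyword-level scores could be computed from binary presence labels (variable names are illustrative; the paper's exact evaluation protocol may differ):

```python
def keyword_scores(truth, predicted):
    """Sensitivity and specificity from binary keyword-presence labels.

    truth[i] / predicted[i]: whether the keyword is present in segment i
    according to the ground truth / the recognizer's transcript.
    """
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Averaging these scores over the 9 keywords would yield the per-method averages reported above.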
This material is based upon work supported by the National Science Foundation under Grant Nos. 1613637, 1842220, and 1949230.
References
Google Cloud Speech-to-Text API. https://cloud.google.com/speech-to-text
Brannan, D.A., Esplen, M.F., Gray, J.J.: Geometry, 2nd edn. Cambridge University Press, Cambridge (2011). https://doi.org/10.1017/CBO9781139003001
Celedón-Pattichis, S., LópezLeiva, C.A., Pattichis, M.S., Llamocca, D.: An interdisciplinary collaboration between computer engineering and mathematics/bilingual education to develop a curriculum for underrepresented middle school students. Cultural Stud. Sci. Educ. 8(4), 873–887 (2013). https://doi.org/10.1007/s11422-013-9516-5
Ephrat, A., et al.: Looking to listen at the cocktail party. ACM Trans. Graph. (2018)
Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004). ISBN: 0521540518
Jacoby, A.R., Pattichis, M.S., Celedón-Pattichis, S., LópezLeiva, C.: Context-sensitive human activity classification in collaborative learning environments. In: 2018 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), pp. 1–4, April 2018. https://doi.org/10.1109/SSIAI.2018.8470331
Jatla, V., LópezLeiva, C.: Long-term human video activity quantification of student participation. Asilomar Conference on Signals, Systems, and Computers, Invited (2021)
Jocher, G., et al.: ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervisely and YouTube integrations, April 2021. https://doi.org/10.5281/zenodo.4679653
Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. CoRR (2018)
Scheibler, R., Bezzam, E., Dokmanic, I.: Pyroomacoustics: a python package for audio room simulations and array processing algorithms. CoRR abs/1710.04196 (2017). http://arxiv.org/abs/1710.04196
Shao, S., et al.: Crowdhuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 (2018)
Shi, W., Pattichis, M.S., Celedón-Pattichis, S., LópezLeiva, C.: Person detection in collaborative group learning environments using multiple representations. Asilomar Conference on Signals, Systems, and Computers, Accepted (2021)
Shi, W., LópezLeiva, C.: Talking detection in collaborative learning environments. In: The 19th International Conference on Computer Analysis of Images and Patterns (CAIP), accepted (2021)
Shi, W., Pattichis, M.S., Celedón-Pattichis, S., LópezLeiva, C.: Dynamic group interactions in collaborative learning videos. In: 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pp. 1528–1531, October 2018
Shi, W., Pattichis, M.S., Celedón-Pattichis, S., LópezLeiva, C.: Robust head detection in collaborative learning environments using AM-FM representations. In: 2018 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), pp. 1–4, April 2018. https://doi.org/10.1109/SSIAI.2018.8470355
Shi, W.: Human Attention Detection Using AM-FM Representations. Master’s thesis, University of New Mexico (2016)
Teeparthi, S., LópezLeiva, C.: Fast hand detection in collaborative learning environments. In: The 19th International Conference on Computer Analysis of Images and Patterns (CAIP), accepted (2021)
Tran, P., LópezLeiva, C.: Facial recognition in collaborative learning videos. In: The 19th International Conference on Computer Analysis of Images and Patterns (CAIP), accepted (2021)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Sanchez Tapia, L. et al. (2021). Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data. In: Tsapatsoulis, N., Panayides, A., Theocharides, T., Lanitis, A., Pattichis, C., Vento, M. (eds) Computer Analysis of Images and Patterns. CAIP 2021. Lecture Notes in Computer Science, vol 13052. Springer, Cham. https://doi.org/10.1007/978-3-030-89128-2_8
DOI: https://doi.org/10.1007/978-3-030-89128-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89127-5
Online ISBN: 978-3-030-89128-2
eBook Packages: Computer Science, Computer Science (R0)