
Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data

  • Conference paper in Computer Analysis of Images and Patterns (CAIP 2021)

Abstract

Speech recognition is very challenging in student learning environments that are characterized by significant cross-talk and background noise. To address this problem, we present a bilingual speech recognition system that uses an interactive video analysis system to estimate the 3D speaker geometry for realistic audio simulations. We demonstrate the use of our system by generating a complex audio dataset that contains significant cross-talk and background noise, approximating real-life classroom recordings. We then test our proposed system with real-life recordings.
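To make the simulation step concrete, the following sketch shows how video-estimated 3D speaker positions can drive a room-acoustics simulation with the open-source pyroomacoustics package. This is a minimal, hypothetical illustration, not the authors' implementation: the classroom dimensions, speaker and microphone coordinates, and audio file names are all assumed values.

# Hypothetical sketch: simulate classroom cross-talk with pyroomacoustics,
# given 3D speaker positions estimated from video (all values are assumed).
import numpy as np
import pyroomacoustics as pra
from scipy.io import wavfile

fs = 16000
room_dim = [9.0, 7.0, 3.0]                         # assumed classroom size (meters)
room = pra.ShoeBox(room_dim, fs=fs, max_order=17)  # image-source reverberation model

# Speaker positions estimated by the video analysis (assumed values, in meters).
speaker_positions = [[2.0, 3.5, 1.2], [4.5, 3.0, 1.2]]
signals = [wavfile.read(name)[1].astype(np.float32)
           for name in ("speaker1.wav", "speaker2.wav")]   # hypothetical 16 kHz mono clips

for pos, sig in zip(speaker_positions, signals):
    room.add_source(pos, signal=sig)

# Single tabletop microphone at an assumed location.
mic_loc = np.array([[4.0], [3.2], [0.8]])
room.add_microphone_array(pra.MicrophoneArray(mic_loc, fs))

room.simulate()                                    # convolve sources with room impulse responses
mixture = room.mic_array.signals[0]                # cross-talk mixture observed at the microphone
wavfile.write("simulated_classroom.wav", fs,
              (mixture / np.max(np.abs(mixture))).astype(np.float32))

Overlapping speech and additional background-noise clips can be added as extra sources in the same way, which is how a cross-talk dataset of this kind is typically assembled.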

In estimating the distance of each speaker from the microphone, our interactive video analysis system achieved an average error rate of 10.83%, compared to 33.12% for a baseline approach. For speech recognition, our proposed system reached an accuracy of 27.92%, which is 1.5% better than Google Speech-to-Text on the same dataset. For nine important keywords, average sensitivity improved from 24% with Google Speech-to-Text to 38% with our approach, while both methods maintained a high average specificity (90% to 92%).
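Keyword-level sensitivity and specificity can be computed directly from aligned transcripts. The sketch below is a hypothetical illustration, not the authors' evaluation code: for each keyword, presence in the reference transcript of a segment defines the ground truth and presence in the recognizer output defines the prediction; the segment texts and keywords shown are made up.

# Hypothetical sketch: per-keyword sensitivity and specificity from transcripts.
from typing import Dict, List, Tuple

def keyword_scores(references: List[str],
                   hypotheses: List[str],
                   keywords: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {keyword: (sensitivity, specificity)} over aligned segments."""
    scores = {}
    for kw in keywords:
        tp = fp = tn = fn = 0
        for ref, hyp in zip(references, hypotheses):
            truth = kw in ref.lower().split()     # keyword spoken in this segment?
            pred = kw in hyp.lower().split()      # keyword recognized in this segment?
            if truth and pred:
                tp += 1
            elif truth:
                fn += 1
            elif pred:
                fp += 1
            else:
                tn += 1
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        scores[kw] = (sensitivity, specificity)
    return scores

# Example usage with made-up bilingual segments and keywords.
refs = ["move the triangle up", "sube el círculo", "change the color"]
hyps = ["move the triangle", "sube el circulo", "change the collar"]
print(keyword_scores(refs, hyps, ["triangle", "círculo", "color"]))

Averaging the per-keyword scores over the keyword set yields average sensitivity and specificity figures of the kind reported above.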

This material is based upon work supported by the National Science Foundation under Grant Nos. 1613637, 1842220, and 1949230.



Author information

Corresponding author: Luis Sanchez Tapia.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Sanchez Tapia, L. et al. (2021). Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data. In: Tsapatsoulis, N., Panayides, A., Theocharides, T., Lanitis, A., Pattichis, C., Vento, M. (eds) Computer Analysis of Images and Patterns. CAIP 2021. Lecture Notes in Computer Science, vol 13052. Springer, Cham. https://doi.org/10.1007/978-3-030-89128-2_8

  • DOI: https://doi.org/10.1007/978-3-030-89128-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89127-5

  • Online ISBN: 978-3-030-89128-2

  • eBook Packages: Computer Science, Computer Science (R0)
