
Beyond Conversational Discourse: A Framework for Collaborative Dialogue Analysis

Published: 21 December 2023

ABSTRACT

In collaborative scenarios, video calls not only improve understanding of the conversational content but also help group members coordinate tasks sensibly and obtain richer collaboration information. Although a recorded call can be shared and replayed, as the video grows longer or ambient noise interferes, the audio-visual information contained in the collaborative conversation becomes increasingly complex and difficult to understand, making it a significant challenge for group members to extract information about the collaboration topic. This paper starts from video calls in collaborative scenarios and analyzes the interpretable elements of the captured audio-visual dialogue information. Based on the TM-CTC model (a Transformer trained with Connectionist Temporal Classification) and the FaceNet algorithm, we construct a framework for deep audio-visual dialogue analysis in the Collaborative Working Environment (CWE). Following this framework, we build a collaborative audio-visual dialogue server, apply it in CWEs to analyze video calls, generate audio-visual dialogue elements, and present the results to team members. Finally, an application example is used to analyze the audio-visual dialogue produced in collaborative video communication. The results show that the proposed framework can improve communication efficiency in team collaboration to a certain extent.
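The TM-CTC component pairs a Transformer encoder with a CTC output layer, so fused audio-visual features can be transcribed without frame-level alignments. The sketch below illustrates that idea in PyTorch; the feature dimensions, vocabulary size, fusion by concatenation, and all hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of a TM-CTC-style recognizer: a Transformer encoder over
# fused audio-visual features, trained with a CTC objective. Assumed
# PyTorch; dimensions and fusion strategy are illustrative.
import torch
import torch.nn as nn

class AVTransformerCTC(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, d_model=256, vocab_size=40):
        super().__init__()
        # Fuse per-frame audio and video features by concatenation + projection.
        self.fuse = nn.Linear(audio_dim + video_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Per-frame character logits; index 0 is reserved for the CTC blank.
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, video_feats):
        x = self.fuse(torch.cat([audio_feats, video_feats], dim=-1))
        x = self.encoder(x)
        return self.classifier(x).log_softmax(dim=-1)

model = AVTransformerCTC()
audio = torch.randn(2, 100, 80)    # (batch, frames, audio features)
video = torch.randn(2, 100, 512)   # (batch, frames, video features)
log_probs = model(audio, video)    # (batch, frames, vocab)

# CTC aligns the unsegmented frame-level predictions with the target text.
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 40, (2, 20))         # dummy label sequences
loss = ctc(log_probs.transpose(0, 1), targets,  # CTC expects (T, N, C)
           torch.tensor([100, 100]),            # input lengths
           torch.tensor([20, 20]))              # target lengths
```

The FaceNet component maps detected faces to fixed-length embeddings so that the speaker in each segment can be matched to an enrolled team member by embedding distance. Below is a minimal sketch of that matching step, assuming embeddings have already been produced by a trained network; the 128-dimensional vectors, the `enrolled` dictionary, and the 0.9 threshold are illustrative assumptions.

```python
# Minimal sketch of FaceNet-style identification: the nearest enrolled
# embedding wins if its Euclidean distance is below a threshold.
import numpy as np

def identify(probe, enrolled, threshold=0.9):
    """Return the enrolled member nearest to `probe`, or None if too far."""
    name, dist = min(((n, np.linalg.norm(probe - e)) for n, e in enrolled.items()),
                     key=lambda pair: pair[1])
    return name if dist < threshold else None

# Synthetic unit-length embeddings standing in for real FaceNet outputs.
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
enrolled = {"alice": unit(rng.normal(size=128)),
            "bob":   unit(rng.normal(size=128))}
probe = unit(enrolled["alice"] + 0.01 * rng.normal(size=128))
print(identify(probe, enrolled))  # -> alice
```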


Published in

CSAE '23: Proceedings of the 7th International Conference on Computer Science and Application Engineering
October 2023, 358 pages
ISBN: 9798400700590
DOI: 10.1145/3627915

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 368 of 770 submissions, 48%