Beyond Conversational Discourse: A Framework for Collaborative Dialogue Analysis

ABSTRACT
In collaborative scenarios, video calls not only improve understanding of the conversation content but also help group members coordinate tasks and obtain richer collaboration information. Although members can share or replay the recorded call, as the video grows longer or ambient noise intrudes, the audio-visual information in the collaborative conversation becomes more complex and harder to understand, so extracting information about the collaboration topic is a significant challenge for group members. Starting from video calls in collaborative scenarios, this paper analyzes the interpretable elements of the captured audio-visual dialogue. Based on the TM-CTC model (a Transformer trained with a Connectionist Temporal Classification loss) and the FaceNet algorithm, we construct a framework for deep audio-visual dialogue analysis in the Collaborative Working Environment (CWE). Following this framework, we build a collaborative audio-visual dialogue server, which is applied within CWEs to analyze video calls, generate audio-visual dialogue elements, and present the results to team members. Finally, an application example is used to analyze the audio-visual dialogue produced in collaborative video communication. The results show that the proposed framework can improve communication efficiency in team collaboration to a certain extent.
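To make the recognition component concrete, the following is a minimal sketch of a TM-CTC-style audio-visual recognizer in PyTorch: per-modality frontends project audio and visual features to a shared width, the fused stream is modeled by a Transformer encoder, and a linear head is trained with a CTC loss. The layer sizes, the additive fusion, and the 40-symbol character vocabulary are illustrative assumptions, not the configuration reported in the paper.

```python
# Sketch of a TM-CTC-style audio-visual recognizer (illustrative sizes).
import torch
import torch.nn as nn

class AudioVisualCTC(nn.Module):
    def __init__(self, n_audio=80, n_video=512, d_model=256, n_classes=40):
        super().__init__()
        # Per-modality frontends project features to a shared width.
        self.audio_proj = nn.Linear(n_audio, d_model)   # e.g. log-mel frames
        self.video_proj = nn.Linear(n_video, d_model)   # e.g. lip-ROI features
        # The fused stream is modeled jointly by a Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # The CTC head emits a per-frame distribution over characters + blank.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, audio, video):
        # audio: (B, T, n_audio); video: (B, T, n_video), time-aligned.
        x = self.audio_proj(audio) + self.video_proj(video)  # additive fusion
        x = self.encoder(x)
        return self.head(x).log_softmax(dim=-1)              # (B, T, C)

# One CTC training step on dummy data (targets are character indices > 0).
model = AudioVisualCTC()
ctc = nn.CTCLoss(blank=0)
audio, video = torch.randn(2, 100, 80), torch.randn(2, 100, 512)
targets = torch.randint(1, 40, (2, 20))
log_probs = model(audio, video).transpose(0, 1)  # (T, B, C) for CTCLoss
loss = ctc(log_probs, targets,
           torch.full((2,), 100, dtype=torch.long),
           torch.full((2,), 20, dtype=torch.long))
loss.backward()
```

At inference, the per-frame distributions are decoded greedily or with a beam search over the CTC lattice. Attributing each utterance to a team member with FaceNet then reduces to nearest-neighbor matching of face embeddings against the enrolled group; a sketch, assuming 512-dimensional embeddings and an illustrative L2 acceptance threshold:

```python
# FaceNet-style speaker attribution by L2 distance (threshold is illustrative).
import torch

def identify_speaker(embedding, gallery, threshold=1.1):
    """gallery: dict mapping member name -> (512,) enrolled embedding."""
    names = list(gallery)
    dists = torch.stack([torch.dist(embedding, gallery[n]) for n in names])
    i = int(dists.argmin())
    # Accept the closest enrolled member only below the distance threshold.
    return names[i] if dists[i] < threshold else "unknown"
```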