Beyond Conversational Discourse: A Framework for Collaborative Dialogue Analysis

ABSTRACT
In collaborative scenarios, video calls not only improve understanding of the conversation content but also help group members coordinate tasks and obtain richer collaboration information. Although members can share or replay the recorded call, as the video grows longer or ambient noise intrudes, the audio-visual information in the collaborative conversation becomes more complex and harder to understand, so extracting information about the collaboration topic is a significant challenge for group members. Starting from video calls in collaborative scenarios, this paper analyzes the interpretable elements of the captured audio-visual dialogue. Based on the TM-CTC model (a Transformer trained with a Connectionist Temporal Classification loss) and the FaceNet algorithm, we construct a framework for deep audio-visual dialogue analysis in the Collaborative Working Environment (CWE). Following this framework, we build a collaborative audio-visual dialogue server, which is applied within CWEs to analyze video calls, generate audio-visual dialogue elements, and present the results to team members. Finally, an application example is used to analyze the audio-visual dialogue produced in collaborative video communication. The results show that the proposed framework can improve communication efficiency in team collaboration to a certain extent.
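To make the recognition component concrete, the following is a minimal sketch of a TM-CTC-style audio-visual recognizer in PyTorch: per-modality frontends project audio and visual features to a shared width, the fused stream is modeled by a Transformer encoder, and a linear head is trained with a CTC loss. The layer sizes, the additive fusion, and the 40-symbol character vocabulary are illustrative assumptions, not the configuration reported in the paper.

```python
# Sketch of a TM-CTC-style audio-visual recognizer (illustrative sizes).
import torch
import torch.nn as nn

class AudioVisualCTC(nn.Module):
    def __init__(self, n_audio=80, n_video=512, d_model=256, n_classes=40):
        super().__init__()
        # Per-modality frontends project features to a shared width.
        self.audio_proj = nn.Linear(n_audio, d_model)   # e.g. log-mel frames
        self.video_proj = nn.Linear(n_video, d_model)   # e.g. lip-ROI features
        # The fused stream is modeled jointly by a Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # The CTC head emits a per-frame distribution over characters + blank.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, audio, video):
        # audio: (B, T, n_audio); video: (B, T, n_video), time-aligned.
        x = self.audio_proj(audio) + self.video_proj(video)  # additive fusion
        x = self.encoder(x)
        return self.head(x).log_softmax(dim=-1)              # (B, T, C)

# One CTC training step on dummy data (targets are character indices > 0).
model = AudioVisualCTC()
ctc = nn.CTCLoss(blank=0)
audio, video = torch.randn(2, 100, 80), torch.randn(2, 100, 512)
targets = torch.randint(1, 40, (2, 20))
log_probs = model(audio, video).transpose(0, 1)  # (T, B, C) for CTCLoss
loss = ctc(log_probs, targets,
           torch.full((2,), 100, dtype=torch.long),
           torch.full((2,), 20, dtype=torch.long))
loss.backward()
```

At inference, the per-frame distributions are decoded greedily or with a beam search over the CTC lattice. Attributing each utterance to a team member with FaceNet then reduces to nearest-neighbor matching of face embeddings against the enrolled group; a sketch, assuming 512-dimensional embeddings and an illustrative L2 acceptance threshold:

```python
# FaceNet-style speaker attribution by L2 distance (threshold is illustrative).
import torch

def identify_speaker(embedding, gallery, threshold=1.1):
    """gallery: dict mapping member name -> (512,) enrolled embedding."""
    names = list(gallery)
    dists = torch.stack([torch.dist(embedding, gallery[n]) for n in names])
    i = int(dists.argmin())
    # Accept the closest enrolled member only below the distance threshold.
    return names[i] if dists[i] < threshold else "unknown"
```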