DOI: 10.1145/3607540.3617141
Research Article

Multimodal Video Captioning using Object-Auditory Information Fusion with Transformers

Published: 29 October 2023

ABSTRACT

Video captioning aims to generate natural language sentences that describe an input video. Producing coherent sentences is challenging because of the complex nature of video content: it requires object and scene understanding, extraction of object- and event-specific auditory information, and modeling of the relationships among objects. In this study, we address the efficient modeling of object interactions in scenes, since these interactions carry crucial information about the events in the visual scene. To this end, we propose to use object features together with auditory information to better model the audio-visual scene in the video. Specifically, we extract object features with Faster R-CNN and auditory features with VGGish, and we design a transformer encoder-decoder architecture in a multimodal setup. Experiments on MSR-VTT show encouraging results and indicate that object features, combined with auditory information, model object interactions better than ResNet features.
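As a rough illustration of the described setup, the sketch below shows how pre-extracted object features (e.g., Faster R-CNN region features) and auditory features (e.g., VGGish segment embeddings) can be projected into a shared space, concatenated, and passed to a transformer encoder-decoder that produces caption tokens. This is a minimal PyTorch sketch based only on the abstract; the feature dimensions, concatenation-based fusion, and all hyperparameters are assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' code): fuse object and auditory features
# in a transformer encoder-decoder for caption generation.
# Feature dimensions, fusion by concatenation, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class MultimodalCaptioner(nn.Module):
    def __init__(self, obj_dim=2048, aud_dim=128, d_model=512,
                 vocab_size=10000, nhead=8, num_layers=4):
        super().__init__()
        # Project each modality into the shared model dimension.
        self.obj_proj = nn.Linear(obj_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, obj_feats, aud_feats, caption_tokens):
        # obj_feats: (B, N_obj, obj_dim); aud_feats: (B, N_aud, aud_dim)
        # Fuse the modalities by concatenating their projected token sequences.
        memory = torch.cat([self.obj_proj(obj_feats),
                            self.aud_proj(aud_feats)], dim=1)
        memory = self.encoder(memory)
        # Decode caption tokens with a causal mask
        # (positional encodings omitted for brevity).
        tgt = self.embed(caption_tokens)
        seq_len = caption_tokens.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        dec = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(dec)  # (B, T, vocab_size) token logits


if __name__ == "__main__":
    model = MultimodalCaptioner()
    obj = torch.randn(2, 36, 2048)           # detected-object features per video
    aud = torch.randn(2, 10, 128)            # VGGish audio-segment embeddings
    caps = torch.randint(0, 10000, (2, 12))  # shifted ground-truth caption tokens
    print(model(obj, aud, caps).shape)       # torch.Size([2, 12, 10000])
```

The snippet only illustrates the general flow of an encoder-decoder over fused object and audio tokens; the paper's actual fusion strategy, positional encodings, and training details may differ.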


Published in

NarSUM '23: Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos
October 2023, 82 pages
ISBN: 9798400702778
DOI: 10.1145/3607540
Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

