
Youtube Engagement Analytics via Deep Multimodal Fusion Model

  • Conference paper
  • First Online:
Image and Video Technology (PSIVT 2022)

Abstract

With the growing popularity of video-sharing platforms, content creators face strong demand to produce content that attracts large numbers of viewers. Many factors relate to engagement: visuals, sound, transcript, title, and more. To account for these factors, we propose a deep multimodal hybrid fusion model for YouTube video engagement. Our architecture makes it easy to adapt state-of-the-art models to a particular task or set of modalities and then fuse them to obtain richer information for better classification. A proposed residual block, acting as a simple neural architecture search, is used to extract better features. Our work is at the forefront of classifying YouTube video engagement and promises to broaden the research community’s reach. Through detailed experiments, we show that the model achieves state-of-the-art results on the YouTube video engagement analytics problem.
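To make the hybrid fusion idea concrete, below is a minimal sketch assuming PyTorch: pre-extracted per-modality features (e.g., from a visual backbone, an audio CNN, and a text encoder) are projected to a shared size, concatenated, passed through a residual fusion block, and classified. The module names, feature dimensions, and the three-class output are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a deep multimodal hybrid fusion classifier.
# Assumption: PyTorch; dimensions and class count are illustrative only.
import torch
import torch.nn as nn


class ResidualFusionBlock(nn.Module):
    """Residual MLP block applied to the fused feature vector."""

    def __init__(self, dim: int, hidden: int = 512, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection lets the block fall back to (near) identity
        # when the extra transformation does not help.
        return self.norm(x + self.net(x))


class HybridFusionClassifier(nn.Module):
    """Projects per-modality features to a shared size, concatenates them,
    and classifies the fused representation."""

    def __init__(self, modality_dims: dict, shared_dim: int = 256,
                 num_classes: int = 3):
        super().__init__()
        # One projection per modality (visual, audio, title, transcript, ...).
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in modality_dims.items()}
        )
        fused_dim = shared_dim * len(modality_dims)
        self.fusion = ResidualFusionBlock(fused_dim)
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, features: dict) -> torch.Tensor:
        parts = [torch.relu(self.proj[name](x)) for name, x in features.items()]
        fused = torch.cat(parts, dim=-1)
        return self.head(self.fusion(fused))


if __name__ == "__main__":
    # Dummy pre-extracted features standing in for modality-specific encoders.
    dims = {"visual": 1280, "audio": 128, "title": 768}
    model = HybridFusionClassifier(dims, num_classes=3)
    batch = {name: torch.randn(4, d) for name, d in dims.items()}
    print(model(batch).shape)  # torch.Size([4, 3])
```

In this sketch the modality-specific encoders are assumed to be frozen feature extractors; swapping one out only changes the corresponding input dimension, which reflects the plug-and-play property the abstract describes.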

Acknowledgment

This research is supported by research funding from Faculty of Information Technology, University of Science, Vietnam National University - Ho Chi Minh City, and Gender & Diversity Project - APNIC Foundation.

Author information

Corresponding author

Correspondence to Huy Tien Nguyen.


Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Nguyen-Thi, MV., Le, H., Le, T., Le, T., Nguyen, H.T. (2023). Youtube Engagement Analytics via Deep Multimodal Fusion Model. In: Wang, H., et al. Image and Video Technology. PSIVT 2022. Lecture Notes in Computer Science, vol 13763. Springer, Cham. https://doi.org/10.1007/978-3-031-26431-3_5

  • DOI: https://doi.org/10.1007/978-3-031-26431-3_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26430-6

  • Online ISBN: 978-3-031-26431-3

  • eBook Packages: Computer Science, Computer Science (R0)
