DCA: Diversified Co-attention Towards Informative Live Video Commenting

  • Conference paper
  • In: Natural Language Processing and Chinese Computing (NLPCC 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12431)

Abstract

We focus on the task of Automatic Live Video Commenting (ALVC), which aims to generate real-time video comments with both video frames and other viewers’ comments as inputs. A major challenge in this task is how to properly leverage the rich and diverse information carried by video and text. In this paper, we aim to collect diversified information from video and text for informative comment generation. To achieve this, we propose a Diversified Co-Attention (DCA) model for this task. Our model builds bidirectional interactions between video frames and surrounding comments from multiple perspectives via metric learning, to collect a diversified and informative context for comment generation. We also propose an effective parameter orthogonalization technique to avoid excessive overlap of information learned from different perspectives. Results show that our approach outperforms existing methods in the ALVC task, achieving new state-of-the-art results.
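Only the abstract is available on this page, so as a rough illustration (not the authors' implementation), the two core ideas can be sketched in NumPy: (1) attention between video frames and comment tokens scored under several learned bilinear metrics, one per "perspective", in the spirit of metric learning; and (2) a soft penalty that pushes the perspective parameter matrices toward mutual orthogonality so they do not learn overlapping information. All function names, shapes, and the exact form of the metric are hypothetical.

```python
import numpy as np

def perspective_attention(V, T, Ws):
    """For each perspective k, score frame-token pairs with a learned
    bilinear metric W_k: S_k = V @ W_k @ T.T, softmax over tokens, and
    return text-aware frame representations A_k @ T."""
    outputs = []
    for W in Ws:
        S = V @ W @ T.T                       # (n_frames, n_tokens) similarities
        A = np.exp(S - S.max(axis=1, keepdims=True))
        A = A / A.sum(axis=1, keepdims=True)  # attention weights over tokens
        outputs.append(A @ T)                 # (n_frames, d) attended text context
    return outputs

def orthogonality_penalty(Ws):
    """Soft constraint: flatten and L2-normalize each perspective matrix,
    then penalize deviation of their Gram matrix from the identity, so
    distinct perspectives stay (near-)orthogonal."""
    flat = np.stack([W.ravel() for W in Ws])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    G = flat @ flat.T                         # pairwise cosine similarities
    return np.sum((G - np.eye(len(Ws))) ** 2)
```

In training, the penalty would be added to the generation loss with a small weight; the paper's actual "parameter orthogonalization technique" may differ (e.g. a hard re-projection rather than a soft penalty).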


Notes

  1. We concatenate all surrounding comments into a single sequence \(\textit{\textbf{x}}\).

  2. https://github.com/lancopku/livebot.

  3. https://www.bilibili.com.


Author information

Corresponding author

Correspondence to Zhihan Zhang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Z., Yin, Z., Ren, S., Li, X., Li, S. (2020). DCA: Diversified Co-attention Towards Informative Live Video Commenting. In: Zhu, X., Zhang, M., Hong, Y., He, R. (eds) Natural Language Processing and Chinese Computing. NLPCC 2020. Lecture Notes in Computer Science, vol 12431. Springer, Cham. https://doi.org/10.1007/978-3-030-60457-8_1

  • DOI: https://doi.org/10.1007/978-3-030-60457-8_1
  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60456-1

  • Online ISBN: 978-3-030-60457-8

  • eBook Packages: Computer Science, Computer Science (R0)
