Learning Trailer Moments in Full-Length Movies with Co-Contrastive Attention

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12363)

Abstract

A movie’s key moments stand out from the screenplay, grabbing the audience’s attention and making movie browsing efficient. However, the lack of annotations makes existing approaches inapplicable to movie key-moment detection. To remove the need for human annotation, we leverage officially released trailers as weak supervision to learn a model that detects key moments in full-length movies. We introduce a novel ranking network that uses the Co-Attention between movies and trailers as guidance to generate training pairs, where moments highly correlated with the trailer are expected to score higher than uncorrelated moments. Additionally, we propose a Contrastive Attention module that enhances the feature representations such that the contrast between features of key and non-key moments is maximized. We construct the first movie-trailer dataset, and the proposed Co-Attention-assisted ranking network shows superior performance even over a supervised approach (the term “supervised” refers to an approach with access to manual ground-truth annotations for training). The effectiveness of our Contrastive Attention module is also demonstrated by performance improvements over the state of the art on public benchmarks.
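To make the two ideas above concrete, the following is a minimal PyTorch sketch (PyTorch is the framework named in the notes below), not the authors' implementation: it shows (i) co-attention scoring of movie shots against trailer shots, (ii) a pairwise margin ranking loss over key/non-key pairs, and (iii) a simple similarity penalty standing in for the Contrastive Attention objective. All function names, feature shapes, and the top-k pair-selection heuristic are illustrative assumptions.

```python
# Minimal sketch, not the paper's released code. Function names, shapes, and
# the top-k pair selection are illustrative assumptions.
import torch
import torch.nn.functional as F

def co_attention_scores(movie_feats, trailer_feats):
    """Score each movie shot by its attention-weighted similarity to the trailer.

    movie_feats:   (M, D) features of M movie shots (e.g., from a 3D CNN)
    trailer_feats: (T, D) features of T trailer shots
    Returns an (M,) correlation score per movie shot.
    """
    movie = F.normalize(movie_feats, dim=1)
    trailer = F.normalize(trailer_feats, dim=1)
    sim = movie @ trailer.t()          # (M, T) cosine similarities
    attn = sim.softmax(dim=1)          # each movie shot attends over trailer shots
    return (attn * sim).sum(dim=1)     # attention-weighted similarity per shot

def pairwise_ranking_loss(scores, key_idx, nonkey_idx, margin=1.0):
    """Key (trailer-correlated) moments should outscore non-key moments by `margin`."""
    pos, neg = scores[key_idx], scores[nonkey_idx]
    return F.margin_ranking_loss(pos, neg, torch.ones_like(pos), margin=margin)

def contrast_penalty(key_feats, nonkey_feats):
    """One way (an assumption, not the paper's exact formulation) to maximize the
    contrast between key and non-key features: penalize their mean cosine similarity."""
    k = F.normalize(key_feats, dim=1)
    n = F.normalize(nonkey_feats, dim=1)
    return (k @ n.t()).mean()

# Illustrative usage. In the paper, the ranked scores come from a trainable
# ranking network; here the co-attention scores stand in to keep the sketch short.
movie = torch.randn(200, 512)          # 200 movie shots, 512-d features
trailer = torch.randn(30, 512)         # 30 trailer shots
corr = co_attention_scores(movie, trailer)
key = corr.topk(20).indices            # pseudo-key: most trailer-correlated shots
nonkey = (-corr).topk(20).indices      # pseudo-non-key: least correlated shots
loss = pairwise_ranking_loss(corr, key, nonkey) \
       + contrast_penalty(movie[key], movie[nonkey])
```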

L. Wang—This work was done when Lezi Wang worked as an intern at Netflix.

R. Puri—This work was done when Rohit Puri was with Netflix.

Notes

  1. https://en.wikipedia.org/wiki/Trailer_(promotion).

  2. https://pytorch.org/.

  3. Our Co-Attention module is not applicable to the VHD task, since VHD contains no video pairs analogous to the trailer-movie pairs in MTMD.

Author information

Corresponding author

Correspondence to Lezi Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 63246 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, L., Liu, D., Puri, R., Metaxas, D.N. (2020). Learning Trailer Moments in Full-Length Movies with Co-Contrastive Attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12363. Springer, Cham. https://doi.org/10.1007/978-3-030-58523-5_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-58523-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58522-8

  • Online ISBN: 978-3-030-58523-5

  • eBook Packages: Computer Science, Computer Science (R0)
