
Improving the Quality of Video-to-Language Models by Optimizing Annotation of the Training Material

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10704)

Abstract

Automatic video captioning is one of the ultimate challenges of Natural Language Processing, and interest in it has been boosted by the omnipresence of video and the release of large-scale annotated video benchmarks. However, the specificity and quality of the captions in these benchmarks vary considerably, which has an adverse effect on the quality of the trained captioning models. In this work, we address this issue by proposing automatic strategies for optimizing the annotations of video material: removing annotations that are not semantically relevant and generating new, more informative captions. We evaluate our approach on the MSR-VTT challenge with a state-of-the-art deep learning video-to-language model. Our code is available at https://github.com/lpmayos/mcv_thesis.
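
To make the first strategy concrete (removing annotations that are not semantically relevant), here is a minimal sketch of one plausible instantiation: score each caption against the centroid of the embeddings of all captions for the same video and drop the low-scoring ones. This is an illustration only, not the authors' actual pipeline (see the linked repository for that); the function name, the embedding input, and the threshold are all assumptions.

    import numpy as np

    def filter_captions(captions, embeddings, threshold=0.5):
        """Keep captions whose embedding is cosine-similar to the centroid
        of all captions annotated for the same video; the rest are treated
        as semantically irrelevant and dropped. The threshold is illustrative,
        not taken from the paper."""
        X = np.asarray(embeddings, dtype=float)
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize each row
        centroid = X.mean(axis=0)
        centroid /= np.linalg.norm(centroid)            # unit-normalize the centroid
        sims = X @ centroid                             # cosine similarity per caption
        return [c for c, s in zip(captions, sims) if s >= threshold]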


Notes

  1. https://visionlearninggroup.github.io/caption-guided-saliency.

  2. Note that the 1.5 × IQR rule, where IQR denotes the interquartile range, is a standard criterion for flagging suspected outliers (illustrated in the sketch below).
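
As a concrete illustration of the 1.5 × IQR rule, the sketch below flags suspected outliers among a set of scores; the scores are made up for the example, not taken from the paper.

    import numpy as np

    def iqr_outlier_mask(values, k=1.5):
        """Tukey's rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR]
        as suspected outliers; k = 1.5 matches footnote 2 above."""
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        return (values < q1 - k * iqr) | (values > q3 + k * iqr)

    # Hypothetical per-caption relevance scores for one video:
    scores = np.array([0.82, 0.79, 0.85, 0.80, 0.12, 0.78])
    print(iqr_outlier_mask(scores))  # [False False False False  True False]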


Acknowledgment

This work was partly supported by the Spanish Ministry of Economy and Competitiveness under the Ramon y Cajal fellowship programme, and by the Kristina project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 645012. The Titan X GPU used for this research was donated by the NVIDIA Corporation.

Author information


Corresponding author

Correspondence to Laura Pérez-Mayos.



Copyright information

© 2018 Springer International Publishing AG

About this paper


Cite this paper

Pérez-Mayos, L., Sukno, F.M., Wanner, L. (2018). Improving the Quality of Video-to-Language Models by Optimizing Annotation of the Training Material. In: Schoeffmann, K., et al. (eds.) MultiMedia Modeling. MMM 2018. Lecture Notes in Computer Science, vol. 10704. Springer, Cham. https://doi.org/10.1007/978-3-319-73603-7_23


  • DOI: https://doi.org/10.1007/978-3-319-73603-7_23


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73602-0

  • Online ISBN: 978-3-319-73603-7

  • eBook Packages: Computer Science, Computer Science (R0)
