
Improving the Quality of Video-to-Language Models by Optimizing Annotation of the Training Material

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10704)

Abstract

Automatic video captioning is one of the ultimate challenges of Natural Language Processing, and interest in it has been boosted by the omnipresence of video and the release of large-scale annotated video benchmarks. However, the specificity and quality of the captions in these benchmarks vary considerably, which has an adverse effect on the quality of the trained captioning models. In this work, we address this issue by proposing automatic strategies for optimizing the annotations of video material: removing annotations that are not semantically relevant and generating new, more informative captions. We evaluate our approach on the MSR-VTT challenge with a state-of-the-art deep learning video-to-language model. Our code is available at https://github.com/lpmayos/mcv_thesis.
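
To make the first strategy concrete (removing annotations that are not semantically relevant), here is a minimal sketch of one plausible instantiation: score each caption against the centroid of the embeddings of all captions for the same video and drop the low-scoring ones. This is an illustration only, not the authors' actual pipeline (see the linked repository for that); the function name, the embedding input, and the threshold are all assumptions.

    import numpy as np

    def filter_captions(captions, embeddings, threshold=0.5):
        """Keep captions whose embedding is cosine-similar to the centroid
        of all captions annotated for the same video; the rest are treated
        as semantically irrelevant and dropped. The threshold is illustrative,
        not taken from the paper."""
        X = np.asarray(embeddings, dtype=float)
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize each row
        centroid = X.mean(axis=0)
        centroid /= np.linalg.norm(centroid)            # unit-normalize the centroid
        sims = X @ centroid                             # cosine similarity per caption
        return [c for c, s in zip(captions, sims) if s >= threshold]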


Notes

  1. https://visionlearninggroup.github.io/caption-guided-saliency.

  2. Note that the 1.5 × IQR rule, where IQR denotes the interquartile range, is a standard criterion for flagging suspected outliers (illustrated in the sketch below).
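
As a concrete illustration of the 1.5 × IQR rule, the sketch below flags suspected outliers among a set of scores; the scores are made up for the example, not taken from the paper.

    import numpy as np

    def iqr_outlier_mask(values, k=1.5):
        """Tukey's rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR]
        as suspected outliers; k = 1.5 matches footnote 2 above."""
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        return (values < q1 - k * iqr) | (values > q3 + k * iqr)

    # Hypothetical per-caption relevance scores for one video:
    scores = np.array([0.82, 0.79, 0.85, 0.80, 0.12, 0.78])
    print(iqr_outlier_mask(scores))  # [False False False False  True False]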


Acknowledgment

This work was partly supported by the Spanish Ministry of Economy and Competitiveness under the Ramon y Cajal fellowship programme, and by the Kristina project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 645012. The Titan X GPU used for this research was donated by the NVIDIA Corporation.

Author information


Corresponding author

Correspondence to Laura Pérez-Mayos.



Copyright information

© 2018 Springer International Publishing AG

About this paper


Cite this paper

Pérez-Mayos, L., Sukno, F.M., Wanner, L. (2018). Improving the Quality of Video-to-Language Models by Optimizing Annotation of the Training Material. In: Schoeffmann, K., et al. (eds.) MultiMedia Modeling. MMM 2018. Lecture Notes in Computer Science, vol. 10704. Springer, Cham. https://doi.org/10.1007/978-3-319-73603-7_23


  • DOI: https://doi.org/10.1007/978-3-319-73603-7_23


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73602-0

  • Online ISBN: 978-3-319-73603-7

  • eBook Packages: Computer Science, Computer Science (R0)
