Gated Recurrent Capsules for Visual Word Embeddings

  • Conference paper
  • Published in: MultiMedia Modeling (MMM 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11296)

Abstract

The caption retrieval task can be defined as follows: given a set of images I and a set of describing sentences S, for each image i in I we want to find the sentence in S that best describes i. The most common approach to this problem is to build a multimodal space and to map each image and each sentence into that space, so that they can be compared directly. A non-conventional model called Word2VisualVec has recently been proposed: instead of mapping images and sentences into a multimodal space, it maps sentences directly into a space of visual features. Recent advances in the computation of visual features suggest that such an approach is promising. In this paper, we propose a new recurrent neural network model following this unconventional approach, based on Gated Recurrent Capsules (GRCs), which we design as an extension of Gated Recurrent Units (GRUs). We show that GRCs outperform GRUs on the caption retrieval task, and we argue that GRCs hold great potential for other applications as well.
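
To make this setup concrete, the following is a minimal sketch, not the authors' implementation, of the general recipe the abstract describes: a plain GRU encodes a sentence into the visual-feature space, and candidate captions are ranked by cosine similarity to an image feature. The paper's GRC would replace the GRU cell shown here; all dimensions, weights, and toy data below are illustrative assumptions.

```python
# Sketch of sentence-to-visual-feature caption retrieval (NOT the authors'
# code): encode a sentence with a standard GRU, project the final hidden
# state into the visual-feature space, rank captions by cosine similarity.
import numpy as np

def gru_step(x, h, p):
    """One standard GRU step; p holds the six weight matrices."""
    Wz, Uz, Wr, Ur, Wh, Uh = p
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x + Uz @ h)                # update gate
    r = sig(Wr @ x + Ur @ h)                # reset gate
    h_new = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_new

def encode_sentence(word_vecs, p, W_out):
    """Run the GRU over word embeddings (e.g. word2vec vectors) and
    project the final state into the visual-feature space."""
    h = np.zeros(p[1].shape[0])
    for x in word_vecs:
        h = gru_step(x, h, p)
    return W_out @ h

def rank_captions(image_feat, caption_vecs):
    """Return caption indices sorted by cosine similarity to the image."""
    sims = caption_vecs @ image_feat / (
        np.linalg.norm(caption_vecs, axis=1) * np.linalg.norm(image_feat) + 1e-8)
    return np.argsort(-sims)

# Toy usage with random weights, purely to show the shapes involved.
rng = np.random.default_rng(0)
d_word, d_hid, d_vis = 300, 512, 2048   # assumed dimensions
p = [rng.standard_normal((d_hid, d)) * 0.01
     for d in (d_word, d_hid, d_word, d_hid, d_word, d_hid)]
W_out = rng.standard_normal((d_vis, d_hid)) * 0.01
captions = [[rng.standard_normal(d_word) for _ in range(n)] for n in (5, 7, 9)]
caption_vecs = np.stack([encode_sentence(c, p, W_out) for c in captions])
image_feat = rng.standard_normal(d_vis)
print(rank_captions(image_feat, caption_vecs))  # best-matching caption first
```

In the full model, the sentence encoder is trained so that its output lands close to the feature extracted from the matching image by a pretrained CNN, which is why retrieval reduces to a nearest-neighbour search in the visual space.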

References

  1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI, vol. 16, pp. 265–283, November 2016

  2. Cho, K., van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. Syntax Semant. Struct. Stat. Transl. 103 (2014)

  3. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255. IEEE, June 2009

  4. Dong, J., Li, X., Snoek, C.G.: Word2VisualVec: image and video to sentence matching by visual feature prediction. arXiv preprint arXiv:1604.06838 (2016)

  5. Dong, J., Huang, S., Xu, D., Tao, D.: DL-61-86 at TRECVID 2017: video-to-text description. In: TRECVID 2017 Workshop (2017)

  6. Dong, J., Li, X., Snoek, C.G.: Predicting visual features from text for image and video caption retrieval. IEEE Trans. Multimedia (2018)

  7. Faghri, F., Fleet, D.J., Kiros, R., Fidler, S.: VSE++: improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612 (2017)

  8. Francis, D., Huet, B., Merialdo, B.: Embedding images and sentences in a common space with a recurrent capsule network. In: Proceedings of the 16th International Workshop on Content-Based Multimedia Indexing. IEEE, September 2018

  9. Gu, J., et al.: Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  11. Hinton, G.E., Sabour, S., Frosst, N.: Matrix capsules with EM routing. In: International Conference on Learning Representations (2018)

  12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  13. Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 1889–1897. MIT Press, December 2014

  14. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)

  15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  16. Lee, K. H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. arXiv preprint arXiv:1803.08024 (2018)

  17. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

  19. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)

  20. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, pp. 3856–3866 (2017)

  21. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)

  22. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. In: COURSERA: Neural Networks for Machine Learning (2012)

  23. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)

  24. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems, pp. 3320–3328 (2014)

Acknowledgments

One of the Titan Xp GPUs used for this research was donated by the NVIDIA Corporation. This work was partially funded by the ANR (the French National Research Agency) through the GAFES project and by the European H2020 research and innovation programme through the MeMAD project (GA 780069).

Author information

Correspondence to Benoit Huet.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Francis, D., Huet, B., Merialdo, B. (2019). Gated Recurrent Capsules for Visual Word Embeddings. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, W.-H., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol. 11296. Springer, Cham. https://doi.org/10.1007/978-3-030-05716-9_23

  • DOI: https://doi.org/10.1007/978-3-030-05716-9_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05715-2

  • Online ISBN: 978-3-030-05716-9

  • eBook Packages: Computer Science (R0)
