Cross-Modal Event Retrieval: A Dataset and a Baseline Using Deep Semantic Learning

Situ, Runwei; Yang, Zhenguo; Lv, Jianming; Li, Qing; Liu, Wenyin

doi:10.1007/978-3-030-00767-6_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11165)

Included in the following conference series:

Pacific Rim Conference on Multimedia

Abstract

In this paper, we propose to learn a Deep Semantic Space (DSS) for cross-modal event retrieval by exploiting deep learning models to jointly extract semantic features from images and textual articles. More specifically, a VGG network is used to transfer deep semantic knowledge from a large-scale image dataset to the target image dataset. Simultaneously, a fully-connected network is designed to model semantic representations from textual features (e.g., TF-IDF, LDA). The resulting deep semantic representations of images and text are then mapped into a high-level semantic space, in which the distance between data samples can be measured directly for cross-modal event retrieval. In addition, we collect a dataset, the Wiki-Flickr event dataset, for cross-modal event retrieval, in which the data are weakly aligned, unlike the image-text pairs in existing cross-modal retrieval datasets. Extensive experiments conducted on both the Pascal Sentence dataset and our Wiki-Flickr event dataset show that our DSS outperforms the state-of-the-art approaches.
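
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature sizes, the single linear layer standing in for each network, the softmax over event classes as the shared semantic space, and the Euclidean distance are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_IMAGES, N_TEXTS = 5, 4
IMG_DIM, TXT_DIM, N_EVENTS = 4096, 300, 10

# Stand-ins for real inputs: CNN image features and TF-IDF/LDA text features.
image_feats = rng.normal(size=(N_IMAGES, IMG_DIM))
text_feats = rng.random(size=(N_TEXTS, TXT_DIM))

# Untrained random projections; one linear layer stands in for each network.
W_img = rng.normal(scale=0.01, size=(IMG_DIM, N_EVENTS))
W_txt = rng.normal(scale=0.01, size=(TXT_DIM, N_EVENTS))


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def to_semantic_space(feats, weights):
    """Map modality-specific features to a distribution over event classes."""
    return softmax(feats @ weights)


def retrieve(query_vec, gallery):
    """Return gallery indices sorted by Euclidean distance to the query."""
    dists = np.linalg.norm(gallery - query_vec, axis=1)
    return np.argsort(dists)


img_sem = to_semantic_space(image_feats, W_img)
txt_sem = to_semantic_space(text_feats, W_txt)

# Text-to-image retrieval: rank all images for the first text query.
ranking = retrieve(txt_sem[0], img_sem)
print(ranking)
```

Because both modalities end up in the same low-dimensional space, image-to-text retrieval uses the same `retrieve` call with the roles of `img_sem` and `txt_sem` swapped.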

Acknowledgments

The authors would like to thank Zehang Lin and Feitao Huang for data collection. This work is supported by the National Natural Science Foundation of China (No. 61703109, No. 91748107, No. U1611461), the Guangdong Innovative Research Team Program (No. 2014ZT05G157), the Science and Technology Program of Guangdong Province, China (No. 2016A010101012), the CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China (No. CASNDST201703), and an internal grant from City University of Hong Kong (Project No. 9610367).

Author information

Authors and Affiliations

School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
Runwei Situ, Zhenguo Yang & Wenyin Liu
School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
Jianming Lv
Department of Computer Science, City University of Hong Kong, Hong Kong, China
Qing Li

Corresponding authors

Correspondence to Zhenguo Yang or Wenyin Liu.

Editor information

Editors and Affiliations

Hefei University of Technology, Hefei, China
Richang Hong
National Chiao Tung University, Hsinchu, Taiwan
Wen-Huang Cheng
University of Tokyo, Tokyo, Japan
Toshihiko Yamasaki
Hefei University of Technology, Hefei, China
Meng Wang
City University of Hong Kong, Hong Kong, Hong Kong
Chong-Wah Ngo

About this paper

Cite this paper

Situ, R., Yang, Z., Lv, J., Li, Q., Liu, W. (2018). Cross-Modal Event Retrieval: A Dataset and a Baseline Using Deep Semantic Learning. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science, vol 11165. Springer, Cham. https://doi.org/10.1007/978-3-030-00767-6_14

DOI: https://doi.org/10.1007/978-3-030-00767-6_14
Published: 19 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00766-9
Online ISBN: 978-3-030-00767-6
eBook Packages: Computer Science, Computer Science (R0)
