
ACMNet: Adaptive Confidence Matching Network for Human Behavior Analysis via Cross-modal Retrieval

Published: 17 April 2020

Abstract

Cross-modality human behavior analysis has attracted much attention from both academia and industry. In this article, we focus on cross-modality image-text retrieval for human behavior analysis, which learns a common latent space for data from different modalities and thus benefits the understanding of human behavior across modalities. Existing state-of-the-art cross-modality image-text retrieval models tend to be fine-grained region-word matching approaches: they first measure a similarity for each image region or text word and then aggregate these similarities to estimate the global image-text similarity. However, such fine-grained approaches often suffer from a similarity bias problem, because they compute similarities only between an image region and its matched text words (or between a text word and its matched image regions) while ignoring unmatched words/regions, which may still be salient enough to affect the global image-text similarity. To address this bias, we propose the Adaptive Confidence Matching Network (ACMNet), which is also a fine-grained matching approach. Apart from calculating a local similarity for each region (/word) with its matched words (/regions), ACMNet introduces a confidence score for that local similarity by leveraging the global text (/image) information, which measures the semantic relatedness of the region (/word) to the whole text (/image). ACMNet then incorporates these confidence scores together with the local similarities when estimating the global image-text similarity. To verify the effectiveness of ACMNet, we conduct extensive experiments and compare with state-of-the-art methods on two benchmark datasets, Flickr30k and MS COCO. Experimental results show that ACMNet outperforms the state-of-the-art methods by a clear margin, demonstrating its effectiveness for human behavior analysis and the value of tackling the similarity bias issue.
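
The abstract describes a two-part matching scheme: a local region-word (or word-region) similarity, plus a confidence score derived from the global text (or image) that weights each local similarity when they are aggregated into the global image-text score. The PyTorch sketch below is a minimal illustration of that idea for the image-to-text direction only; the attention temperature tau, the sigmoid confidence gate, and the use of the mean word vector as the global text representation are assumptions made for illustration, not the paper's actual formulation.

    import torch
    import torch.nn.functional as F

    def acm_similarity(regions, words, tau=9.0):
        # regions: (n_r, d) region features of one image
        # words:   (n_w, d) word features of one sentence
        r = F.normalize(regions, dim=-1)
        w = F.normalize(words, dim=-1)

        # 1) Region-word affinities; attend over words to obtain, for each
        #    region, a text vector built from its matched words.
        attn = F.softmax(tau * (r @ w.t()), dim=1)                    # (n_r, n_w)
        attended_text = attn @ w                                      # (n_r, d)

        # 2) Local similarity of each region to its matched words.
        local_sim = F.cosine_similarity(r, attended_text, dim=-1)     # (n_r,)

        # 3) Confidence of each region, derived from the *global* text
        #    (mean word vector + sigmoid gate -- an illustrative choice).
        global_text = F.normalize(w.mean(dim=0, keepdim=True), dim=-1)
        confidence = torch.sigmoid(tau * (r @ global_text.t()).squeeze(1))

        # 4) Global image-text similarity: confidence-weighted aggregation
        #    of the local similarities rather than a plain average.
        return (confidence * local_sim).sum() / confidence.sum().clamp_min(1e-6)

The symmetric word-to-image direction and a learned confidence function would follow the same pattern; the sketch only shows where the confidence scores enter the aggregation.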



      Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 1s
      Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging
      January 2020
      376 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3388236
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 April 2020
      Accepted: 01 September 2019
      Revised: 01 July 2019
      Received: 01 April 2019
      Published in TOMM Volume 16, Issue 1s

      Author Tags

      1. Cross-modality retrieval
      2. adaptive confidence matching network
      3. human behavior analysis
      4. image-text retrieval

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Key R&D Program of China
      • National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data (PSRPC)
      • National Natural Science Foundation of China
