
Deep Attentive Multimodal Network Representation Learning for Social Media Images

Published: 16 June 2021

Abstract

The analysis of social networks, such as the socially connected Internet of Things, has shown the deep influence of intelligent information processing technology on industrial systems for Smart Cities. The goal of social media representation learning is to learn dense, low-dimensional, and continuous representations for the multimodal data within social networks, facilitating many real-world applications. Since social media images are usually accompanied by rich metadata (e.g., textual descriptions, tags, groups, and submitting users), modeling the image alone cannot capture the comprehensive information carried by social media images. In this work, we treat the image and its textual description as multimodal content and transform the remaining metadata into links between contents (e.g., two images marked by the same tag or submitted by the same user). Based on the multimodal content and social links, we propose a Deep Attentive Multimodal Graph Embedding model, named DAMGE, for more effective social image representation learning. We conduct extensive experiments on both small- and large-scale datasets, and the results confirm the superiority of the proposed model on the tasks of social image classification and link prediction.
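
To make the setup above concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how a social-image graph might be assembled from metadata and how an attention-weighted neighbor aggregation could produce node embeddings. All names, feature dimensions, and the toy `posts` data are illustrative assumptions; DAMGE learns its attentive multimodal fusion and graph embedding jointly rather than using the fixed dot-product weighting shown here.

```python
# Hypothetical sketch: build links between social images from shared tags/users,
# fuse image and text features per node, and aggregate neighbors with a simple
# similarity-based attention. Toy data and dimensions are assumptions.
import numpy as np
from itertools import combinations
from collections import defaultdict

rng = np.random.default_rng(0)

# Each post carries an image feature, a text feature, and metadata (tags, user).
posts = [
    {"id": 0, "img": rng.random(4), "txt": rng.random(4), "tags": {"beach"}, "user": "u1"},
    {"id": 1, "img": rng.random(4), "txt": rng.random(4), "tags": {"beach", "sunset"}, "user": "u2"},
    {"id": 2, "img": rng.random(4), "txt": rng.random(4), "tags": {"city"}, "user": "u1"},
]

# Link two posts if they share a tag or were submitted by the same user.
edges = set()
for a, b in combinations(posts, 2):
    if a["tags"] & b["tags"] or a["user"] == b["user"]:
        edges.add((a["id"], b["id"]))

# Fuse image and text features per node (concatenation as a stand-in for
# learned attentive multimodal fusion).
feats = {p["id"]: np.concatenate([p["img"], p["txt"]]) for p in posts}

# Adjacency lists from the undirected edge set.
neighbors = defaultdict(set)
for i, j in edges:
    neighbors[i].add(j)
    neighbors[j].add(i)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(node_id):
    """One attentive aggregation step: weight each neighbor by its similarity
    to the center node, then add the weighted neighbor features (GAT-style,
    greatly simplified)."""
    h = feats[node_id]
    nbrs = sorted(neighbors[node_id])
    if not nbrs:
        return h
    scores = np.array([h @ feats[n] for n in nbrs])
    alpha = softmax(scores)
    return h + sum(a * feats[n] for a, n in zip(alpha, nbrs))

embeddings = {p["id"]: attend(p["id"]) for p in posts}
print({k: v.round(2) for k, v in embeddings.items()})
```

The graph construction from shared tags and users follows the idea described in the abstract; in the paper the per-neighbor attention weights and the multimodal fusion are learned end-to-end rather than fixed as in this sketch.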



Published In

ACM Transactions on Internet Technology, Volume 21, Issue 3
August 2021, 522 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3468071
Editor: Ling Liu
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 June 2021
Accepted: 01 August 2020
Revised: 01 July 2020
Received: 01 May 2020
Published in TOIT Volume 21, Issue 3


Author Tags

  1. Social image
  2. graph convolutional network
  3. multimodal
  4. attention network
  5. representation learning

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Guangdong Province, China
  • Guangdong Provincial Key R&D Plan


Cited By

  • (2024) Evaluation of the impact of security perception on the structural changes of MSEs through system dynamics. Heliyon 10(21), e39085. https://doi.org/10.1016/j.heliyon.2024.e39085. Online publication date: Nov-2024.
  • (2023) DRAKE: Deep Pair-Wise Relation Alignment for Knowledge-Enhanced Multimodal Scene Graph Generation in Social Media Posts. IEEE Transactions on Circuits and Systems for Video Technology 33(7), 3199-3213. https://doi.org/10.1109/TCSVT.2022.3231437. Online publication date: 1-Jul-2023.
  • (2023) Explicit time embedding based cascade attention network for information popularity prediction. Information Processing and Management 60(3). https://doi.org/10.1016/j.ipm.2023.103278. Online publication date: 1-May-2023.
  • (2023) Multimodal learning of social image representation. Digital Image Enhancement and Reconstruction, 139-150. https://doi.org/10.1016/B978-0-32-398370-9.00013-5. Online publication date: 2023.
