Abstract
Real-world applications usually encounter data in multiple modalities, each carrying valuable information. To serve these applications well, it is essential to analyze the information extracted from every modality, yet most existing learning models focus on a single modality and ignore the rest. This paper presents a new multimodal deep learning framework for event detection from videos that leverages recent advances in deep neural networks. First, several deep learning models are utilized to extract useful information from the individual modalities: pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction, and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates the different data representations at two levels, namely the frame level and the video level. Unlike existing multimodal learning algorithms, the proposed framework can also reason about a missing data type using the other available modalities. The framework is evaluated on a new video dataset containing natural disaster classes. The experimental results demonstrate its effectiveness compared to single-modality deep learning models as well as conventional fusion techniques; specifically, the final accuracy improves by more than 16% and 7% over the best single-modality and fusion baselines, respectively.
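To make the two-level fusion concrete, the following is a minimal sketch of the idea described above: frame-level fusion of per-frame visual and audio CNN features, temporal pooling to a video-level code, and video-level fusion with a text embedding. It is written in PyTorch for illustration only; the layer sizes, feature dimensions, pooling operator, and module names are assumptions, since the abstract does not specify the actual architecture.

import torch
import torch.nn as nn

class TwoLevelFusion(nn.Module):
    # Frame-level fusion (visual + audio per frame), then video-level
    # fusion (temporally pooled frame code + text embedding).
    # All dimensions below are hypothetical placeholders.
    def __init__(self, vis_dim=2048, aud_dim=512, txt_dim=300,
                 hidden=256, num_classes=7):
        super().__init__()
        self.frame_fuse = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, hidden), nn.ReLU())
        self.video_fuse = nn.Sequential(
            nn.Linear(hidden + txt_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, vis, aud, txt):
        # vis: (batch, frames, vis_dim); aud: (batch, frames, aud_dim);
        # txt: (batch, txt_dim), e.g., an averaged word-embedding vector.
        frame = self.frame_fuse(torch.cat([vis, aud], dim=-1))  # frame level
        video = frame.mean(dim=1)  # temporal average pooling
        fused = self.video_fuse(torch.cat([video, txt], dim=-1))  # video level
        return self.classifier(fused)

A missing modality could be handled in such a sketch by, for example, substituting a learned default vector for the absent branch, though the paper's actual mechanism for reasoning about missing data types may differ.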



Additional information
This article belongs to the Topical Collection: Special Issue on Social Media and Interactive Technologies
Guest Editors: Timothy K. Shih, Lin Hui, Somchoke Ruengittinun, and Qing Li
Cite this article
Tian, H., Tao, Y., Pouyanfar, S. et al. Multimodal deep representation learning for video classification. World Wide Web 22, 1325–1341 (2019). https://doi.org/10.1007/s11280-018-0548-3