Abstract
Multi-modality fusion has recently drawn much attention due to the rapid growth of multimedia data. A document that consists of multiple modalities, i.e. image, text and video, can be better understood by machines if the information from the different modalities is semantically combined. In this paper, we propose to fuse image and text information with a deep neural network (DNN) based approach. By jointly fusing visual and textual features and taking the correlation between image and text into account, fusion features can be learned for representing a document. We investigated the fusion features on document categorization and found that DNN-based fusion outperforms mainstream algorithms, including K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes (NB) and 3-layer Neural Network (3L-NN), in both early and late fusion strategies.
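The two baseline strategies named in the abstract, early and late fusion, can be illustrated with a minimal sketch. This is not the DNN-based method proposed in the paper; it is a generic illustration, assuming that early fusion means concatenating per-modality feature vectors before classification and that late fusion means combining per-modality class scores afterwards. All names, feature dimensions, the random data and the weight `w_img` are hypothetical.

```python
import numpy as np


def early_fusion(img_feat, txt_feat):
    """Early fusion (assumed form): concatenate the feature vectors of
    both modalities into one joint vector per document."""
    return np.concatenate([img_feat, txt_feat], axis=1)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def late_fusion(img_scores, txt_scores, w_img=0.5):
    """Late fusion (assumed form): weighted average of the class
    probabilities produced by one classifier per modality."""
    return w_img * img_scores + (1.0 - w_img) * txt_scores


# Hypothetical data: 4 documents, 8-d image features, 5-d text features.
rng = np.random.default_rng(0)
img_feat = rng.normal(size=(4, 8))
txt_feat = rng.normal(size=(4, 5))

# Early fusion yields one 13-d vector per document for a single classifier.
fused_feat = early_fusion(img_feat, txt_feat)

# Late fusion starts from per-modality class probabilities (3 classes here),
# simulated with random logits in place of real classifier outputs.
img_scores = softmax(rng.normal(size=(4, 3)))
txt_scores = softmax(rng.normal(size=(4, 3)))
fused_scores = late_fusion(img_scores, txt_scores)
pred = fused_scores.argmax(axis=1)  # predicted category index per document
```

In this sketch the late-fusion weight is fixed by hand; a learned fusion, such as the DNN-based approach the abstract describes, would instead fit the combination from data.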
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, C., Yang, H., Meinel, C. (2015). Visual-Textual Late Semantic Fusion Using Deep Neural Network for Document Categorization. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9489. Springer, Cham. https://doi.org/10.1007/978-3-319-26532-2_73
DOI: https://doi.org/10.1007/978-3-319-26532-2_73
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26531-5
Online ISBN: 978-3-319-26532-2
eBook Packages: Computer Science, Computer Science (R0)