Abstract
Images embedded in documents carry rich information that is vital for content extraction and knowledge construction. Interpreting the information in diagrams, scanned tables, and other types of images enriches the underlying concepts, but requires a classifier that can recognize the huge variability of embedded image types and enable the reconstruction of their relationships. Here we tested different deep learning-based approaches for image classification on a dataset of 32K images extracted from documents and divided into 62 categories, for which we obtain an accuracy of \(\sim 85\%\). We also investigate to what extent textual information improves classification performance when combined with visual features. The textual features were obtained either from text embedded in the images or from image captions. Our findings suggest that textual information carries relevant information about the image category and that multimodal classification provides up to 7% better accuracy than classification based on a single data type.
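The abstract does not state how the visual and textual features are combined, so the following is only a minimal sketch of one common option, feature-level fusion by concatenation, and not the authors' method. The function names (`l2_normalize`, `fuse`, `nearest_centroid`), the nearest-centroid classifier, the category labels, and the toy feature values are all assumptions made for illustration; in practice the visual vector would come from a convolutional network and the textual vector from embeddings of the caption or of text found in the image.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length so neither modality dominates."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(visual, textual):
    """Feature-level fusion (assumed here): concatenate the normalized vectors."""
    return l2_normalize(visual) + l2_normalize(textual)

def nearest_centroid(train, query):
    """Return the label whose mean fused vector is closest to the query."""
    best_label, best_dist = None, float("inf")
    for label, vectors in train.items():
        dim = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        dist = math.dist(centroid, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy data (hypothetical): the visual features of the two categories are
# almost identical, so the textual features have to separate them.
train = {
    "bar_chart": [fuse([0.90, 0.10], [1.0, 0.0]), fuse([0.80, 0.20], [0.9, 0.1])],
    "table": [fuse([0.85, 0.15], [0.0, 1.0]), fuse([0.90, 0.10], [0.1, 0.9])],
}

query = fuse([0.88, 0.12], [0.05, 0.95])
predicted = nearest_centroid(train, query)
print(predicted)
```

In this toy setting the visual part of the query is nearly equidistant from both categories, and the textual part decides the prediction, which mirrors the kind of gain the abstract attributes to adding text.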
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Viana, M., Nguyen, QB., Smith, J., Gabrani, M. (2018). Multimodal Classification of Document Embedded Images. In: Fornés, A., Lamiroy, B. (eds) Graphics Recognition. Current Trends and Evolutions. GREC 2017. Lecture Notes in Computer Science, vol 11009. Springer, Cham. https://doi.org/10.1007/978-3-030-02284-6_4
DOI: https://doi.org/10.1007/978-3-030-02284-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02283-9
Online ISBN: 978-3-030-02284-6
eBook Packages: Computer Science (R0)