Multimodal Classification of Document Embedded Images

Viana, Matheus; Nguyen, Quoc-Bao; Smith, John; Gabrani, Maria

doi:10.1007/978-3-030-02284-6_4

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11009)

Included in the following conference series:

International Workshop on Graphics Recognition

Abstract

Images embedded in documents carry rich information that is vital for content extraction and knowledge construction. Interpreting the information in diagrams, scanned tables, and other types of images enriches the underlying concepts, but it requires a classifier that can recognize the huge variability of potential embedded image types and enable the reconstruction of their relationships. Here we test different deep learning-based approaches for image classification on a dataset of 32K images extracted from documents and divided into 62 categories, for which we obtain an accuracy of $\sim 85\%$. We also investigate to what extent textual information improves classification performance when combined with visual features. The textual features were obtained either from text embedded in the images or from image captions. Our findings suggest that textual information carries relevant information about the image category and that multimodal classification provides up to 7% better accuracy than single data type classification.
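
The idea of combining visual and textual features can be illustrated with a minimal sketch. The abstract does not say how the two modalities are fused, so the code below assumes simple feature concatenation, and it swaps in a nearest-centroid classifier as a stand-in for the deep networks used in the paper. All function names, feature vectors, and class labels are hypothetical and invented for illustration only.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, textual):
    """Assumed fusion scheme: normalize each modality, then concatenate."""
    return l2_normalize(visual) + l2_normalize(textual)


def fit_centroids(features, labels):
    """Compute one mean vector per class (stand-in for a real classifier)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(label):
        return sum((c - x) ** 2 for c, x in zip(centroids[label], vec))
    return min(centroids, key=sq_dist)


# Toy data (invented): the two classes look alike visually
# but differ clearly in their textual features.
visual = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.15], [0.95, 0.1]]
textual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["table", "table", "chart", "chart"]

fused = [fuse(v, t) for v, t in zip(visual, textual)]
centroids = fit_centroids(fused, labels)

# A query whose visual features are ambiguous but whose text resembles "chart".
query = fuse([0.97, 0.12], [0.05, 0.95])
prediction = predict(centroids, query)
print(prediction)
```

In this toy setup the visual vectors of the two classes are nearly identical, so the textual half of the fused vector is what separates them. That mirrors, in a very simplified way, the claim that text carries information about the image category; it says nothing about the size of the gain reported in the paper.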

Author information

Authors and Affiliations

IBM Research Brazil, Rua Tutóia, 1157, São Paulo, 04007-900, Brazil
Matheus Viana
IBM Thomas J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY, 10598, USA
Quoc-Bao Nguyen & John Smith
IBM Zurich Research Laboratory, Säumerstrasse 4, 8803, Rüschlikon, Switzerland
Maria Gabrani

Corresponding author

Correspondence to Matheus Viana.

Editor information

Editors and Affiliations

Computer Vision Center, Autonomous University of Barcelona, Bellaterra, Barcelona, Spain
Alicia Fornés
Université de Lorraine, Nancy, France
Bart Lamiroy

About this paper

Cite this paper

Viana, M., Nguyen, QB., Smith, J., Gabrani, M. (2018). Multimodal Classification of Document Embedded Images. In: Fornés, A., Lamiroy, B. (eds) Graphics Recognition. Current Trends and Evolutions. GREC 2017. Lecture Notes in Computer Science, vol 11009. Springer, Cham. https://doi.org/10.1007/978-3-030-02284-6_4

DOI: https://doi.org/10.1007/978-3-030-02284-6_4
Published: 23 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02283-9
Online ISBN: 978-3-030-02284-6
eBook Packages: Computer Science, Computer Science (R0)
