Automatic Extraction of Text and Non-text Information Directly from Compressed Document Images

Javed, Mohammed; Nagabhushan, P.; Chaudhuri, Bidyut B.

doi:10.1007/978-3-319-52941-7_5

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 552)

Included in the following conference series:

International Conference on Hybrid Intelligent Systems

Abstract

Text, images, audio, and video make up the bulk of the Big Data generated today. Such data are usually archived and transmitted in compressed form for storage and transmission efficiency. Compression, however, makes processing expensive: the data must be decompressed every time it is processed, which demands additional computing resources. It would therefore be advantageous to carry out data processing and information extraction directly on the compressed data, without decompression. Against this backdrop, the paper demonstrates a novel technique for extracting text and non-text information directly from compressed document images (as supported by the TIFF and PDF formats), using correlation-entropy features computed directly from the compressed representation. Experimental results on compressed printed text document images validate the proposed method and show that the text and non-text information extracted from the compressed documents is identical to that obtained from the uncompressed representation.
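To make the idea of working in the compressed domain concrete, the following is a minimal sketch, not the authors' method. It assumes each image row is stored as alternating run lengths (background first), which is an assumption here since the abstract does not name the encoding. The function names and the two simple features (foreground pixel count and colour-transition count per row) are illustrative stand-ins, not the paper's correlation-entropy features. The sketch only shows that such row features can be read off the runs and agree with the values computed from the decoded pixels.

```python
from typing import List, Tuple


def foreground_count(runs: List[int]) -> int:
    """Foreground pixels in a row: sum of the odd-indexed (foreground) runs."""
    return sum(runs[1::2])


def transition_count(runs: List[int]) -> int:
    """Colour changes in a row: number of non-empty runs minus one.

    Assumes only the first run may have length zero (row starts with foreground).
    """
    nonzero = sum(1 for r in runs if r > 0)
    return max(nonzero - 1, 0)


def decode(runs: List[int]) -> List[int]:
    """Expand runs to pixels (0 = background, 1 = foreground); used only as a check."""
    pixels: List[int] = []
    for i, length in enumerate(runs):
        pixels.extend([i % 2] * length)
    return pixels


def pixel_features(pixels: List[int]) -> Tuple[int, int]:
    """The same two features computed the conventional way, from decoded pixels."""
    changes = sum(1 for a, b in zip(pixels, pixels[1:]) if a != b)
    return sum(pixels), changes


# Hypothetical 10-pixel-wide rows, given as run lengths.
rows = [
    [10],             # blank row
    [2, 3, 1, 2, 2],  # background, foreground, background, foreground, background
    [0, 4, 6],        # row that starts with foreground
]

compressed = [(foreground_count(r), transition_count(r)) for r in rows]
uncompressed = [pixel_features(decode(r)) for r in rows]
print(compressed == uncompressed)
```

The compressed-domain functions touch only the run lengths, so their cost depends on the number of runs in a row rather than on its width in pixels.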

Author information

Authors and Affiliations

Department of Computer Science and Engineering, NMAM Institute of Technology (Affiliated to VTU, Belagavi), Nitte, 574110, India
Mohammed Javed
Department of Studies in Computer Science, University of Mysore, Mysuru, 570006, India
P. Nagabhushan
CVPR Unit, Indian Statistical Institute, Kolkata, 700108, India
Bidyut B. Chaudhuri

Corresponding author

Correspondence to Mohammed Javed.

Editor information

Editors and Affiliations

Machine Intelligence Research Labs (MIR Labs), Auburn, Washington, USA
Ajith Abraham
Hassan 1st University, Settat, Morocco
Abdelkrim Haqiq
ENIS, University of Sfax, Sfax, Tunisia
Adel M. Alimi
Technopolis Rabat-Shore Rocade, International University of Rabat, Sala el Jadida, Morocco
Ghita Mezzour
Inst Applied Dept of Electronics, Taffala, University of Sousse, Sousse, Tunisia
Nizar Rokbani
Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka, Durian Tunggal, Malaysia
Azah Kamilah Muda

About this paper

Cite this paper

Javed, M., Nagabhushan, P., Chaudhuri, B.B. (2017). Automatic Extraction of Text and Non-text Information Directly from Compressed Document Images. In: Abraham, A., Haqiq, A., Alimi, A., Mezzour, G., Rokbani, N., Muda, A. (eds) Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). HIS 2016. Advances in Intelligent Systems and Computing, vol 552. Springer, Cham. https://doi.org/10.1007/978-3-319-52941-7_5

DOI: https://doi.org/10.1007/978-3-319-52941-7_5
Published: 23 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52940-0
Online ISBN: 978-3-319-52941-7
eBook Packages: Engineering, Engineering (R0)
