Abstract
Text, images, audio, and video make up the bulk of the Big Data generated today. Such data are usually archived and transmitted in compressed form for storage and transmission efficiency. Compression, however, makes processing expensive: the data must be decompressed every time it is processed, which demands additional computing resources. It would therefore be valuable if data processing and information extraction could be carried out directly on the compressed data, without decompression. Against this backdrop, this paper demonstrates a novel technique for extracting text and non-text information directly from compressed document images (as supported by the TIFF and PDF formats) using correlation-entropy features computed directly from the compressed representation. Experimental results on compressed printed text document images validate the proposed method and show that the text and non-text information extracted from the compressed documents is identical to that obtained from the uncompressed representation.
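To make the idea of computing features without decompression concrete, the following is a minimal Python sketch. It does not implement the paper's correlation-entropy features, which the abstract does not define. As a stand-in it derives two simple per-row quantities, the foreground pixel count and the number of colour transitions, straight from run lengths. The row layout (alternating run lengths starting with a white run, with a leading 0 when the row begins with black) and all function names are assumptions made for illustration.

```python
from typing import List

# Assumed layout: each row is a list of alternating run lengths that starts
# with a white (background) run. A leading 0 means the row starts with black.
# Zero-length runs are assumed to occur only in that leading position.
Row = List[int]


def foreground_count(runs: Row) -> int:
    """Black pixels in the row: the sum of the runs at odd positions."""
    return sum(runs[1::2])


def transition_count(runs: Row) -> int:
    """Colour changes along the row, read from the number of non-empty runs."""
    nonzero = [r for r in runs if r > 0]
    return max(len(nonzero) - 1, 0)


def decode(runs: Row) -> List[int]:
    """Decompress one row to pixels (0 = white, 1 = black); used only as a check."""
    pixels: List[int] = []
    colour = 0
    for r in runs:
        pixels.extend([colour] * r)
        colour = 1 - colour
    return pixels


# Hypothetical 8-pixel-wide rows: blank, one black run, starts black, alternating.
rows: List[Row] = [[8], [2, 3, 3], [0, 4, 4], [1, 1, 1, 1, 1, 1, 1, 1]]

profile = [foreground_count(r) for r in rows]
transitions = [transition_count(r) for r in rows]
```

Both features are obtained by touching only the run lengths, so the cost per row depends on the number of runs rather than the image width. Comparing them with the same quantities computed on `decode(r)` is one way to check that the compressed-domain and uncompressed results agree.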
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Javed, M., Nagabhushan, P., Chaudhuri, B.B. (2017). Automatic Extraction of Text and Non-text Information Directly from Compressed Document Images. In: Abraham, A., Haqiq, A., Alimi, A., Mezzour, G., Rokbani, N., Muda, A. (eds) Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). HIS 2016. Advances in Intelligent Systems and Computing, vol 552. Springer, Cham. https://doi.org/10.1007/978-3-319-52941-7_5
DOI: https://doi.org/10.1007/978-3-319-52941-7_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52940-0
Online ISBN: 978-3-319-52941-7
eBook Packages: Engineering, Engineering (R0)