Automatic Extraction of Text and Non-text Information Directly from Compressed Document Images

  • Conference paper
Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 552)

Abstract

Text, images, audio, and video make up the bulk of the Big Data generated in today’s tech-savvy world. Such data are usually archived and transmitted in compressed form for storage and transmission efficiency. Compression, however, makes processing expensive: the data must be decompressed each time they are processed, which requires additional computing resources. It would therefore be beneficial if data processing and information extraction could be carried out directly on the compressed data, without decompression. Against this backdrop, this paper demonstrates a novel technique for extracting text and non-text information directly from compressed document images (as supported by the TIFF and PDF formats) using correlation-entropy features computed directly from the compressed representation. Experimental results on compressed printed text document images validate the proposed method and show that the text and non-text information extracted from the compressed documents is identical to that obtained from the uncompressed representation.
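
As a rough illustration of what working directly on compressed data can look like, the sketch below computes two simple row features from a run-length representation of a binary image row and never rebuilds the pixels. It is not the correlation-entropy feature of the paper, whose definition is not given in this abstract. The run-length layout (alternating runs, starting with a background run) and the function names run_length_encode, foreground_count and transition_count are assumptions made for this example.

```python
from typing import List

Row = List[int]  # one binary image row: 0 = background, 1 = foreground


def run_length_encode(row: Row) -> List[int]:
    """Encode a binary row as alternating run lengths, starting with background.

    Assumed layout for this sketch: a row that begins with a foreground
    pixel gets a leading background run of length 0.
    """
    runs: List[int] = []
    current, count = 0, 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current, count = pixel, 1
    runs.append(count)
    return runs


def foreground_count(runs: List[int]) -> int:
    """Projection-profile value of the row: sum of the foreground runs."""
    # Foreground runs sit at odd positions because the list starts with background.
    return sum(runs[1::2])


def transition_count(runs: List[int]) -> int:
    """Number of 0->1 and 1->0 transitions, read from the run list alone."""
    # Only the leading run can be zero-length; drop it before counting boundaries.
    nonzero = [r for r in runs if r > 0]
    return max(len(nonzero) - 1, 0)


if __name__ == "__main__":
    row = [0, 1, 1, 0, 1]
    runs = run_length_encode(row)
    # Both features are obtained from the runs and agree with the pixel-level values.
    print(runs, foreground_count(runs), transition_count(runs))
```

Under these assumptions, both values match what a pixel-by-pixel scan of the uncompressed row would give, which is the kind of equivalence the abstract reports for the extracted text and non-text information.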

Corresponding author

Correspondence to Mohammed Javed.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Javed, M., Nagabhushan, P., Chaudhuri, B.B. (2017). Automatic Extraction of Text and Non-text Information Directly from Compressed Document Images. In: Abraham, A., Haqiq, A., Alimi, A., Mezzour, G., Rokbani, N., Muda, A. (eds) Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). HIS 2016. Advances in Intelligent Systems and Computing, vol 552. Springer, Cham. https://doi.org/10.1007/978-3-319-52941-7_5

  • DOI: https://doi.org/10.1007/978-3-319-52941-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52940-0

  • Online ISBN: 978-3-319-52941-7

  • eBook Packages: Engineering, Engineering (R0)
