Spotting of Keyword Directly in Run-Length Compressed Documents

  • Conference paper
Proceedings of International Conference on Computer Vision and Image Processing

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 459)

Abstract

With the rapid growth of digital libraries, e-governance and Internet applications, a huge volume of documents is being generated, communicated and archived in compressed form for better storage and transfer efficiency. In such large repositories of compressed documents, frequently used operations like keyword searching and document retrieval currently have to be carried out after decompression, and subsequently with the help of OCR. Developing a keyword spotting technique that works directly on compressed documents is therefore a promising and challenging research issue. Against this backdrop, the paper presents a novel approach for searching keywords directly in run-length compressed documents without going through the stages of decompression and OCR. The proposed method extracts simple, font-size-invariant features, such as the number of run transitions and the correlation of runs over selected regions of the test words, and matches them against those of the user-queried word. In the subsequent step, the keywords are spotted in the compressed document based on the matching score. The idea of decompression-less and OCR-less word spotting directly in compressed documents is the major contribution of this paper. The method is tested on a data set of compressed documents, and the preliminary results validate the proposed idea.
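
The abstract names the features but does not say how they are computed. As a rough illustration only, and not the authors' implementation, the Python sketch below assumes each row of a word image is stored as alternating white/black run lengths, counts the run transitions per row from the runs alone, and compares two words by correlating their per-row transition profiles. The function names, the nearest-neighbour resampling used to compare words of different heights, and the use of Pearson correlation as the matching score are all assumptions made for this example.

```python
from typing import List, Sequence


def rle_row(row: Sequence[int]) -> List[int]:
    """Encode one binary row (0 = white, 1 = black) as alternating run
    lengths, starting with a white run (which may be 0)."""
    runs: List[int] = []
    current, count = 0, 0
    for px in row:
        if px == current:
            count += 1
        else:
            runs.append(count)
            current, count = px, 1
    runs.append(count)
    return runs


def run_transitions(runs: Sequence[int]) -> int:
    """Count colour changes in a row using only its run lengths."""
    nonzero = [r for r in runs if r > 0]
    return max(len(nonzero) - 1, 0)


def transition_profile(compressed: Sequence[Sequence[int]]) -> List[int]:
    """Per-row transition counts of a run-length compressed word image."""
    return [run_transitions(runs) for runs in compressed]


def resample(profile: Sequence[int], n: int) -> List[int]:
    """Nearest-neighbour resampling to n entries (assumed normalisation
    so that words of different heights can be compared)."""
    m = len(profile)
    return [profile[min(i * m // n, m - 1)] for i in range(n)]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def match_score(word_a: Sequence[Sequence[int]],
                word_b: Sequence[Sequence[int]],
                n: int = 16) -> float:
    """Hypothetical matching score between two compressed word images."""
    pa = resample(transition_profile(word_a), n)
    pb = resample(transition_profile(word_b), n)
    return pearson(pa, pb)


def scale2x(image: Sequence[Sequence[int]]) -> List[List[int]]:
    """Double every pixel horizontally and vertically (toy font scaling)."""
    out: List[List[int]] = []
    for row in image:
        wide = [px for px in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


# Toy glyph and a twice-as-large copy, both run-length compressed.
glyph = [
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
]
small_rle = [rle_row(r) for r in glyph]
large_rle = [rle_row(r) for r in scale2x(glyph)]
score = match_score(small_rle, large_rle)
```

Doubling the width of every run leaves the number of runs per row unchanged, so in this toy setting the transition counts do not depend on scale and the two profiles correlate strongly. Real documents would also need word segmentation and the region selection mentioned in the abstract, which this sketch does not attempt.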

Author information

Corresponding author

Correspondence to Bidyut Baran Chaudhuri.

Copyright information

© 2017 Springer Science+Business Media Singapore

About this paper

Cite this paper

Javed, M., Nagabhushan, P., Chaudhuri, B.B. (2017). Spotting of Keyword Directly in Run-Length Compressed Documents. In: Raman, B., Kumar, S., Roy, P., Sen, D. (eds) Proceedings of International Conference on Computer Vision and Image Processing. Advances in Intelligent Systems and Computing, vol 459. Springer, Singapore. https://doi.org/10.1007/978-981-10-2104-6_33

  • DOI: https://doi.org/10.1007/978-981-10-2104-6_33

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-2103-9

  • Online ISBN: 978-981-10-2104-6

  • eBook Packages: Engineering, Engineering (R0)
