Matching word images for content-based retrieval from printed document images

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

As digital libraries archive large quantities of document images, there is a need for efficient search strategies that make these images available according to users' information needs. In this paper, we propose an effective word image matching scheme that achieves high performance in the presence of script variability, printing variations, degradation and word-form variants. A novel partial matching algorithm is designed for morphological matching of word-form variants in a language. We formulate a feature extraction scheme that extracts local features by scanning vertical strips of the word image and combines them automatically based on their discriminatory potential. We present a detailed performance analysis of the proposed approach on English, Amharic and Hindi documents.
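To make the idea of strip-wise features concrete, the following is a minimal sketch and not the scheme proposed in the paper. It assumes a binary word image stored as a list of rows (1 for ink, 0 for background), picks three illustrative features per vertical strip (ink density and the normalized top and bottom ink positions), and compares two feature sequences with plain dynamic time warping. DTW is a generic stand-in here, not the paper's partial matching algorithm. The names `strip_features` and `dtw_distance` are hypothetical.

```python
from math import inf


def strip_features(image, strip_width=1):
    """Return one feature tuple per vertical strip of a binary word image.

    Assumed features (illustrative only): ink density, and the normalized
    top and bottom positions of the ink inside the strip.
    """
    height, width = len(image), len(image[0])
    features = []
    for x0 in range(0, width, strip_width):
        cols = range(x0, min(x0 + strip_width, width))
        ink = sum(image[y][x] for y in range(height) for x in cols)
        ink_rows = [y for y in range(height) if any(image[y][x] for x in cols)]
        density = ink / (height * len(cols))
        if ink_rows:
            top = ink_rows[0] / height
            bottom = (ink_rows[-1] + 1) / height
        else:
            top = bottom = 0.0  # assumption: an empty strip has no profile
        features.append((density, top, bottom))
    return features


def dtw_distance(a, b):
    """Dynamic time warping cost between two feature sequences,
    normalized by the sum of their lengths."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two strip feature tuples.
            d = sum((p - q) ** 2 for p, q in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


# Toy 3x4 "word image", a horizontally stretched copy, and an unrelated image.
word = [[0, 1, 0, 1],
        [1, 1, 0, 1],
        [0, 1, 0, 0]]
stretched = [[v for v in row for _ in range(2)] for row in word]
other = [[1, 0, 0, 0],
         [1, 0, 0, 0],
         [1, 0, 0, 0]]

same = dtw_distance(strip_features(word), strip_features(stretched))
diff = dtw_distance(strip_features(word), strip_features(other))
```

In this toy setup the stretched copy aligns with the original at zero cost, because the warping path absorbs the repeated strips, while the unrelated image yields a positive cost. A real system would presumably need size normalization, richer features and a matching step that tolerates prefixes or suffixes of word-form variants, none of which this sketch attempts.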

Author information

Corresponding author

Correspondence to C. V. Jawahar.

About this article

Cite this article

Meshesha, M., Jawahar, C.V. Matching word images for content-based retrieval from printed document images. IJDAR 11, 29–38 (2008). https://doi.org/10.1007/s10032-008-0067-3

  • DOI: https://doi.org/10.1007/s10032-008-0067-3
