Keyword-guided word spotting in historical printed documents using synthetic data and user feedback

  • Original Paper
  • International Journal of Document Analysis and Recognition (IJDAR)

Abstract

In this paper, we propose a novel technique for word spotting in historical printed documents that combines synthetic data with user feedback. Our aim is to search a large collection of digitized historical printed documents for keywords typed by the user. The proposed method consists of four stages: (1) creation of synthetic word images; (2) word segmentation using dynamic parameters; (3) efficient feature extraction for each word image; and (4) a retrieval procedure that is optimized by user feedback. Experimental results demonstrate the efficiency of the proposed approach.
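
The retrieval stage with user feedback can be illustrated with a minimal sketch. The abstract does not specify the features, the distance measure, or the feedback rule, so everything below is an assumption made for illustration: word images are represented by short hypothetical feature vectors, candidates are ranked by Euclidean distance to the features of the synthetic query, and feedback is modeled by replacing the query with the mean of the words the user marks as correct. The names `distance`, `rank`, and `refine_query` are hypothetical and are not taken from the paper.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, candidates):
    """Return candidate ids sorted by increasing distance to the query."""
    return sorted(candidates, key=lambda cid: distance(query, candidates[cid]))


def refine_query(candidates, selected_ids):
    """Average the feature vectors of the words the user marked as correct.

    Assumed feedback rule for this sketch, not the paper's procedure.
    """
    vectors = [candidates[cid] for cid in selected_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 3-dimensional features standing in for the output of stage (3).
synthetic_query = [0.0, 0.0, 0.0]  # features of the synthetic keyword image
segmented_words = {
    "w1": [1.0, 0.0, 0.0],
    "w2": [3.0, 3.0, 0.0],
    "w3": [3.0, 4.0, 0.0],
    "w4": [0.0, 6.0, 0.0],
}

# First pass: rank the segmented words against the synthetic query alone.
first_pass = rank(synthetic_query, segmented_words)

# The user marks w2 and w3 as correct matches; the query is rebuilt from them.
refined_query = refine_query(segmented_words, ["w2", "w3"])
second_pass = rank(refined_query, segmented_words)

print(first_pass)
print(second_pass)
```

In this toy data the synthetic query initially ranks `w1` first, while the refined query moves the two user-confirmed words to the top, which is the intended effect of a feedback step: real instances from the collection stand in for the synthetic rendering once the user has confirmed them.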

Corresponding author

Correspondence to T. Konidaris.

About this article

Cite this article

Konidaris, T., Gatos, B., Ntzios, K. et al. Keyword-guided word spotting in historical printed documents using synthetic data and user feedback. IJDAR 9, 167–177 (2007). https://doi.org/10.1007/s10032-007-0042-4

  • DOI: https://doi.org/10.1007/s10032-007-0042-4
