Abstract
As digital libraries archive large quantities of document images, efficient search strategies are needed to make these images available according to users' information needs. In this paper, we propose an effective word image matching scheme that achieves high performance in the presence of script variability, printing variations, degradations and word-form variants. A novel partial matching algorithm is designed for morphological matching of word-form variants in a language. We formulate a feature extraction scheme that extracts local features by scanning vertical strips of the word image and combines them automatically based on their discriminatory potential. We present a detailed performance analysis of the proposed approach on English, Amharic and Hindi documents.
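To make the idea of scanning vertical strips concrete, the following is a minimal sketch, not the method of the paper. It assumes a binary word image stored as a list of pixel rows (1 = ink), uses three illustrative per-strip features (ink density, upper profile, lower profile), and compares two words with plain dynamic time warping. The function names `strip_features` and `dtw_distance`, the feature choice, and the use of unmodified DTW are all assumptions made for illustration; the abstract does not specify the features, their combination, or the partial matching algorithm.

```python
from math import inf


def strip_features(image, strip_width=1):
    """Return one feature tuple per vertical strip of a binary word image.

    image: list of rows, each a list of 0/1 pixels (1 = ink).
    Illustrative features (an assumption, not the paper's feature set):
    ink density, normalized top-most ink row, normalized bottom-most ink row.
    """
    height, width = len(image), len(image[0])
    features = []
    for left in range(0, width, strip_width):
        cols = range(left, min(left + strip_width, width))
        # Rows of this strip that contain at least one ink pixel.
        ink_rows = [r for r in range(height) if any(image[r][c] for c in cols)]
        ink = sum(image[r][c] for r in range(height) for c in cols)
        density = ink / (height * len(cols))
        top = ink_rows[0] / height if ink_rows else 1.0
        bottom = ink_rows[-1] / height if ink_rows else 0.0
        features.append((density, top, bottom))
    return features


def dtw_distance(a, b):
    """Length-normalized dynamic time warping cost between two strip sequences."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two strip feature tuples.
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


# Two tiny hypothetical "word images" used only to exercise the functions.
word_a = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
word_b = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
feats_a = strip_features(word_a)
feats_b = strip_features(word_b)
```

Because the alignment is elastic, two renderings of the same word with different widths can still align strip by strip. Handling word-form variants would additionally require tolerating unmatched prefixes or suffixes, which this sketch does not attempt.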
Cite this article
Meshesha, M., Jawahar, C.V. Matching word images for content-based retrieval from printed document images. IJDAR 11, 29–38 (2008). https://doi.org/10.1007/s10032-008-0067-3
DOI: https://doi.org/10.1007/s10032-008-0067-3