Abstract
In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine. A preprocessing step improves the quality of the document images, and word segmentation is accomplished using two complementary segmentation methodologies. In the proposed methodology, synthetic word images are created from keywords, and these images are compared to all the words in the digitized documents. A user feedback process is used to refine the search procedure. The methodology has been evaluated on early Modern Greek documents printed during the seventeenth and eighteenth centuries. To improve the efficiency of access and search, two natural language processing techniques are employed: a morphological generator, which enables searching in documents using only a base word-form to locate all the corresponding inflected word-forms, and a synonym dictionary, which further facilitates access to the semantic context of documents.
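The query expansion idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `expand_query`, its parameters, and the toy English endings and synonym table are all hypothetical, and a real morphological generator for early Modern Greek would rely on a lexicon and rules rather than a fixed list of endings. The sketch only shows how a single base word-form might be turned into a set of search terms, each of which could then be rendered as a synthetic word image and matched against the segmented words.

```python
from typing import Dict, Iterable, List


def expand_query(
    base_form: str,
    stem_len: int,
    endings: Iterable[str],
    synonyms: Dict[str, List[str]],
) -> List[str]:
    """Return the base form, its generated inflections, and its synonyms.

    Hypothetical helper for illustration only: the stem is taken as the
    first `stem_len` characters and each ending is appended to it.
    """
    stem = base_form[:stem_len]
    candidates = [base_form]
    candidates += [stem + ending for ending in endings]
    candidates += synonyms.get(base_form, [])
    # Remove duplicates while preserving the original order.
    return list(dict.fromkeys(candidates))


# Toy data, invented purely for illustration (not from the paper).
TOY_ENDINGS = ["s", "ed", "ing"]
TOY_SYNONYMS = {"print": ["press"]}

queries = expand_query(
    "print", stem_len=5, endings=TOY_ENDINGS, synonyms=TOY_SYNONYMS
)
print(queries)
```

In this simplified form, synonyms are added as-is; a fuller version would presumably pass each synonym through the morphological generator as well.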
Cite this article
Kesidis, A.L., Galiotou, E., Gatos, B. et al. A word spotting framework for historical machine-printed documents. IJDAR 14, 131–144 (2011). https://doi.org/10.1007/s10032-010-0134-4
DOI: https://doi.org/10.1007/s10032-010-0134-4