Research article
DOI:10.1145/1568296.1568307

Accessing the content of Greek historical documents

Published: 23 July 2009

Abstract

In this paper, we propose an alternative method for accessing the content of Greek historical documents printed during the 17th and 18th centuries: words are searched directly in the digitized documents using word spotting, without an optical character recognition engine. We describe a methodology in which synthetic word images are created from keywords. These images are compared to all the words in the digitized documents, and user feedback is used to refine the search procedure. To improve the efficiency of accessing and searching, we use natural language processing techniques comprising (i) a morphological generator for early Modern Greek, which lets users search documents using only a word stem and locate all the corresponding inflected word forms, and (ii) a synonym dictionary, which facilitates access to the semantic context of documents and enriches the results of the search process.
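
The query-expansion idea in the abstract (stem to inflected forms, plus synonyms) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paradigm table, the lexicon, the synonym entry, and the function names `generate_forms` and `expand_query` are all hypothetical, and the single noun class shown is a simplified Modern Greek paradigm that ignores accent shifts and early Modern Greek variants. In the described system, each expanded form would presumably then be rendered as a synthetic word image for matching; that step is not shown here.

```python
from typing import Dict, List

# Toy inflection table (assumption): endings of one Modern Greek noun class.
# Accent shifts and all other classes are deliberately ignored.
PARADIGMS: Dict[str, List[str]] = {
    "masc_os": ["ος", "ου", "ο", "ε", "οι", "ων", "ους"],
}

# Hypothetical lexicon mapping a stem to its paradigm class.
LEXICON: Dict[str, str] = {"λόγ": "masc_os", "μύθ": "masc_os"}

# Hypothetical synonym dictionary (placeholder data, not a linguistic claim).
SYNONYMS: Dict[str, List[str]] = {"λόγ": ["μύθ"]}


def generate_forms(stem: str) -> List[str]:
    """Return every inflected form the toy paradigm yields for a known stem."""
    endings = PARADIGMS[LEXICON[stem]]
    return [stem + ending for ending in endings]


def expand_query(stem: str, use_synonyms: bool = True) -> List[str]:
    """Expand a stem into its inflected forms, optionally adding synonym stems."""
    stems = [stem]
    if use_synonyms:
        stems += SYNONYMS.get(stem, [])
    forms: List[str] = []
    seen = set()
    for s in stems:
        for form in generate_forms(s):
            if form not in seen:  # keep first occurrence, preserve order
                seen.add(form)
                forms.append(form)
    return forms


if __name__ == "__main__":
    for form in expand_query("λόγ"):
        print(form)
```

Keeping the morphological generator and the synonym lookup as separate functions mirrors the two components named in the abstract, so either one can be switched off (`use_synonyms=False`) without touching the other.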

Published In

AND '09: Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
July 2009
127 pages
ISBN:9781605584966
DOI:10.1145/1568296

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. computational morphology
  2. historical document indexing
  3. natural language processing
  4. word spotting

Qualifiers

  • Research-article

Conference

AND '09

Acceptance Rates

AND '09 Paper Acceptance Rate 15 of 22 submissions, 68%;
Overall Acceptance Rate 15 of 22 submissions, 68%

Cited By

  • (2016) A Survey on handwritten documents word spotting. International Journal of Multimedia Information Retrieval 6:1 (31-47). DOI:10.1007/s13735-016-0110-y. Online publication date: 15-Oct-2016
  • (2015) High performance Query-by-Example keyword spotting using Query-by-String techniques. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (741-745). DOI:10.1109/ICDAR.2015.7333860. Online publication date: 23-Aug-2015
  • (2015) Using attributes for word spotting and recognition in polytonic greek documents. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (686-690). DOI:10.1109/ICDAR.2015.7333849. Online publication date: 23-Aug-2015
  • (2014) Using digital corpora for preserving and processing cultural heritage texts: a case study. Library Review 63:6/7 (408-421). DOI:10.1108/LR-11-2013-0142. Online publication date: 26-Aug-2014
  • (2013) Unsupervised Classification of Structurally Similar Document Images. Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (1225-1229). DOI:10.1109/ICDAR.2013.248. Online publication date: 25-Aug-2013
  • (2011) Efficient Cut-Off Threshold Estimation for Word Spotting Applications. Proceedings of the 2011 International Conference on Document Analysis and Recognition (279-283). DOI:10.1109/ICDAR.2011.64. Online publication date: 18-Sep-2011
  • (2011) A Tool for Tuning Binarization Techniques. Proceedings of the 2011 International Conference on Document Analysis and Recognition (1-5). DOI:10.1109/ICDAR.2011.10. Online publication date: 18-Sep-2011
  • (2011) Digital Libraries and Document Image Retrieval Techniques: A Survey. Learning Structure and Schemas from Documents (181-204). DOI:10.1007/978-3-642-22913-8_9. Online publication date: 2011
