Towards information retrieval on historical document collections: the role of matching procedures and special lexica

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. In order to improve recall for search engines, modern words used in queries have to be associated with corresponding historical variants found in the documents. In the literature, the use of (1) special matching procedures and (2) lexica for historical language has been suggested as two alternative ways to solve this problem. In the first part of the paper, we show how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches. A tool is presented where matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. In the second part of the paper, we ask whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over centuries, the answer is not simple to obtain. We present experiments where the performance of matching procedures in text collections from four centuries is studied. After classifying missed vocabulary, we measure precision and recall of the matching procedure for each period. Results indicate that for earlier periods, matching procedures alone do not lead to satisfactory results. We then describe experiments where the gain for recall obtained from historical lexica of distinct sizes is estimated.
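To make the idea of a matching procedure concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes a small set of hypothetical rewrite rules (modern substring to historical substring), expands a modern query word into candidate historical spellings, and measures precision and recall against a hand-labelled gold set. The rules, the toy vocabulary, and the function names `variants` and `precision_recall` are all illustrative assumptions.

```python
# Hypothetical rewrite rules: modern substring -> historical substring.
# These are illustrative only and are not taken from the paper.
RULES = [("t", "th"), ("ei", "ey")]


def variants(word, rules):
    """Return every spelling obtained by optionally applying a rule
    at each position of `word` (the unchanged word is included)."""
    if not word:
        return {""}
    # Option 1: keep the first character as it is.
    out = {word[0] + rest for rest in variants(word[1:], rules)}
    # Option 2: apply any rule whose modern side matches at this position.
    for modern, historical in rules:
        if word.startswith(modern):
            tails = variants(word[len(modern):], rules)
            out |= {historical + rest for rest in tails}
    return out


def precision_recall(hits, gold):
    """Precision and recall of the retrieved set `hits` against `gold`."""
    correct = len(hits & gold)
    precision = correct / len(hits) if hits else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy corpus vocabulary and gold variants for the modern query "teil".
corpus_vocabulary = {"teil", "theil", "teyl", "thail", "heil"}
gold = {"teil", "theil", "teyl", "thail"}

hits = variants("teil", RULES) & corpus_vocabulary
print(sorted(hits))
print(precision_recall(hits, gold))
```

In this toy setting the rules miss the spelling `thail`, which no rule covers. This is the kind of gap a historical lexicon could close by listing the variant explicitly.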

Author information

Corresponding author

Correspondence to Christoph Ringlstetter.

About this article

Cite this article

Gotscharek, A., Reffle, U., Ringlstetter, C. et al. Towards information retrieval on historical document collections: the role of matching procedures and special lexica. IJDAR 14, 159–171 (2011). https://doi.org/10.1007/s10032-010-0132-6

  • DOI: https://doi.org/10.1007/s10032-010-0132-6
