Advanced Training Set Construction for Retrieval in Historic Documents

Ernst-Gerlach, Andrea; Fuhr, Norbert

doi:10.1007/978-3-642-17187-1_12

Andrea Ernst-Gerlach²⁰ &
Norbert Fuhr²⁰

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 6458))

Included in the following conference series:

Asia Information Retrieval Symposium

1366 Accesses
2 Citations

Abstract

Retrieval in historic documents with non-standard spelling requires a mapping from search terms onto the historic terms in the document. For describing this mapping, we have developed a rule-based approach. The bottleneck of this method has been the training set construction for the algorithm where an expert has to assign manually current word forms to historic spelling variants. As a better solution, we apply a spell checker on a corpus of historic texts, which gives us a list of candidate terms and associated suggestions. The new method generates possible rules for the suggestions and accepts the most frequent rules. Experimental results with German and English texts from different centuries demonstrate the feasibility of our approach. Thus a training set can be constructed with much less initial effort.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Awakian, A.: Development of a user-interface for an interactive rule development. Master thesis, University of Duisburg-Essen (2010)
Google Scholar
Baron, A., Rayson, P.: VARD 2: A tool for dealing with spelling variation in historical corpora. In: Proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham (2008)
Google Scholar
Cendrowska, J.: PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies 27(4), 349–370 (1987)
Article MATH Google Scholar
Ernst-Gerlach, A., Fuhr, N.: Generating Search Term Variants for Text Collections with Historic Spellings. In: Lalmas, M., MacFarlane, A., Rüger, S.M., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 49–60. Springer, Heidelberg (2006)
Chapter Google Scholar
Ernst-Gerlach, A., Fuhr, N.: Retrieval in text collections with historic spelling using linguistic and spelling variants. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 333–341. ACM, New York (2007)
Google Scholar
Gotscharek, A., Neumann, A., Reffle, U., Ringlstetter, C., Schulz, K.U.: Enabling Information Retrieval on Historical Document Collections - the Role of Matching Procedures and Special Lexica. In: Proceedings of the ACM SIGIR 2009 Workshop on Analytics for Noisy Unstructured Text Data (AND 2009), Barcelona, pp. 69–76 (2009)
Google Scholar
Hauser, A., Heller, M., Leiss, E., Schulz, K.U., Wanzeck, C.: Information Access to Historical Documents from the Early New High German Period. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007) Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India, pp. 147–154 (2007)
Google Scholar
Korbar, D.: Visualisation of rule structures and rule modification possibilities for texts with non-standard spelling. Master thesis, University of Duisburg-Essen (2010)
Google Scholar
Pilz, T.: Nichtstandardisierte Rechtschreibung - Variationsmodellierung und rechnergestützte Variationsverarbeitung. Doctoral thesis, University of Duisburg-Essen (2009)
Google Scholar
Pilz, T., Luther, W.: Automated support for evidence retrieval in documents with nonstandard orthography. In: Featherston, S., Winkler, S. (eds.) The Fruits of Empirical Linguistics. Process, vol. 1, pp. 211–228. Mouton de Gruyter, Berlin (2009)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computational and Cognitive Sciences, University of Duisburg-Essen, Lotharstr. 65, 47048, Duisburg, Germany
Andrea Ernst-Gerlach & Norbert Fuhr

Authors

Andrea Ernst-Gerlach
View author publications
You can also search for this author in PubMed Google Scholar
Norbert Fuhr
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Science and Information Engineering, Roosevelt Road National Taiwan University, No. 1, Sec. 4, 10617, Taipei, Taiwan R.O.C.
Pu-Jen Cheng
School of Computing, National University of Singapore (NUS), Computing 1, 13 Computing Drive, 117417, Singapore
Min-Yen Kan
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong Shatin, N.T. Hong Kong, China
Wai Lam
School of Computing, Computing 1, National University of Singapore (NUS), 13 Computing Drive, 117417, Singapore
Preslav Nakov

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ernst-Gerlach, A., Fuhr, N. (2010). Advanced Training Set Construction for Retrieval in Historic Documents. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_12

Download citation

DOI: https://doi.org/10.1007/978-3-642-17187-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17186-4
Online ISBN: 978-3-642-17187-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics