Abstract
Retrieval in historic documents with non-standard spelling requires a mapping from search terms onto the historic terms in the document. To describe this mapping, we have developed a rule-based approach. The bottleneck of this method has been constructing the training set for the algorithm, because an expert has to manually assign current word forms to historic spelling variants. As a better solution, we apply a spell checker to a corpus of historic texts, which yields a list of candidate terms and associated suggestions. The new method generates possible rules for the suggestions and accepts the most frequent rules. Experimental results with German and English texts from different centuries demonstrate the feasibility of our approach. Thus, a training set can be constructed with much less initial effort.
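The idea of generating possible rules from spell-checker suggestions and keeping the most frequent ones can be illustrated with a minimal sketch. It is not the authors' implementation: the rule format (a substring rewrite with one character of left context), the function names `candidate_rules` and `frequent_rules`, the `min_count` threshold, and the sample suggestion list are all assumptions made for illustration, and the character alignment uses Python's standard `difflib` rather than whatever alignment the paper employs.

```python
from collections import Counter
from difflib import SequenceMatcher


def candidate_rules(modern, historic):
    """Return the (modern_part, historic_part) rewrites that turn `modern` into `historic`.

    Assumption: a rule is a plain substring rewrite. Pure insertions or
    deletions are widened by one character of left context when available.
    """
    rules = set()
    matcher = SequenceMatcher(None, modern, historic, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        src, dst = modern[i1:i2], historic[j1:j2]
        if (not src or not dst) and i1 > 0:
            # The preceding opcode is an "equal" block, so both strings
            # share the character directly to the left of this edit.
            src = modern[i1 - 1] + src
            dst = historic[j1 - 1] + dst
        rules.add((src, dst))
    return rules


def frequent_rules(suggestions, min_count=2):
    """Count each rule once per historic term and keep those seen at least `min_count` times."""
    counts = Counter()
    for historic, modern_candidates in suggestions.items():
        seen = set()
        for modern in modern_candidates:
            seen |= candidate_rules(modern, historic)
        counts.update(seen)
    return {rule: n for rule, n in counts.items() if n >= min_count}


# Hypothetical spell-checker output: historic term -> suggested modern forms.
suggestions = {
    "theil": ["teil", "heil"],
    "thun": ["tun"],
    "thal": ["tal"],
}
accepted = frequent_rules(suggestions)
print(accepted)
```

In this toy input the wrong suggestion "heil" produces a rule that occurs only once, so the frequency threshold filters it out, while the rewrite of "t" to "th" is supported by all three historic terms.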
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ernst-Gerlach, A., Fuhr, N. (2010). Advanced Training Set Construction for Retrieval in Historic Documents. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_12
DOI: https://doi.org/10.1007/978-3-642-17187-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17186-4
Online ISBN: 978-3-642-17187-1
eBook Packages: Computer Science (R0)