Abstract
In this paper, we describe a new approach to retrieval in texts with non-standard spelling, a problem that is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in historic orthography. By applying a spell checker to a corpus of historic texts, we generate a list of candidate terms, to which the contemporary spellings have to be assigned manually. From these pairs, our algorithm produces a set of probabilistic rules. The rule probabilities can then be used for ranking in the retrieval stage. An experimental comparison shows that our approach outperforms competing methods.
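To illustrate the idea of probabilistic rewrite rules, the following minimal Python sketch expands a contemporary search term into historic spelling variants and ranks them by rule probability. It is not the authors' algorithm: the rule format, the two example rules, their probabilities, and the function name `generate_variants` are all assumptions made for illustration, and each variant is derived by applying a single rule at a single position.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: (modern substring, historic substring, probability).
# Both the rules and the probabilities are invented for illustration;
# they are not taken from the paper.
RULES: List[Tuple[str, str, float]] = [
    ("t", "th", 0.3),   # e.g. "teil" -> "theil"
    ("ei", "ey", 0.2),  # e.g. "sein" -> "seyn"
]


def generate_variants(
    term: str, rules: List[Tuple[str, str, float]] = RULES
) -> List[Tuple[str, float]]:
    """Return (variant, score) pairs sorted by descending score.

    The unchanged term gets score 1.0. Every other variant results from
    applying exactly one rule at exactly one position, and its score is
    the probability of that rule.
    """
    scores: Dict[str, float] = {term: 1.0}
    for modern, historic, prob in rules:
        start = term.find(modern)
        while start != -1:
            variant = term[:start] + historic + term[start + len(modern):]
            # Keep the highest score if several rules yield the same variant.
            scores[variant] = max(scores.get(variant, 0.0), prob)
            start = term.find(modern, start + 1)
    # Sort by score (descending), then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for variant, score in generate_variants("teil"):
        print(variant, score)
```

In a retrieval setting, the scores could serve as weights for the expanded query terms, so that documents matching a more probable variant are ranked higher.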
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ernst-Gerlach, A., Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_6
DOI: https://doi.org/10.1007/11735106_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)