Abstract
JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in Bulgarian and Hungarian. In our bilingual experiments we used several non-traditional CLEF query languages such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs used subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and web-based machine translation when no suitable data could be found.
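As an illustration of the character n-gram representation mentioned above, the following is a minimal sketch rather than the HAIRCUT implementation. The function name `char_ngrams`, the default length `n = 4`, the lowercasing step, and the use of `_` as a word-boundary marker are all assumptions made for this example; the abstract does not specify them.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of a text.

    Hypothetical helper for illustration only. N-grams are allowed to
    span word boundaries, which are marked with an underscore.
    """
    # Lowercase, split on whitespace, and rejoin with a boundary marker.
    normalized = "_".join(text.lower().split())
    if not normalized:
        return []
    # Pad both ends so that word-initial and word-final n-grams are kept.
    padded = f"_{normalized}_"
    if len(padded) < n:
        # Text shorter than n: emit the whole padded string as one token.
        return [padded]
    # Slide a window of width n across the padded string, one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("Cat nap"))
```

Because the tokens are produced without any language-specific resources such as stemmers or stopword lists, the same routine can be applied unchanged to documents and queries in any of the languages named in the abstract.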
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
McNamee, P. (2006). Exploring New Languages with HAIRCUT at CLEF 2005. In: Peters, C., et al. Accessing Multilingual Information Repositories. CLEF 2005. Lecture Notes in Computer Science, vol 4022. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11878773_17
DOI: https://doi.org/10.1007/11878773_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45697-1
Online ISBN: 978-3-540-45700-8
eBook Packages: Computer Science (R0)