(German) Language Processing for Lucene

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9103)

Abstract

This paper introduces an open-source Java package called German Language Processing for Lucene (glp4lucene). Although originally developed for German texts, it is largely language independent. It aims to facilitate four language processing steps for working with non-English texts in Apache Lucene/Solr: lemmatizing words, weighting terms by their part of speech, adding synonyms, and decompounding nouns. Users should be able to apply these steps without a thorough understanding of natural language processing.
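To make the four steps concrete, the following is a minimal, self-contained sketch in plain Java. It is not the glp4lucene API and does not use Lucene: the class name `ToyGermanPipeline`, its method names, the tiny lookup tables, and the weight values are all hypothetical and chosen only for illustration. In a real setup, a trained lemmatizer and part-of-speech tagger and a large lexicon would presumably take the place of the hand-written maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy illustration of the four steps named in the abstract; NOT the glp4lucene API. */
class ToyGermanPipeline {
    // Hypothetical hand-written resources (assumptions for this sketch only).
    static final Map<String, String> LEMMAS =
            Map.of("kinder", "kind", "schiffe", "schiff", "ging", "gehen");
    // Illustrative weights keyed by STTS-style tags; the values are made up.
    static final Map<String, Float> POS_WEIGHTS =
            Map.of("NN", 2.0f, "VVFIN", 1.0f, "ART", 0.1f);
    static final Map<String, List<String>> SYNONYMS =
            Map.of("schiff", List.of("boot"));
    static final Set<String> LEXICON = Set.of("dampf", "schiff", "haus", "boot");

    /** Step 1: map an inflected form to its lemma (falls back to the lowercased token). */
    static String lemmatize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    /** Step 2: look up a term weight for a part-of-speech tag (default 1.0). */
    static float weight(String posTag) {
        return POS_WEIGHTS.getOrDefault(posTag, 1.0f);
    }

    /** Step 3: return the lemma followed by any known synonyms. */
    static List<String> expand(String lemma) {
        List<String> terms = new ArrayList<>();
        terms.add(lemma);
        terms.addAll(SYNONYMS.getOrDefault(lemma, List.of()));
        return terms;
    }

    /**
     * Step 4: split a compound into lexicon words, trying the longest prefix first
     * and backtracking on failure. Linking elements (such as a joining "s") are not
     * handled. Returns the lowercased word itself when no full split exists.
     */
    static List<String> decompound(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        return split(lower, 0, parts) ? parts : List.of(lower);
    }

    private static boolean split(String w, int start, List<String> parts) {
        if (start == w.length()) {
            return !parts.isEmpty();
        }
        for (int end = w.length(); end > start; end--) {
            String candidate = w.substring(start, end);
            if (LEXICON.contains(candidate)) {
                parts.add(candidate);
                if (split(w, end, parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1); // backtrack and try a shorter prefix
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("Kinder"));
        System.out.println(weight("NN"));
        System.out.println(expand("schiff"));
        System.out.println(decompound("Dampfschiff"));
    }
}
```

In this sketch each step is a separate static method so that the steps can be read and tried in isolation; in a Lucene analysis chain each would more plausibly be a filter stage applied to the token stream.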

Notes

  1. http://lucene.apache.org/, http://lucene.apache.org/solr/.

  2. https://sourceforge.net/projects/glpforlucene.

  3. Think of Shakespeare’s To Be or not to Be, where almost every token is a stop word, and one cannot just ignore them altogether.

  4. Models for French, Spanish, Chinese, English, and German are available from https://code.google.com/p/mate-tools/.

  5. The interface follows the implementation found in [12], extends it by new methods, and is adapted to the newer Lucene versions 4.x. It has been tested using versions 4.6 to 4.8.1.

  6. Models for English, Arabic, Chinese, French, Spanish, and German are available at http://nlp.stanford.edu/software/tagger.shtml.

  7. For example, for German http://sourceforge.net/projects/jobimtext/files/data/models/de_news70M_pruned.zip/download; based on 70 million sentences from a news corpus extracted using the system described in [1].

References

  1. Biemann, C., Riedl, M.: Text: now in 2D! A framework for lexical expansion with contextual similarity. J. Lang. Model. 1(1), 55–95 (2013)

  2. Bohnet, B.: Very high accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, pp. 89–97. Association for Computational Linguistics, Stroudsburg (2010)

  3. Braschler, M., Ripplinger, B.: How effective is stemming and decompounding for German text retrieval? Inf. Retr. 7(3–4), 291–316 (2004)

  4. Hamp, B., Feldweg, H.: GermaNet - a lexical-semantic net for German. In: Proceedings of ACL Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pp. 9–15 (1997)

  5. Hollink, V., Kamps, J., Monz, C., de Rijke, M.: Monolingual document retrieval for European languages. Inf. Retr. 7(1–2), 33–52 (2004)

  6. Jespersen, O.: The Philosophy of Grammar. Chicago Studies in Ethnomusicology Series. University of Chicago Press, Chicago (1992)

  7. Kraaij, W., Pohlmann, R.E.: Viewing stemming as recall enhancement. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 40–48 (1996)

  8. Leveling, J.: University of Hagen at CLEF 2003: natural language access to the GIRT4 data. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 412–424. Springer, Heidelberg (2004)

  9. Lioma, C., Blanco, R.: Part of speech based term weighting for information retrieval. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 412–423. Springer, Heidelberg (2009)

  10. Lioma, C., van Rijsbergen, C.K.: Part of speech based term weighting for information retrieval. In: Revue Française de Linguistique Appliquée, vol. 1 (2008)

  11. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, vol. 2. Cambridge University Press, Cambridge (2008)

  12. McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Manning Publications Co., Greenwich (2010)

  13. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38, 39–41 (1995)

  14. Seeker, W., Kuhn, J.: Making ellipses explicit in dependency conversion for a German treebank. In: LREC, pp. 3132–3139 (2012)

  15. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Conference of the NAACL on Human Language Technology, NAACL 2003, pp. 173–180. Association for Computational Linguistics, Stroudsburg (2003)

Acknowledgements

This package was created for and within the GeoBib project to facilitate searching the project’s data set, and it will be used in the planned website. GeoBib is funded by the German Federal Ministry of Education and Research (grant no. 01UG1238A-B).

Author information

Corresponding author

Correspondence to Bastian Entrup.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Entrup, B. (2015). (German) Language Processing for Lucene. In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2015. Lecture Notes in Computer Science, vol 9103. Springer, Cham. https://doi.org/10.1007/978-3-319-19581-0_35

  • DOI: https://doi.org/10.1007/978-3-319-19581-0_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19580-3

  • Online ISBN: 978-3-319-19581-0

  • eBook Packages: Computer Science, Computer Science (R0)
