Orwell’s 1984—From Simple to Multi-word Units

  • Conference paper

Human Language Technology Challenges for Computer Science and Linguistics (LTC 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8387)

Abstract

In this paper we present an alternative version of the morphosyntactically annotated Serbian translation of 1984. This version follows the basic principles of the MULTEXT-East version, with one addition: the text is also annotated with multi-word units. We describe the resources used for this annotation and explain how they were enriched with multi-word units extracted from the processed text. Finally, we present the format of the alternative version and the benefits gained both from preparing the new resource and from the resource itself.

Notes

  1. All examples in this paper (in English with a Serbian translation) are taken from the novel 1984 whenever a suitable example occurs in the text.

  2. http://igm.univ-mlv.fr/~unitex/

Acknowledgments

This research was supported by the Serbian Ministry of Education and Science (grant No. 178003).

Corresponding author

Correspondence to Cvetana Krstev.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Krstev, C., Vitas, D., Trtovac, A. (2014). Orwell’s 1984—From Simple to Multi-word Units. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_23

  • DOI: https://doi.org/10.1007/978-3-319-08958-4_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08957-7

  • Online ISBN: 978-3-319-08958-4

  • eBook Packages: Computer Science, Computer Science (R0)