Tweaking NooJ’s Resources to Export Morpheme-Level or Intra-word Annotations

  • Conference paper
Formalizing Natural Languages: Applications to Natural Language Processing and Digital Humanities (NooJ 2021)

Abstract

NooJ’s [1, 2] export function lets users convert NooJ output from a binary file to a text file format. At present, however, it does not fully export intra-word units. I demonstrate a solution to this problem using SANTI-morf [3], a new morphological annotation system for Indonesian, written as a package in NooJ’s Indonesian language module. Although the solution is tailored to Indonesian morphology, I argue that the method can be replicated by other NooJ users facing the same challenge. First, a new syntactic grammar is devised whose rules capture the morphotactic combinations that form Indonesian polymorphemic words. Second, dictionaries and morphological grammars are modified so that the new syntactic grammar automatically transfers all morphemes and their associated analytical attributes as single units, like monomorphemic words. Some symbols that are special to NooJ must be replaced by non-standard symbols, because these special symbols are not accepted in syntactic annotations. This experiment successfully exports more than 99% of the word tokens in the test-bed corpus as full morpheme-level annotations. However, concatenating the morphemes into single units and using non-standard symbols obscures morpheme boundaries and annotations, which makes the output hard to read. A small program is therefore written to improve the readability of the output; it can easily be adapted to users’ anticipated needs.

References

  1. Silberztein, M.: NooJ Manual (2003). www.NooJ4nlp.org

  2. Silberztein, M.: Formalizing Natural Languages: The NooJ Approach. Wiley, London (2016)

  3. Prihantoro: SANTI-morf: a new morphological annotation system for Indonesian. A Ph.D. thesis: forthcoming. Lancaster University Press, Lancaster (2021)

  4. Hardie, A.: CQPweb – combining power, flexibility and usability in a corpus analysis tool. Int. J. Corpus Linguist. 17(3), 380–409 (2012)

  5. Kilgarriff, A., et al.: The Sketch Engine: ten years on. Lexicography 1(1), 7–36 (2014). https://doi.org/10.1007/s40607-014-0009-9

  6. Anthony, L.: Concordancing with AntConc: An introduction to tools and techniques in corpus linguistics. JACET Newsl. (55), 155–185 (2006). Version 3.2.0

  7. Scott, M.: WordSmith Manual. Lexical Analysis Software Ltd., Gloucestershire (1996)

  8. Brezina, V., Timperley, M., McEnery, T.: #LancsBox, v.4.x [software] (2018). http://corpora.lancs.ac.uk/lancsbox

  9. Ide, N., Veronis, J. (eds.): Text Encoding Initiative: Background and Contexts. Computers and the Humanities, vol. 29, p. 1 (1995)

  10. Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In: Proceedings of the First International Language Resources and Evaluation Conference, LREC 1998, pp. 463–470. ELRA, Granada (1998)

  11. Tadmor, U.: Malay-Indonesian. In: Major World Languages, pp. 791–818. Routledge, New York (2004)

  12. Mueller, F.: Indonesian morphology. In: Morphologies of Asia and Africa, pp. 1207–1230. Eisenbrauns, Winona Lake (2007)


Acknowledgments

This paper was written during my PhD candidacy at Lancaster University, Lancaster, United Kingdom. I would like to extend my deepest gratitude to the Indonesia Endowment Fund for Education (Lembaga Pengelola Dana Pendidikan, or LPDP; https://www.lpdp.kemenkeu.go.id/, last accessed 07/08/2021) for fully sponsoring my PhD studies at Lancaster University. I am also deeply indebted to Andrew Hardie, my PhD supervisor, for his useful feedback. All errors are mine.

Author information

Corresponding author

Correspondence to Prihantoro.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Prihantoro (2021). Tweaking NooJ’s Resources to Export Morpheme-Level or Intra-word Annotations. In: Bigey, M., Richeton, A., Silberztein, M., Thomas, I. (eds) Formalizing Natural Languages: Applications to Natural Language Processing and Digital Humanities. NooJ 2021. Communications in Computer and Information Science, vol 1520. Springer, Cham. https://doi.org/10.1007/978-3-030-92861-2_1

  • DOI: https://doi.org/10.1007/978-3-030-92861-2_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92860-5

  • Online ISBN: 978-3-030-92861-2

  • eBook Packages: Computer Science (R0)
