Abstract
NooJ’s [1, 2] export function allows users to convert NooJ output from a binary file to a text file format. At present, however, the export function does not fully export intra-word units. I demonstrate a solution to this problem using SANTI-morf [3], a new morphological annotation system for Indonesian, written as a package in NooJ’s Indonesian language module. While the solution is dedicated to Indonesian morphology, I argue that the method can be replicated by other NooJ users facing the same challenge. I devise a new syntactic grammar whose rules capture the morphotactic combinations that form Indonesian polymorphemic words. Dictionaries and morphological grammars are modified so that the new syntactic grammar automatically transfers all morphemes and their associated analytical attributes as single units, like monomorphemic words. Some symbols that are special to NooJ must be replaced by non-standard symbols, as these special symbols are not acceptable in syntactic annotations. This experiment successfully exports more than 99% of word tokens from the test-bed corpus as full morpheme-level annotations. However, concatenating the morphemes into single units and using non-standard symbols obscures morpheme boundaries and annotations, which lowers the readability of the output. A small program is therefore written to improve the readability of the output; it can easily be adapted to users’ anticipated needs.
References
1. Silberztein, M.: NooJ Manual (2003). www.NooJ4nlp.org
2. Silberztein, M.: Formalizing Natural Languages: The NooJ Approach. Wiley, London (2016)
3. Prihantoro: SANTI-morf: a new morphological annotation system for Indonesian. A Ph.D. thesis: forthcoming. Lancaster University Press, Lancaster (2021)
4. Hardie, A.: CQPweb – combining power, flexibility and usability in a corpus analysis tool. Int. J. Corpus Linguist. 17(3), 380–409 (2012)
5. Kilgarriff, A., et al.: The Sketch Engine: ten years on. Lexicography 1(1), 7–36 (2014). https://doi.org/10.1007/s40607-014-0009-9
6. Anthony, L.: Concordancing with AntConc: An introduction to tools and techniques in corpus linguistics. JACET Newsl. (55), 155–185 (2006). Version 3.2.0
7. Scott, M.: WordSmith Manual. Lexical Analysis Software Ltd., Gloucestershire (1996)
8. Brezina, V., Timperley, M., McEnery, T.: #LancsBox, v.4.x [software] (2018). http://corpora.lancs.ac.uk/lancsbox
9. Ide, N., Veronis, J. (eds.): Text Encoding Initiative: Background and Contexts. Computers and the Humanities, vol. 29, p. 1 (1995)
10. Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In: Proceedings of the First International Language Resources and Evaluation Conference, LREC 1998, pp. 463–470. ELRA, Granada (1998)
11. Tadmor, U.: Malay-Indonesian. In: Major World Languages, pp. 791–818. Routledge, New York (2004)
12. Mueller, F.: Indonesian morphology. In: Morphologies of Asia and Africa, pp. 1207–1230. Eisenbrauns, Winona Lake (2007)
Acknowledgments
This paper was written during my PhD candidacy at Lancaster University, Lancaster, United Kingdom. I would like to extend my deepest gratitude to the Indonesia Endowment Fund for Education (Lembaga Pengelola Dana Pendidikan, or LPDP; https://www.lpdp.kemenkeu.go.id/, last accessed 07/08/2021) for fully sponsoring my PhD studies at Lancaster University. I am also highly indebted to Andrew Hardie, my PhD supervisor, for his useful feedback. All errors are mine.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Prihantoro (2021). Tweaking NooJ’s Resources to Export Morpheme-Level or Intra-word Annotations. In: Bigey, M., Richeton, A., Silberztein, M., Thomas, I. (eds) Formalizing Natural Languages: Applications to Natural Language Processing and Digital Humanities. NooJ 2021. Communications in Computer and Information Science, vol 1520. Springer, Cham. https://doi.org/10.1007/978-3-030-92861-2_1
DOI: https://doi.org/10.1007/978-3-030-92861-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92860-5
Online ISBN: 978-3-030-92861-2
eBook Packages: Computer Science, Computer Science (R0)