ICON: A Linguistically-Motivated Large-Scale Benchmark Indonesian Constituency Treebank

Published: 23 August 2023

Abstract

Constituency parsing is an important task that reveals how words combine to form sentences. While constituency parsing for English has progressed significantly in the last few years, tools for constituency parsing in Indonesian remain few and far between. In this work, we publish ICON (Indonesian CONstituency treebank), to date the largest publicly available, manually annotated benchmark constituency treebank for Indonesian, with 10,000 sentences and approximately 124,000 constituents and 182,000 tokens, which can support the training of state-of-the-art transformer-based models. As part of building the treebank, we review and revise the constituent and POS tagsets used in existing treebanks so that the labels are relevant and suitable for the grammatical features of Indonesian. We establish strong baselines on the ICON dataset using the Berkeley Neural Parser with transformer-based pre-trained embeddings; the best performance, an F1 score of 88.85%, comes from our own version of SpanBERT (IndoSpanBERT). We further analyze the predictions of our best-performing model to reveal idiosyncrasies of Indonesian that pose challenges for constituency parsing.
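For readers unfamiliar with how an F1 score is obtained for a constituency parser, the following is a minimal sketch of labeled bracketing F1 over a single gold/predicted tree pair. It is an illustration only, not the evaluation procedure used for ICON: the function names, the example sentence, and the tags (S, NP, VP, ADJP, and the POS labels) are assumptions chosen for the example and are not taken from the ICON tagset. Standard evaluation tools apply further conventions that this sketch leaves out.

```python
from collections import Counter


def labeled_spans(tree_str):
    """Collect (label, start, end) for every phrase-level constituent.

    Hypothetical helper: expects a Penn-style bracketed string. Preterminal
    (POS tag) nodes, which dominate a single word directly, are skipped.
    """
    tokens = tree_str.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []  # entries: [label, start word index, has a bracketed child]
    pos = 0     # index of the next word in the sentence
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after "(" is the node label.
            stack.append([tokens[i + 1], pos, False])
            i += 2
        elif tok == ")":
            label, start, has_child = stack.pop()
            if has_child:  # phrase-level node, not a POS tag
                spans[(label, start, pos)] += 1
            if stack:
                stack[-1][2] = True
            i += 1
        else:
            pos += 1  # a word
            i += 1
    return spans


def bracket_f1(gold_tree, pred_tree):
    """Harmonic mean of labeled bracket precision and recall."""
    gold, pred = labeled_spans(gold_tree), labeled_spans(pred_tree)
    matched = sum((gold & pred).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative trees for "Saya membaca buku baru" ("I read a new book").
gold = "(S (NP (PRP Saya)) (VP (VB membaca) (NP (NN buku) (JJ baru))))"
pred = "(S (NP (PRP Saya)) (VP (VB membaca) (NP (NN buku)) (ADJP (JJ baru))))"
score = bracket_f1(gold, pred)
```

In this example the predicted tree shares three of its five brackets with the gold tree, which has four, so precision is 3/5, recall is 3/4, and F1 is 2/3. A corpus-level score would normally sum matched, gold, and predicted bracket counts over all sentences before computing precision and recall, rather than averaging per-sentence scores.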

  118. Tjia Johnny. 2015. Grammatical relations and grammatical categories in Malay; The Indonesian prefix meN- revisited. Wacana 16, 1 (2015), 105132.Google ScholarGoogle Scholar
  119. Van Bik Kenneth. 2021. Typological profile of Kuki-Chin languages. In The Languages and Linguistics of Mainland Southeast Asia: A Comprehensive Guide. De Gruyter Mouton, Berlin, Boston, 369402. Google ScholarGoogle ScholarCross RefCross Ref
  120. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Lukasz, Polosukhin Illia. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017). Google ScholarGoogle ScholarCross RefCross Ref
  121. Watanabe Taro and Sumita Eiichiro. 2015. Transition-based neural constituent parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 11691179. Google ScholarGoogle ScholarCross RefCross Ref
  122. Weischedel Ralph, Ayuso Damaris, Bobrow R., Boisen Sean, Ingria Robert, and Palmucci Jeff. 1991. Partial parsing: A report on work in progress. In Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991. 204209. Google ScholarGoogle ScholarDigital LibraryDigital Library
  123. Weischedel Ralph, Palmer Martha, Marcus Mitchell, Hovy Eduard, Pradhan Sameer, Ramshaw Lance, Xue Nianwen, Taylor Ann, Kaufman Jeff, Franchini Michelle, El-Bachouti Mohammed, Belvin Robert, and Houston Ann. 2013. OntoNotes Release 5.0. Linguistic Data Consortium. Retrieved from https://catalog.ldc.upenn.edu/LDC2013T19Google ScholarGoogle Scholar
  124. Wilie Bryan, Vincentio Karissa, Winata Genta Indra, Cahyawijaya Samuel, Li Xiaohong, Lim Zhi Yuan, Soleman Sidik, Mahendra Rahmad, Fung Pascale, Bahar Syafri, and Purwarianti Ayu. 2020. IndoNLU: benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 843857. https://aclanthology.org/2020.aacl-main.85Google ScholarGoogle Scholar
  125. Woliński Marcin and Hajnicz Elżbieta. 2021. Składnica: A constituency treebank of Polish harmonised with the Walenty Valency Dictionary. Language Resources and Evaluation, 55 (2021), 209239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Wu Shijie and Dredze Mark. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 833844. Google ScholarGoogle ScholarCross RefCross Ref
  127. Xia Qingrong, Zhang Bo, Wang Rui, Li Zhenghua, Zhang Yue, Huang Fei, Si Luo, and Zhang Min. 2021. A unified span-based approach for opinion mining with syntactic constituents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. 17951804. Google ScholarGoogle ScholarCross RefCross Ref
  128. Xu Jiacheng and Durrett Greg. 2019. Neural extractive text summarization with syntactic compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics. 32923303. Google ScholarGoogle ScholarCross RefCross Ref
  129. Xue Linting, Constant Noah, Roberts Adam, Kale Mihir, Al-Rfou Rami, Siddhant Aditya, Barua Aditya, and Raffel Colin. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. 483498. Google ScholarGoogle ScholarCross RefCross Ref
  130. Yang Jian, Ma Shuming, Zhang Dongdong, Li Zhoujun, and Zhou Ming. 2020. Improving neural machine translation with soft template prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 59795989. Google ScholarGoogle ScholarCross RefCross Ref
  131. Yang Sen, Cui Leyang, Ning Ruoxi, Wu Di, and Zhang Yue. 2022. Challenges to open-domain constituency parsing. In Findings of the Association for Computational Linguistics: ACL. Association for Computational Linguistics. 112127. https://aclanthology.org/2022.findings-acl.11Google ScholarGoogle Scholar
  132. Yıldız Olcay Taner, Solak Ercan, Çandır Şemsinur, Ehsani Razieh, and Görgün Onur. 2015. Constructing a Turkish constituency parse treebank. In Information Sciences and Systems 2015. Springer, Cham, 339347. Google ScholarGoogle ScholarCross RefCross Ref
  133. Younger Daniel H.. 1967. Recognition and parsing of context-free languages in time n3. Information and Control 10, 2 (1967), 189208. Google ScholarGoogle ScholarCross RefCross Ref
  134. Zhang Meishan. 2020. A survey of syntactic-semantic parsing based on constituent and dependency structures. Science China Technological Sciences 63 (2020), 1989–1920. Google ScholarGoogle ScholarCross RefCross Ref
  135. Zhang Yian. 2020. Latent tree learning with ordered neurons: What parses does it produce? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics. 119125. Google ScholarGoogle ScholarCross RefCross Ref
  136. Zhou Junru and Zhao Hai. 2019. Head-driven phrase structure grammar parsing on Penn Treebank. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 23962408. Google ScholarGoogle ScholarCross RefCross Ref
  137. Zhu Fangyi, Tan Lok You, Ng See-Kiong, and Bressan Stéphane. 2022. Syntax-informed question answering with heterogeneous graph transformer. In Database and Expert Systems Applications: 33rd International Conference, DEXA 2022, Vienna, Austria, August 22–24, 2022, Proceedings, Part I. Springer-Verlag, Berlin, 1731. Google ScholarGoogle ScholarDigital LibraryDigital Library

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 8
      August 2023
      373 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3615980

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 August 2023
      • Online AM: 25 July 2023
      • Accepted: 6 July 2023
      • Revised: 16 May 2023
      • Received: 21 November 2022

      Qualifiers

      • research-article