ICON: A Linguistically-Motivated Large-Scale Benchmark Indonesian Constituency Treebank

Authors:
Ee Suan Lim (AI Singapore, Singapore; ORCID 0000-0001-5417-7897)
Wei Qi Leong (AI Singapore, Singapore; ORCID 0009-0002-0645-1112)
Thanh Ngan Nguyen (AI Singapore, Singapore; ORCID 0009-0009-7995-9866)
Wei Ming Kng (AI Singapore, Singapore; ORCID 0009-0002-4076-9554)
William Chandra Tjhi (AI Singapore, Singapore; ORCID 0009-0009-9861-3545)
Dea Adhista (Prosa.ai, Indonesia; ORCID 0009-0003-4326-0646)
Ayu Purwarianti (Prosa.ai and Institut Teknologi Bandung, Indonesia; ORCID 0000-0002-5016-3700)

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 8, 23 August 2023, Article No. 213, pp. 1–34. https://doi.org/10.1145/3609798

Published: 23 August 2023

Abstract

Constituency parsing is an important task that reveals how words combine to form sentences. While constituency parsing in English has seen significant progress in the last few years, tools for constituency parsing in Indonesian remain few and far between. In this work, we publish ICON (Indonesian CONstituency treebank), to date the largest publicly available, manually annotated benchmark constituency treebank for Indonesian. It contains 10,000 sentences, approximately 124,000 constituents, and approximately 182,000 tokens, which is enough to support the training of state-of-the-art transformer-based models. As part of building the treebank, we review and revamp the constituent and POS tagsets used in existing treebanks to ensure that the labels are relevant and suitable for the grammatical features of Indonesian. We establish strong baselines on the ICON dataset using the Berkeley Neural Parser with transformer-based pre-trained embeddings; the best performance, an F1 score of 88.85%, comes from our own version of SpanBERT (IndoSpanBERT). We further analyze the predictions made by our best-performing model to reveal certain idiosyncrasies of Indonesian that pose challenges for constituency parsing.
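
As an illustration of the reported metric, the sketch below computes a labeled bracketing F1 score between a gold and a predicted tree in Penn Treebank-style bracket notation. It assumes the F1 in the abstract is the conventional labeled bracketing F1 for constituency parsing; the abstract does not spell out the evaluation procedure, and this simplified version ignores details that standard scorers handle, such as punctuation exclusion. The function names, the example sentence, and the tags are hypothetical and are not taken from the ICON tagset.

```python
from collections import Counter


def tokenize(tree):
    """Split a bracketed tree string into parentheses, labels, and words."""
    return tree.replace("(", " ( ").replace(")", " ) ").split()


def labeled_spans(tree):
    """Count (label, start, end) for every phrase-level constituent.

    Preterminals of the form (POS word) are skipped, so only phrase
    brackets are scored.
    """
    tokens = tokenize(tree)
    spans = Counter()
    pos = 0       # index into the token list
    word_idx = 0  # number of words consumed so far

    def parse():
        nonlocal pos, word_idx
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        start = word_idx
        if tokens[pos] != "(":
            # Preterminal node: consume the word itself.
            pos += 1
            word_idx += 1
            is_phrase = False
        else:
            # Phrase node: parse each child in turn.
            while tokens[pos] == "(":
                parse()
            is_phrase = True
        assert tokens[pos] == ")"
        pos += 1
        if is_phrase:
            spans[(label, start, word_idx)] += 1

    parse()
    return spans


def bracket_f1(gold, pred):
    """Labeled bracketing F1 between one gold tree and one predicted tree."""
    g, p = labeled_spans(gold), labeled_spans(pred)
    matched = sum((g & p).values())  # multiset intersection of brackets
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical trees for "Saya makan nasi" ("I eat rice"); tags are illustrative.
gold = "(S (NP (PRP Saya)) (VP (VBT makan) (NP (NN nasi))))"
pred = "(S (NP (PRP Saya)) (VP (VBT makan)) (NP (NN nasi)))"
print(bracket_f1(gold, pred))
```

In this example the predicted tree attaches the object noun phrase outside the verb phrase, so three of the four predicted brackets match the gold brackets and precision, recall, and F1 all equal 0.75. A corpus-level score would sum matched, gold, and predicted bracket counts over all sentences before computing F1, rather than averaging per-sentence scores.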

REFERENCES

Abeillé Anne, Clément Lionel, and Kinyon Alexandra. 2000. Building a treebank for French. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00). European Languages Resources Association. https://aclanthology.org/L00-1175/Google Scholar
Alves Mark J.. 2021. Typological profile of Vietic. In The Languages and Linguistics of Mainland Southeast Asia: A comprehensive guide. De Gruyter Mouton, Berlin, Boston, 469–498. Google ScholarCross Ref
Wayan Arka I.. 2013. On the typology and syntax of TAM in Indonesian. In tense, aspect, mood and evidentiality in languages of Indonesia. Tokyo University of Foreign Studies, 23–40.Google Scholar
Wayan Arka I. and Manning Christopher D.. 1998. Voice and grammatical relations in Indonesian: A new perspective. In Proceedings of the LFG98 Conference. CSLI Publications.Google Scholar
Arwidarasti Jessica, Alfina Ika, and Krisnadhi Adila. 2019. Converting an Indonesian constituency treebank to the Penn Treebank format. In 2019 International Conference of Asian Language Processing (IALP). 331–336. Google ScholarCross Ref
Arwidarasti Jessica, Alfina Ika, and Krisnadhi Adila. 2020. Adjusting Indonesian multiword expression annotation to the Penn Treebank format. In 2020 International Conference of Asian Language Processing (IALP). Google ScholarCross Ref
Bies Ann, Ferguson Mark, Katz Karen, and MacIntyre Robert. 1995. Bracketing Guidelines for Treebank II Style Penn Treebank Project. Technical Report. University of Pennsylvania, Philadelphia, Pennsylvania.Google Scholar
Bies Ann, Mott Justin, Warner Colin, and Kulick Seth. 2012. English Web Treebank. Linguistic Data Consortium. Retrieved from https://catalog.ldc.upenn.edu/LDC2012T13Google Scholar
Brants Sabine, Dipper Stefanie, Hansen Silvia, Lezius Wolfgang, and Smith George. 2002. The TIGER treebank. In Proceedings of the First Workshop on Treebanks and Linguistics Theories (TLT 2002).Google Scholar
Chaer Abdul. 1990. Penggunaan preposisi dan konjungsi bahasa Indonesia. Nusa Indah.Google Scholar
Chen Qian, Zhu Xiaodan, Ling Zhen-Hua, Wei Si, Jiang Hui, and Inkpen Diana. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 1657–1668. Google ScholarCross Ref
Chung Sandra. 2008. Indonesian clause structure from an Austronesian perspective. Lingua, 118, 10 (2008), 1554–1582. Google ScholarCross Ref
Civit Montserrat and Antònia Martí Ma. 2004. Building Cast3LB: A spanish treebank. Research on Language and Computation 2 (2004), 549–574.Google ScholarCross Ref
Clynes Adrian and Deterding David. 2011. Standard Malay (Brunei). Journal of the International Phonetic Association 41, 2 (2011), 259–268. Google ScholarCross Ref
Cole Peter, Hermon Gabriella, and Tjung Yassir. 2006. Is there Pasif Semu in Jakarta Indonesian? Oceanic Linguistics 45, 1 (2006), 64–90. Google ScholarCross Ref
Conneau Alexis, Khandelwal Kartikay, Goyal Naman, Chaudhary Vishrav, Wenzek Guillaume, Guzmán Francisco, Grave Edouard, Ott Myle, Zettlemoyer Luke, and Stoyanov Veselin. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 8440–8451. Google ScholarCross Ref
Crain Stephen and Steedman Mark. 1984. On not being led up the garden path: The use of context by the psychological parser. In Syntactic Theory and How People Parse Sentences. Cambridge University Press.Google Scholar
Cross James and Huang Liang. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1–11. Google ScholarCross Ref
Csendes Dóra, Csirik János, Gyimóthy Tibor, and Kocsor András. 2005. The Szeged Treebank. In TSD 2005: Text, Speech and Dialogue. Springer, Berlin, 123–131. Google ScholarDigital Library
Dahl Östen. 2015. Tense, aspect, mood and evidentiality, linguistics of. International Encyclopedia of the Social & Behavioral Sciences (2015), 210–213. Google ScholarCross Ref
Dalrymple Mary and Mofu Suriel. 2011. Plural semantics, reduplication, and numeral modification in Indonesian. Journal of Semantics 29, 2 (2012), 229–260. Google ScholarCross Ref
Denistia Karlina and Baayen R. Harald. 2022. The morphology of Indonesian: Data and quantitative modeling. In The Routledge Handbook of Asian Linguistics (1st ed.). Routledge, London, United Kingdom, 605–634. Google ScholarCross Ref
Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Volume 1: Long and Short Papers). Association for Computational Linguistics, 4171–4186. Google ScholarCross Ref
Dinakaramani Arawinda, Rashel Fam, Luthfi Andry and Manurung Ruli. 2014. Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus. In 2014 International Conference on Asian Language Processing (IALP). 66–69. Google ScholarCross Ref
Ding Chenchen, Utiyama Masao, and Sumita Eiichiro. 2016. Similar Southeast Asian languages: Corpus-based case study on Thai-Laotian and Malay-Indonesian. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016). The COLING 2016 Organizing Committee, 149–156. http://aclanthology.lst.uni-saarland.de/W16-4614Google Scholar
Dwi Noverini Djenar. 2006. On the multifunctionality of compound prepositions in Indonesian. Oceanic Linguistics 45, 2 (2006), 404–428. Google ScholarCross Ref
Dwi Noverini Djenar. 2018. Constituent order and information structure in Indonesian discourse. In Perspectives on Information Structure in Austronesian Languages. Language Science Press, Berlin, Germany, 177–205. Google ScholarCross Ref
Donohue Mark. 2007. Word order in Austronesian from north to south and west to east. Linguistic Typology 11 (2007), 349–391. Google ScholarCross Ref
Drozdov Andrew, Verga Patrick, Yadav Mohit, Iyyer Mohit, and McCallum Andrew. 2019. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 1129–1141. Google ScholarCross Ref
Dryer Matthew S.. 2013a. Coding of nominal plurality. In Dryer, Matthew S. and Haspelmath, Martin (Eds.), The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology. https://wals.info/chapter/33Google Scholar
Dryer Matthew S.. 2013b. Order of Subject, Object and Verb. In Dryer, Matthew S. and Haspelmath, Martin (Eds.), The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology. https://wals.info/chapter/81Google Scholar
Effendi S.. 1995. Kata Sifat dan Kata Keterangan dalam Bahasa Indonesia. Bahasa dan Sastra 12, 2 (1995), 1–53.Google Scholar
Fei Hao, Wu Shengqiong, Ren Yafeng, Li Fei, and Ji Donghong. 2021. Better combine them together! Integrating syntactic constituency and dependency representations for semantic role labeling. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, 549–559. Google ScholarCross Ref
Filino Mario and Purwarianti Ayu. 2016. Indonesian shift-reduce constituent parser. In 2016 International Conference on Data and Software Engineering (ICoDSE). 1-6. Google ScholarCross Ref
Gabbard Ryan. 2010. Null element restoration. Publicly accessible Penn Dissertations 264. https://repository.upenn.edu/edissertations/264Google Scholar
Gildea Daniel and Jurafsky Daniel. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28, 3 (2002), 245–288. Google ScholarDigital Library
Grangé Philippe. 2011. Aspect in Indonesian: Free markers versus affixed or clitic markers. In Proceedings of the International Workshop on TAM and Evidentiality in Indonesian Languages. Tokyo University of Foreign Studies, 43–63.Google Scholar
Grangé Philippe. 2015. The Indonesian verbal suffix -nya: Nominalization or subordination? Wacana Journal of the Humanities of Indonesia 16, 1 (2015), 133–166. Google ScholarCross Ref
Herlim Robert and Purwarianti Ayu. 2018. Indonesian shift-reduce constituency parser using feature templates & beam search strategy. In 5th International Conference on Advanced Informatics: Concept Theory and Applications (ICAICTA). 54–59. Google ScholarCross Ref
Hirst Graeme. 1984. A semantic process for syntactic disambiguation. In Proceedings of the Fourth AAAI Conference on Artificial Intelligence (AAAI’84). AAAI, 148–152.Google ScholarDigital Library
Hogan Deirdre. 2007. Coordinate noun phrase disambiguation in a generative parsing model. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, 680–687. https://aclanthology.org/P07-1086Google Scholar
Irmawati Budi, Shindo Hiroyuki, and Matsumoto Yuji. 2017. A dependency annotation scheme to extract syntactic features in Indonesian sentences. International Journal of Technology 8, 5 (2017), 957–967. Google ScholarCross Ref
Jawahar Ganesh, Sagot Benoît, and Seddah Djamé. 2019. What Does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 3651–3657. Google ScholarCross Ref
Jeoung Helen. 2020. Categorial ambiguity in mau, suka, and other Indonesian predicates. Language 96, 3 (2020), 157–172. Google ScholarCross Ref
Jiang Fan and Cohn Trevor. 2022. Incorporating Constituent Syntax for Coreference Resolution. arXiv. Google ScholarCross Ref
Jiang Ming and Diesner Jana. 2019. A constituency parsing tree based method for relation extraction from abstracts of scholarly publications. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13). Association for Computational Linguistics, 186–191. Google ScholarCross Ref
Joshi Mandar, Chen Danqi, Liu Yinhan, Weld Daniel S., Zettlemoyer Luke, and Levy Omer. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8 (2020), 64–77. Google ScholarCross Ref
Joshi Vidur, Peters Matthew, and Hopkins Mark. 2018. Extending a parser to distant domains using a few dozen partially annotated examples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 1190–1199. Google ScholarCross Ref
Judge John, Cahill Aoife, and van Genabith Josef. 2006. QuestionBank: Creating a corpus of parse-annotated questions. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 497–504. https://aclanthology.org/P06-1063/Google Scholar
Jurafsky Daniel and Martin James H.. 2009a. Constituency parsing. In Speech and Language Processing (2nd ed.). Pearson Prentice Hall, United States, 259–279.Google Scholar
Jurafsky Daniel and Martin James H.. 2009b. Dependency parsing. In Speech and Language Processing (2nd ed.). Pearson Prentice Hall, United States, 280–304.Google Scholar
Kaplan Jared, McCandlish Sam, Henighan Tom, Brown Tom B., Chess Benjamin, Child Rewon, Gray Scott, Radford Alec, Wu Jeffrey, and Amodei Dario. 2020. Scaling laws for neural language models. arXiv. Google ScholarCross Ref
Kasami Tadao. 1965. An Efficient Recognition and Syntax-analysis Algorithm for Context-free Languages. Technical Report. Air Force Cambridge Research Lab, Bedford, MA.Google Scholar
Keraf Gorys. 1984. Tatabahasa Indonesia. Nusa Indah.Google Scholar
Kim Min-joo. 2002. Does Korean have adjectives? MIT Working Papers in Linguistics 43, (2002), 71–89.Google Scholar
Kim Jin-Dong, Ohta Tomoko, Tateisi, Yuka and Tsujii Jun'ichi. 2003. GENIA corpus–semantically annotated corpus for bio-textmining. Bioinformatics (Oxford, England), 19 Suppl. 1, i180–i182. Google ScholarCross Ref
Kim Yoon, Rush Alexander, Yu Lei, Kuncoro Adhiguna, Dyer Chris, and Melis Gábor. 2019. Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics. 1105–1117. Google ScholarCross Ref
Kitaev Nikita, Cao Steven, and Klein Dan. 2019. Multilingual constituency parsing with self-attention and pre-training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 3499–3505. Google ScholarCross Ref
Kitaev Nikita and Klein Dan. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. 2676–2686. Google ScholarCross Ref
Kitaev Nikita and Klein Dan. 2020. Tetra-tagging: Word-synchronous parsing with linear-time inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 6255–6261. Google ScholarCross Ref
Klein Dan and Manning Christopher D.. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 128–135. Google ScholarDigital Library
Koto Fajri, Rahimi Afshin, Lau Jey Han, and Baldwin Timothy. 2020. IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 757–770. Google ScholarCross Ref
Kridalaksana Harimurti. 1986. Kelas Kata dalam Bahasa Indonesia. Gramedia.Google Scholar
Kudo Taku and Richardson John. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 66–71. Google ScholarCross Ref
Li Jun, Cao Yifan, Cai Jiong, Jiang Yong, and Tu Kewei. 2020. An empirical comparison of unsupervised constituency parsing methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 3278–3283. Google ScholarCross Ref
Li Zuchao, Parnow Kevin, and Zhao Hai. 2022. Incorporating rich syntax information in Grammatical Error Correction. Information Processing and Management, 59, 3 (2022). Google ScholarDigital Library
Li Zuchao, Zhao Hai, He Shexia, and Cai Jiaxun. 2021. Syntax role for neural semantic role labeling. Computational Linguistics, 47, 3 (2021), 529–574. Google ScholarCross Ref
Lim Ee Suan, Leong Wei Qi, Nguyen Ngan Thanh, Adhista Dea, Kng Wei Ming, Tjhi William Chandra, and Purwarianti Ayu. 2023. ICON: Building a large-scale benchmark constituency treebank for the Indonesian language. In Proceedings of the 21^st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023). Association for Computational Linguistics. 37–53. https://aclanthology.org/2023.tlt-1.5/Google Scholar
Victoria Lin Xi, Mihaylov Todor, Artetxe Mikel, Wang Tianlu, Chen Shuohui, Simig Daniel, Ott Myle, Goyal Naman, Bhosale Shruti, Du Jingfei, Pasunuru Ramakanth, Shleifer Sam, Singh Koura Punit, Chaudhary Vishrav, O'Horo Brian, Wang Jeff, Zettlemoyer Luke, Kozareva Zornitsa, Diab Mona, Stoyanov Veselin, and Li Xian. 2021. Few-shot Learning with Multilingual Language Models. arXiv. Google ScholarCross Ref
Liu Pengfei, Yuan Weizhe, Fu Jinlan, Jiang Zhengbao, Hayashi Hiroaki, and Neubig Graham. 2022. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys (2022). Google ScholarDigital Library
Ma Chunpeng, Tamura Akihiro, Utiyama Masao, Zhao Tiejun, and Sumita Eiichiro. 2018. Forest-based neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. 1253–1263. Google ScholarCross Ref
Maamouri Mohamed and Bies Ann. 2004. Developing an Arabic Treebank: Methods, guidelines, procedures, and tools. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. COLING, 2–9. https://aclanthology.org/W04-1602Google Scholar
Mahdi Waruno. 2012. Distinguishing cognate homonyms in Indonesian. Oceanic Linguistics 51, 2 (2012), 402–449.Google Scholar
Maier Wolfgang, Kübler Sandra, Hinrichs Erhard, and Krivanek Julia. 2012. Annotating coordination in the penn treebank. In Proceedings of the Sixth Linguistic Annotation Workshop. Association for Computational Linguistics, 166–174. https://aclanthology.org/W12-3624Google ScholarDigital Library
Manning Christopher, Surdeanu Mihai, Bauer John, Finkel Jenny, Bethard Steven, and McClosky David. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, 55–60. Google ScholarCross Ref
Marcus Mitchell P., Kim Grace, Marcinkiewicz Mary Ann, MacIntyre Robert, Bies Ann, Ferguson Mark, Katz Karen, and Schasberger Britta. 1994. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. https://aclanthology.org/H94-1020Google ScholarDigital Library
Marcus Mitchell P., Santorini Beatrice, and Marcinkiewicz Mary Ann. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19, 2 (1993), 313–330. https://aclanthology.org/J93-2004Google ScholarDigital Library
de Marneffe Marie-Catherine, Manning Christopher, Nivre Joakim, and Zeman Daniel. 2021. Universal Dependencies. Computational Linguistics 47, 2 (2021), 255–308. Google Scholar
Meng Fandong, Xie Jun, Song Linfeng, Lü Yajuan, and Liu Qun. 2013. Translation with source constituency and dependency trees. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1066–1076. https://aclanthology.org/D13-1108Google Scholar
Moeliono Anton M., Lapoliwa Hans, Alwi Hasan, Satrya Tjatur Wisnu Sasangka Sry, and Sugiyono. 2017. Tata Bahasa Baku Bahasa Indonesia Edisi Keempat. Badan Pengembangan dan Pembinaan Bahasa, Kementerian Pendidikan dan Kebudayaan. Jakarta. https://repositori.kemdikbud.go.id/16351/Google Scholar
Moeljadi David. 2017. Building JATI: A treebank for Indonesian. In Proceedings of the 4th Atma Jaya Conference on Corpus Studies. https://hdl.handle.net/10220/46580Google Scholar
Moeljadi David, Bond Francis, and Song Sanghoun. 2015. Building an HPSG-based Indonesian Resource Grammar (INDRA). In Proceedings of the Grammar Engineering Across Frameworks (GEAF) 2015 Workshop. Association for Computational Linguistics, 9–16. http://aclweb.org/anthology/W/W15/W15-3302.pdfGoogle Scholar
Moeljadi David, Kurniawan Aditya, and Goswam Debaditya. 2019. Building Cendana: A treebank for informal Indonesian. In The 33rd Pacific Asia Conference on Language, Information and Computation, 156-164. http://hdl.handle.net/2065/00063897Google Scholar
Montemagni Simonetta, Barsotti F., Battista Marco, Calzolari Nicoletta, Corazzari Ornella, Zampolli Antonio, Fanciulli F., Massetani M., Raffaelli Remo, Basili Roberto, Pazienza Maria Teresa, Saracino D., Zanzotto Fabio, Mana Nadia, Pianesi Fabio, and Delmonte Rodolfo. 2000. The Italian Syntactic-Semantic Treebank: Architecture, annotation, tools and evaluation. In Proceedings of the COLING-2000 Workshop on Linguistically Interpreted Corpora. International Committee on Computational Linguistics, 18–27. https://aclanthology.org/W00-1903Google Scholar
Mrini Khalil, Dernoncourt Franck, Tran Quan Hung, Bui Trung, Chang Walter, and Nakashole Ndapa. 2020. Rethinking self-attention: Towards interpretability in neural parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 731–742. Google ScholarCross Ref
Musgrave Simon. 2013. Functional categories in the syntax and semantics of Malay. In tense, aspect, mood, and evidentiality in languages of Indonesia. PKBB Universitas Katolik Indonesia Atma Jaya, Jakarta, 135–152.Google Scholar
Franciscus Xaverius Nadar. 1996. A comparative study of the Indonesian and English articles. Humaniora, 3 (1996), 47–56. Google ScholarCross Ref
Ng Hwee Tou, Wu Siew Mei, Wu Yuanbin, Hadiwinoto Christian, and Tetreault Joel. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task. Association for Computational Linguistics, 1–12. https://aclanthology.org/W13-3601Google Scholar
Nguyen Minh Van, Lai Viet Dac, Veyseh Amir Pouran Ben, and Nguyen Thien Huu. 2021. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, 80–90. Google ScholarCross Ref
Nomoto Hiroki, Choi Hannah, Moeljadi David, and Bond Francis. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. In Proceedings of the LREC 2018 Workshop “The 13th Workshop on Asian Language Resources”. European Language Resources Association (ELRA), 36–43. http://lrec-conf.org/workshops/lrec2018/W29/pdf/8_W29.pdfGoogle Scholar
Nomoto Hiroki. 2022. Kyokushoushugi ni motoduku heiretsu tsuriibanku no kouchiku [Building a parallel treebank based on minimalism]. In Proceedings of the 28th Annual Meeting of the Association for Natural Language Processing. The Association for Natural Language Processing, 103–107. https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E1-4.pdfGoogle Scholar
Peters Matthew E., Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton, and Zettlemoyer Luke. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2227–2237. Google ScholarCross Ref
Petrov Slav and Klein Dan. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Association for Computational Linguistics, 404–411. https://aclanthology.org/N07-1051Google Scholar
Pittayaporn Pittayawat. 2021. Typological profile of Kra-Dai languages. In The Languages and Linguistics of Mainland Southeast Asia: A Comprehensive Guide. De Gruyter Mouton, Berlin, Boston, 433–468. Google ScholarCross Ref
Prolo Carlos A.. 2006. Handling unlike coordinated phrases in TAG by mixing syntactic category and grammatical function. In Proceedings of the 8th International Workshop on Tree Adjoining Grammar and Related Formalisms. Association for Computational Linguistics. 137–140. https://aclanthology.org/W06-1520Google Scholar
Punyakanok Vasin, Roth Dan, and Yih Wen-tau. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34, 2 (2008), 257–287. Google ScholarDigital Library
Qi Peng, Zhang Yuhao, Zhang Yuhui, Bolton Jason, and Manning Christopher D.. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics. 101–108. Google ScholarCross Ref
Ramlan M.. 1980. Kata depan atau preposisi dalam bahasa Indonesia. U. P. Karyono.Google Scholar
Sagae Kenji and Lavie Alon. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology. Association for Computational Linguistics. 125–132. https://aclanthology.org/W05-1513Google ScholarDigital Library
Sajarwa. 2019. The translation of durative aspect of French into Indonesian. In Proceedings of the Fifth Prasasti International Seminar on Linguistics (PRASASTI 2019). Atlantis Press, 393–397. Google ScholarCross Ref
Santorini Beatrice. 1990. Part-of-speech Tagging Guidelines for the Penn Treebank Project. Technical Report. University of Pennsylvania, Philadelphia, Pennsylvania.Google Scholar
Sasangka Sry Satriya Tjatur Wisnu, Indiyatini Titik, and Widjaja Nantje Harijati. 2000. Adjektiva dan Adverbia dalam Bahasa Indonesia. Pusat Bahasa Departemen Pendidikan Nasional Jakarta.Google Scholar
Seddah Djamé, Tsarfaty Reut, Kübler Sandra, Candito Marie, Choi Jinho D., Farkas Richárd, Foster Jennifer, Goenaga Iakes, Gojenola Koldo, Goldberg Yoav, Green Spence, Habash Nizar, Kuhlmann Marco, Maier Wolfgang, Nivre Joakim, Przepiórkowski Adam, Roth Ryan, Seeker Wolfgang, Versley Yannick, Vincze Veronika, Woliński Marcin, Wróblewska Alina, and Villemonte de la Clérgerie Eric. 2013. Overview of the SPMRL 2013 Shared Task: A cross-framework evaluation of parsing morphologically rich languages. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically Rich Languages. Association for Computational Linguistics, 146–182. https://aclanthology.org/W13-4917Google Scholar
Seginer Yoav. 2007. Fast unsupervised incremental parsing. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, 384–391. https://aclanthology.org/P07-1049Google Scholar
Shen Yikang, Lin Zhouhan, Huang Chin Wei, and Courville Aaron. 2018. Neural language modeling by jointly learning syntax and lexicon. International Conference on Learning Representations. Google ScholarCross Ref
Sima'an Khalil, Itai Alon, Winter Yoad, Altman Alon, and Nativ Noa. 2001. Building a tree-bank of modern Hebrew text. Traitement Automatique des Langues 42 (2001), 347–380.Google Scholar
Neil Sneddon James. 2003. Diglossia in Indonesian. Bijdragen tot de Taal-, Land- en Volkenkunde 159, 4 (2003), 519–549. https://www.jstor.org/stable/27868068Google Scholar
Sneddon James Neil, Adelaar Alexander, Djenar Dwi Noverini, and Ewing Michael C.. 2010. Indonesian Reference Grammar, 2nd edition. Allen & Unwin.Google Scholar
Stack Maggie. 2005. Word order and intonation in Indonesian. In Lexical Semantic Ontology Working Papers in Linguistics 5: Proceedings of Workshop in General Linguistics. Linguistics Student Organization, 168–182.Google Scholar
Stern Mitchell, Andreas Jacob, and Klein Dan. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. 818–827. Google ScholarCross Ref
Taylor Ann, Marcus Mitchell, and Santorini Beatrice. 2003. The Penn Treebank: An Overview. In Abeillé A. (Eds). Treebanks. Text, Speech and Language Technology, Volume 20. Springer, Dordrecht, 5–22. Google ScholarCross Ref
Teeuw Alex. 1962. Some problems in the study of word-classes in Bahasa Indonesia. Lingua, 11 (1962), 409–421. Google ScholarCross Ref
Telljohann Heike, Hinrichs Erhard, and Kübler Sandra. 2004. The Tüba-D/Z Treebank: Annotating German with a context-free backbone. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04). European Language Resources Association. https://aclanthology.org/L04-1096/Google Scholar
Le Quang Thang Hiroshi Noji, and Miyao Yusuke. 2015. Optimal shift-reduce constituent parsing with structured perceptron. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics. 1534–1544. Google ScholarCross Ref
Thim Stefan. 2012. Phrasal verbs: The English verb-particle construction and its history (Topics in English Linguistics 78). Mouton de Gruyter.
Kyaw Thu Ye, Pa Win Pa, Utiyama Masao, Finch Andrew, and Sumita Eiichiro. 2016. Introducing the Asian Language Treebank (ALT). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). European Language Resources Association, 1574–1578. https://aclanthology.org/L16-1249
Tian Yuanhe, Song Yan, Xia Fei, and Zhang Tong. 2020. Improving constituency parsing with span attention. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 1691–1703.
Tjia Johnny. 2015. Grammatical relations and grammatical categories in Malay; The Indonesian prefix meN- revisited. Wacana 16, 1 (2015), 105–132.
Van Bik Kenneth. 2021. Typological profile of Kuki-Chin languages. In The Languages and Linguistics of Mainland Southeast Asia: A Comprehensive Guide. De Gruyter Mouton, Berlin, Boston, 369–402.
Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Lukasz, and Polosukhin Illia. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017).
Watanabe Taro and Sumita Eiichiro. 2015. Transition-based neural constituent parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 1169–1179.
Weischedel Ralph, Ayuso Damaris, Bobrow R., Boisen Sean, Ingria Robert, and Palmucci Jeff. 1991. Partial parsing: A report on work in progress. In Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991. 204–209.
Weischedel Ralph, Palmer Martha, Marcus Mitchell, Hovy Eduard, Pradhan Sameer, Ramshaw Lance, Xue Nianwen, Taylor Ann, Kaufman Jeff, Franchini Michelle, El-Bachouti Mohammed, Belvin Robert, and Houston Ann. 2013. OntoNotes Release 5.0. Linguistic Data Consortium. Retrieved from https://catalog.ldc.upenn.edu/LDC2013T19
Wilie Bryan, Vincentio Karissa, Winata Genta Indra, Cahyawijaya Samuel, Li Xiaohong, Lim Zhi Yuan, Soleman Sidik, Mahendra Rahmad, Fung Pascale, Bahar Syafri, and Purwarianti Ayu. 2020. IndoNLU: benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 843–857. https://aclanthology.org/2020.aacl-main.85
Woliński Marcin and Hajnicz Elżbieta. 2021. Składnica: A constituency treebank of Polish harmonised with the Walenty Valency Dictionary. Language Resources and Evaluation 55 (2021), 209–239.
Wu Shijie and Dredze Mark. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 833–844.
Xia Qingrong, Zhang Bo, Wang Rui, Li Zhenghua, Zhang Yue, Huang Fei, Si Luo, and Zhang Min. 2021. A unified span-based approach for opinion mining with syntactic constituents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1795–1804.
Xu Jiacheng and Durrett Greg. 2019. Neural extractive text summarization with syntactic compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 3292–3303.
Xue Linting, Constant Noah, Roberts Adam, Kale Mihir, Al-Rfou Rami, Siddhant Aditya, Barua Aditya, and Raffel Colin. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 483–498.
Yang Jian, Ma Shuming, Zhang Dongdong, Li Zhoujun, and Zhou Ming. 2020. Improving neural machine translation with soft template prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 5979–5989.
Yang Sen, Cui Leyang, Ning Ruoxi, Wu Di, and Zhang Yue. 2022. Challenges to open-domain constituency parsing. In Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, 112–127. https://aclanthology.org/2022.findings-acl.11
Yıldız Olcay Taner, Solak Ercan, Çandır Şemsinur, Ehsani Razieh, and Görgün Onur. 2015. Constructing a Turkish constituency parse treebank. In Information Sciences and Systems 2015. Springer, Cham, 339–347.
Younger Daniel H. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control 10, 2 (1967), 189–208.
Zhang Meishan. 2020. A survey of syntactic-semantic parsing based on constituent and dependency structures. Science China Technological Sciences 63 (2020), 1898–1920.
Zhang Yian. 2020. Latent tree learning with ordered neurons: What parses does it produce? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 119–125.
Zhou Junru and Zhao Hai. 2019. Head-driven phrase structure grammar parsing on Penn Treebank. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2396–2408.
Zhu Fangyi, Tan Lok You, Ng See-Kiong, and Bressan Stéphane. 2022. Syntax-informed question answering with heterogeneous graph transformer. In Database and Expert Systems Applications: 33rd International Conference, DEXA 2022, Vienna, Austria, August 22–24, 2022, Proceedings, Part I. Springer-Verlag, Berlin, 17–31.

Index Terms

ICON: A Linguistically-Motivated Large-Scale Benchmark Indonesian Constituency Treebank
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 8
August 2023
373 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3615980
Editor:
Imed Zitouni
Google, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 August 2023
- Online AM: 25 July 2023
- Accepted: 6 July 2023
- Revised: 16 May 2023
- Received: 21 November 2022
Author Tags
Constituency parsing
treebank
Indonesian corpus
corpus annotation
neural parser
deep learning
Qualifiers
- research-article