Abstract
Constituency parsing is the task of identifying how words combine into phrases and, ultimately, sentences. While constituency parsing for English has seen significant progress in the last few years, tools for constituency parsing in Indonesian remain scarce. In this work, we publish ICON (Indonesian CONstituency treebank), the largest publicly available, manually annotated benchmark constituency treebank for Indonesian to date. It comprises 10,000 sentences, approximately 124,000 constituents, and approximately 182,000 tokens, which is large enough to support the training of state-of-the-art transformer-based models. As part of building the treebank, we review and revise the constituent and POS tagsets used in existing treebanks so that the labels are relevant and suited to the grammatical features of Indonesian. We establish strong baselines on the ICON dataset using the Berkeley Neural Parser with transformer-based pre-trained embeddings; the best performance, an F1 score of 88.85%, comes from our own version of SpanBERT (IndoSpanBERT). We further analyze the predictions of our best-performing model to reveal certain idiosyncrasies of Indonesian that pose challenges for constituency parsing.