Abstract
This paper presents the first Native Language Identification (NLI) study for L2 Portuguese. We use a subset of the NLI-PT dataset containing texts written by speakers of five native languages: Chinese, English, German, Italian, and Spanish. We exploit the linguistic annotations available in NLI-PT to extract a range of (morpho-)syntactic features and apply NLI classification methods to predict the authors' native language. The best results were obtained with an ensemble that combines the features, achieving \(54.1\%\) accuracy.
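As a rough illustration of what combining features in an ensemble can look like, the sketch below merges the label predictions of several per-feature classifiers by plurality vote. This is only one possible combination scheme and is not necessarily the one used in the paper; the feature names and the predictions are hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical predictions for three texts, one list per feature type.
# The feature names and labels are illustrative, not taken from the paper's experiments.
predictions_by_feature = {
    "pos_ngrams": ["Spanish", "Chinese", "German"],
    "function_words": ["Spanish", "English", "German"],
    "production_rules": ["Italian", "Chinese", "English"],
}


def plurality_vote(predictions):
    """Combine per-feature label predictions into one label per text."""
    combined = []
    # zip(*...) groups the votes that all classifiers cast for the same text.
    for votes in zip(*predictions.values()):
        # most_common(1) returns the most frequent label;
        # ties go to the label that was encountered first.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


ensemble = plurality_vote(predictions_by_feature)
print(ensemble)
```

In this toy input each text receives two matching votes out of three, so the vote settles on the majority label for every text. A scheme that averages classifier probabilities would need confidence scores rather than hard labels.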
Notes
- 1.
- 2.
- 3. The issues exist as the corpus was not designed specifically for NLI.
- 4. More details about this approach can be found in [21].
- 5. Like previous work, this also includes stop words.
- 6.
- 7. They are also known as Phrase Structure Rules or Production Rules.
References
Malmasi, S.: Native language identification: explorations and applications. Ph.D. thesis (2016)
Malmasi, S., Dras, M.: Multilingual native language identification. In: Natural Language Engineering (2015)
Malmasi, S., Dras, M.: Chinese native language identification. In: Proceedings of EACL. Association for Computational Linguistics, Gothenburg (2014)
Malmasi, S., Dras, M., Temnikova, I.: Norwegian native language identification. In: Proceedings of RANLP, Hissar, Bulgaria, pp. 404–412, September 2015
Malmasi, S., Dras, M.: Arabic native language identification. In: Proceedings of the Arabic Natural Language Processing Workshop (2014)
Block, D., Cameron, D.: Globalization and Language Teaching. Routledge, Abingdon (2002)
Martins, R.T., Hasegawa, R., Nunes, M.G.V., Montilha, G., De Oliveira, O.N.: Linguistic issues in the development of ReGra: a grammar checker for Brazilian Portuguese. Nat. Lang. Eng. 4(4), 287–307 (1998)
Elliot, S.: IntelliMetric: from here to validity. In: Automated Essay Scoring: A Cross-Disciplinary Perspective, pp. 71–86 (2003)
Baptista, J., Costa, N., Guerra, J., Zampieri, M., Cabral, M., Mamede, N.: P-AWL: academic word list for Portuguese. In: Pardo, T.A.S., Branco, A., Klautau, A., Vieira, R., de Lima, V.L.S. (eds.) PROPOR 2010. LNCS (LNAI), vol. 6001, pp. 120–123. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12320-7_15
Mendes, A., Antunes, S., Janssen, M., Gonçalves, A.: The COPLE2 corpus: a learner corpus for Portuguese. In: Proceedings of LREC (2016)
Wong, S.M.J., Dras, M.: Contrastive analysis and native language identification. In: Proceedings of ALTA, Sydney, Australia, pp. 53–61, December 2009
Wong, S.M.J., Dras, M.: Exploiting parse structures for native language identification. In: Proceedings of EMNLP (2011)
Swanson, B., Charniak, E.: Native language detection with tree substitution grammars. In: Proceedings of ACL, Jeju Island, Korea, pp. 193–197, July 2012
Tetreault, J., Blanchard, D., Cahill, A., Chodorow, M.: Native tongues, lost and found: resources and empirical evaluations in native language identification. In: Proceedings of COLING, Mumbai, India, pp. 2585–2602 (2012)
Gebre, B.G., Zampieri, M., Wittenburg, P., Heskes, T.: Improving native language identification with TF-IDF weighting. In: Proceedings of BEA (2013)
Malmasi, S., Dras, M.: Language transfer hypotheses with linear SVM weights. In: Proceedings of EMNLP, pp. 1385–1390 (2014)
Malmasi, S., Dras, M., Johnson, M., Du, L., Wolska, M.: Unsupervised text segmentation based on native language characteristics. In: Proceedings of ACL (2017)
Malmasi, S., Tetreault, J., Dras, M.: Oracle and human baselines for native language identification. In: Proceedings of BEA (2015)
Jarvis, S., Bestgen, Y., Pepper, S.: Maximizing classification accuracy in native language identification. In: Proceedings of BEA (2013)
Malmasi, S., et al.: A report on the 2017 native language identification shared task. In: Proceedings of BEA (2017)
Malmasi, S., Dras, M.: Native language identification using stacked generalization. arXiv preprint arXiv:1703.06541 (2017)
Malmasi, S., Dras, M.: Native language identification with classifier stacking and ensembles. Computational Linguistics (2018)
Wong, S.M.J., Dras, M., Johnson, M.: Exploring adaptor grammars for native language identification. In: Proceedings of EMNLP (2012)
Tsur, O., Rappoport, A.: Using classifier features for studying the effect of native language on the choice of written second language words. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition (2007)
Malmasi, S., Wong, S.M.J., Dras, M.: NLI shared task 2013: MQ submission. In: Proceedings of BEA (2013)
Swanson, B., Charniak, E.: Data driven language transfer hypotheses. EACL 2014, 169 (2014)
Granger, S., Dagneaux, E., Meunier, F., Paquot, M.: International Corpus of Learner English (Version 2). Presses Universitaires de Louvain, Louvain-la-Neuve (2009)
Brooke, J., Hirst, G.: Measuring interlanguage: native language identification with L1-influence metrics. In: Proceedings of LREC (2012)
Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., Chodorow, M.: TOEFL11: a corpus of non-native English. Educational Testing Service, Technical report (2013)
Malmasi, S., Dras, M.: Finnish native language identification. In: Proceedings of ALTA, Melbourne, Australia, pp. 139–144 (2014)
Wang, M., Malmasi, S., Huang, M.: The Jinan Chinese learner corpus. In: Proceedings of BEA (2015)
Tenfjord, K., Meurer, P., Hofland, K.: The ASK corpus: a language learner corpus of Norwegian as a second language. In: Proceedings of LREC (2006)
del Río, I., Zampieri, M., Malmasi, S.: A Portuguese native language identification dataset. In: Proceedings of BEA (2018)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145 (1995)
Malmasi, S., Cahill, A.: Measuring feature diversity in native language identification. In: Proceedings of BEA (2015)
Malmasi, S., Dras, M., Zampieri, M.: LTG at SemEval-2016 Task 11: complex word identification with classifier ensembles. In: Proceedings of SemEval (2016)
Malmasi, S., Zampieri, M., Dras, M.: Predicting post severity in mental health forums. In: Proceedings of CLPsych (2016)
Acknowledgements
We would like to thank the anonymous reviewers for the suggestions and constructive feedback provided.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Malmasi, S., del Río, I., Zampieri, M. (2018). Portuguese Native Language Identification. In: Villavicencio, A., et al. Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science, vol 11122. Springer, Cham. https://doi.org/10.1007/978-3-319-99722-3_12
DOI: https://doi.org/10.1007/978-3-319-99722-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99721-6
Online ISBN: 978-3-319-99722-3
eBook Packages: Computer Science, Computer Science (R0)