Abstract
We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA), the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization, a method not previously applied to this task. The first experiment is a 6-way multi-dialect classification task, achieving \(74\,\%\) accuracy against a random baseline of \(16.7\,\%\) and demonstrating that meta-classifiers can yield large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as \(94\,\%\), but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (\(76\,\%\)). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that, despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with \(74\,\%\) accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with \(97\,\%\) accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.
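The surface features named in the abstract, character and word n-grams, can be illustrated with a minimal sketch. The function names, the whitespace tokenization, and the toy transliterated string below are assumptions made for illustration; they are not taken from the paper or the MPCA. The last helper shows where a 6-way random baseline of roughly 16.7 % comes from, assuming balanced classes.

```python
from collections import Counter


def char_ngrams(sentence, n):
    """Count overlapping character n-grams (spaces included) in a sentence."""
    return Counter(sentence[i:i + n] for i in range(len(sentence) - n + 1))


def word_ngrams(sentence, n):
    """Count overlapping word n-grams, assuming simple whitespace tokenization."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def random_baseline(num_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes


# Toy transliterated string (hypothetical; not drawn from the MPCA).
example = "ana mish fahim"
char_trigrams = char_ngrams(example, 3)
word_bigrams = word_ngrams(example, 2)

# Six balanced dialect classes give a chance level of 1/6.
six_way_baseline = random_baseline(6)
```

Counts such as these would typically be turned into sparse feature vectors before being passed to a linear classifier; that step, and any feature weighting, is left out of this sketch.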
Notes
- 1. See [2, §2] for a more detailed discussion.
- 2. E.g., on short texts such as tweets, SMS messages and status updates.
- 3. Spoken Arabic dialect identification is another area of research, as discussed in [6].
- 4. Given that this is a parallel corpus, this is 1,000 sentences per dialect, 6,000 sentences in total.
- 5.
- 6. For a dataset with n items, this is equivalent to n-fold cross-validation.
- 7. This contains 13,512 MSA sentences, resulting in a majority class baseline of \(51.89\,\%\).
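Two of the notes above involve simple arithmetic that a short sketch can make concrete. It assumes that the cross-validation note refers to leave-one-out evaluation, and it back-calculates an approximate dataset size from the reported majority class baseline. The function names are hypothetical, and the implied total is only an estimate, since the published percentage is rounded.

```python
def leave_one_out_splits(n_items):
    """Yield (train_indices, held_out_index) pairs, one fold per item.

    With n items this produces exactly n folds, which is why leave-one-out
    evaluation coincides with n-fold cross-validation.
    """
    for held_out in range(n_items):
        train = [i for i in range(n_items) if i != held_out]
        yield train, held_out


def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Figures stated in the note: 13,512 MSA sentences, 51.89 % baseline.
msa_sentences = 13512
reported_baseline = 0.5189

# Back-calculated estimate, not a figure reported by the source.
implied_total = round(msa_sentences / reported_baseline)

# Hypothetical split used only to check that the arithmetic is consistent.
counts = {"MSA": msa_sentences, "other": implied_total - msa_sentences}
recomputed_baseline = majority_baseline(counts)

folds = list(leave_one_out_splits(4))
```

The implied total lands slightly above 26,000 sentences, which agrees with the "over 26k" figure given in the abstract.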
References
Habash, N.Y.: Introduction to Arabic natural language processing. Synth. Lect. Hum. Lang. Technol. 3(1), 1–187 (2010)
Bouamor, H., Habash, N., Oflazer, K.: A multidialectal parallel corpus of Arabic. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014). European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014
Al-Sabbagh, R., Girju, R.: Mining the web for the induction of a dialectical Arabic lexicon. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), Valletta, Malta, May 2010
Diab, M., Albadrashiny, M., Aminian, M., Attia, M., Elfardy, H., Habash, N., Hawwari, A., Salloum, W., Dasigi, P., Eskander, R.: Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon, May 2014
Stede, M.: Lexical choice criteria in language generation. In: Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics, pp. 454–459. Association for Computational Linguistics (1993)
Biadsy, F., Hirschberg, J., Habash, N.: Spoken Arabic dialect identification using phonotactic modeling. In: Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pp. 53–61. Association for Computational Linguistics (2009)
Zaidan, O.F., Callison-Burch, C.: The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 37–41. Association for Computational Linguistics (2011)
Elfardy, H., Diab, M.T.: Sentence level dialect identification in Arabic. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 456–461 (2013)
Zaidan, O.F., Callison-Burch, C.: Arabic dialect identification. Comput. Linguist. 40(1), 171–202 (2014)
Darwish, K., Sajjad, H., Mubarak, H.: Verifiably effective Arabic dialect identification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, October 2014
Brooke, J., Hirst, G.: Measuring interlanguage: native language identification with L1-influence metrics. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, pp. 779–784, May 2012
Malmasi, S., Dras, M.: Arabic native language identification. In: Proceedings of the Arabic Natural Language Processing Workshop (EMNLP 2014). Association for Computational Linguistics, Doha, Qatar, pp. 180–186, October 2014. http://aclweb.org/anthology/W14-3625
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Malmasi, S., Dras, M.: Language identification using classifier ensembles. In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial 2015). Association for Computational Linguistics, Hissar, Bulgaria, September 2015
Malmasi, S., Wong, S.M.J., Dras, M.: NLI shared task 2013: MQ submission. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta, Georgia, pp. 124–133, June 2013. http://www.aclweb.org/anthology/W13-1716
Malmasi, S., Dras, M.: Large-scale native language identification with cross-corpus evaluation. In: Proceedings of NAACL-HLT 2015. Association for Computational Linguistics, Denver, Colorado, pp. 1403–1409, June 2015. http://aclweb.org/anthology/N15-1160
Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992)
Polikar, R.: Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6(3), 21–45 (2006)
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145 (1995)
Malmasi, S., Tetreault, J., Dras, M.: Oracle and human baselines for native language identification. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Denver, Colorado, June 2015
Malmasi, S., Dras, M.: Automatic language identification for Persian and Dari texts. In: Proceedings of the 14th Conference of the Pacific Association for Computational Linguistics (PACLING 2015). Bali, Indonesia, May 2015
Gottron, T., Lipka, N.: A comparison of language identification approaches on short, query-style texts. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 611–614. Springer, Heidelberg (2010)
Malmasi, S., Dras, M.: Language transfer hypotheses with linear SVM weights. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp. 1385–1390, October 2014. http://aclweb.org/anthology/D14-1144
Malmasi, S., Cahill, A.: Measuring feature diversity in native language identification. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Denver, Colorado, pp. 49–55, June 2015. http://aclweb.org/anthology/W15-0606
Acknowledgments
We would like to thank Houda Bouamor for making the MPCA data available. We also thank the three anonymous reviewers for their helpful comments.
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Malmasi, S., Refaee, E., Dras, M. (2016). Arabic Dialect Identification Using a Parallel Multidialectal Corpus. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_3
DOI: https://doi.org/10.1007/978-981-10-0515-2_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-0514-5
Online ISBN: 978-981-10-0515-2
eBook Packages: Computer Science, Computer Science (R0)