Arabic Dialect Identification Using a Parallel Multidialectal Corpus

  • Conference paper
Computational Linguistics (PACLING 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 593)

Abstract

We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA), the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization, a method not previously applied to this task. The first experiment is a 6-way multi-dialect classification task, in which we achieve \(74\,\%\) accuracy against a random baseline of \(16.7\,\%\) and demonstrate that meta-classifiers can provide large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as \(94\,\%\), but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (\(76\,\%\)). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that, despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with \(74\,\%\) accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with \(97\,\%\) accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.
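The character- and word-based surface features mentioned in the abstract can be illustrated with a minimal Python sketch. The function names, the n-gram orders, and the whitespace tokenization are assumptions made for illustration only; the abstract does not state the configuration actually used.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams of order n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams of order n (whitespace tokenization)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def surface_features(text, char_orders=(1, 2, 3), word_orders=(1, 2)):
    """Merge character and word n-gram counts into one sparse feature dict.

    The default orders are hypothetical, not taken from the paper.
    Keys are (feature type, order, n-gram) so the two families never collide.
    """
    features = {}
    for n in char_orders:
        for gram, count in char_ngrams(text, n).items():
            features[("char", n, gram)] = count
    for n in word_orders:
        for gram, count in word_ngrams(text, n).items():
            features[("word", n, gram)] = count
    return features
```

A sparse dictionary of this kind could then be vectorized and passed to a linear classifier. Keeping character and word features under separate keys also makes it straightforward to train one classifier per feature type, which is the usual starting point for a stacked meta-classifier.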

Notes

  1. See [2, §2] for a more detailed discussion.

  2. e.g. on short texts such as Tweets, SMS messages and status updates.

  3. Spoken Arabic dialect identification is another area of research, as discussed in [6].

  4. Given that this is a parallel corpus, this is 1,000 sentences per dialect, 6,000 sentences in total.

  5. http://www.csie.ntu.edu.tw/%7Ecjlin/liblinear/.

  6. For a dataset with n items, this is equivalent to n-fold cross-validation.

  7. This contains 13,512 MSA sentences, resulting in a majority class baseline of \(51.89\,\%\).
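The equivalence in note 6 and the random baseline quoted in the abstract can be checked with a short Python sketch. It assumes that "this" in note 6 refers to leave-one-out evaluation, which the note itself does not name; the helper names are hypothetical.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


def leave_one_out_indices(n_items):
    """Yield (train, test) index lists, holding out one item at a time."""
    for held_out in range(n_items):
        train = [i for i in range(n_items) if i != held_out]
        yield train, [held_out]


def random_baseline(n_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / n_classes
```

With k equal to the number of items, every fold holds exactly one item, so the two generators produce identical splits. For six balanced classes, `random_baseline(6)` is about 0.167, matching the 16.7 % random baseline reported for the 6-way task.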

References

  1. Habash, N.Y.: Introduction to Arabic natural language processing. Synth. Lect. Hum. Lang. Technol. 3(1), 1–187 (2010)

  2. Bouamor, H., Habash, N., Oflazer, K.: A multidialectal parallel corpus of Arabic. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014). European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014

  3. Al-Sabbagh, R., Girju, R.: Mining the web for the induction of a dialectical Arabic lexicon. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), Valletta, Malta, May 2010

  4. Diab, M., Albadrashiny, M., Aminian, M., Attia, M., Elfardy, H., Habash, N., Hawwari, A., Salloum, W., Dasigi, P., Eskander, R.: Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon, May 2014

  5. Stede, M.: Lexical choice criteria in language generation. In: Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics, pp. 454–459. Association for Computational Linguistics (1993)

  6. Biadsy, F., Hirschberg, J., Habash, N.: Spoken Arabic dialect identification using phonotactic modeling. In: Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pp. 53–61. Association for Computational Linguistics (2009)

  7. Zaidan, O.F., Callison-Burch, C.: The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 37–41. Association for Computational Linguistics (2011)

  8. Elfardy, H., Diab, M.T.: Sentence level dialect identification in Arabic. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 456–461 (2013)

  9. Zaidan, O.F., Callison-Burch, C.: Arabic dialect identification. Comput. Linguist. 40(1), 171–202 (2014)

  10. Darwish, K., Sajjad, H., Mubarak, H.: Verifiably effective Arabic dialect identification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, October 2014

  11. Brooke, J., Hirst, G.: Measuring interlanguage: native language identification with L1-influence metrics. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, pp. 779–784, May 2012

  12. Malmasi, S., Dras, M.: Arabic native language identification. In: Proceedings of the Arabic Natural Language Processing Workshop (EMNLP 2014). Association for Computational Linguistics, Doha, Qatar, pp. 180–186, October 2014. http://aclweb.org/anthology/W14-3625

  13. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  14. Malmasi, S., Dras, M.: Language identification using classifier ensembles. In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial 2015). Association for Computational Linguistics, Hissar, Bulgaria, September 2015

  15. Malmasi, S., Wong, S.M.J., Dras, M.: NLI shared task 2013: MQ submission. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta, Georgia, pp. 124–133, June 2013. http://www.aclweb.org/anthology/W13-1716

  16. Malmasi, S., Dras, M.: Large-scale native language identification with cross-corpus evaluation. In: Proceedings of NAACL-HLT 2015. Association for Computational Linguistics, Denver, Colorado, pp. 1403–1409, June 2015. http://aclweb.org/anthology/N15-1160

  17. Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992)

  18. Polikar, R.: Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6(3), 21–45 (2006)

  19. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145 (1995)

  20. Malmasi, S., Tetreault, J., Dras, M.: Oracle and human baselines for native language identification. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Denver, Colorado, June 2015

  21. Malmasi, S., Dras, M.: Automatic language identification for Persian and Dari texts. In: Proceedings of the 14th Conference of the Pacific Association for Computational Linguistics (PACLING 2015). Bali, Indonesia, May 2015

  22. Gottron, T., Lipka, N.: A comparison of language identification approaches on short, query-style texts. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 611–614. Springer, Heidelberg (2010)

  23. Malmasi, S., Dras, M.: Language transfer hypotheses with linear SVM weights. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp. 1385–1390, October 2014. http://aclweb.org/anthology/D14-1144

  24. Malmasi, S., Cahill, A.: Measuring feature diversity in native language identification. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Denver, Colorado, pp. 49–55, June 2015. http://aclweb.org/anthology/W15-0606

Acknowledgments

We would like to thank Houda Bouamor for making the MPCA data available. We also thank the three anonymous reviewers for their helpful comments.

Corresponding author

Correspondence to Shervin Malmasi.

Copyright information

© 2016 Springer Science+Business Media Singapore

About this paper

Cite this paper

Malmasi, S., Refaee, E., Dras, M. (2016). Arabic Dialect Identification Using a Parallel Multidialectal Corpus. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_3

  • DOI: https://doi.org/10.1007/978-981-10-0515-2_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-0514-5

  • Online ISBN: 978-981-10-0515-2

  • eBook Packages: Computer Science (R0)
