Character-Level Dialect Identification in Arabic Using Long Short-Term Memory

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10762)

Abstract

In this paper, we introduce a neural-network-based sequence-learning approach for the task of Arabic dialect classification. We propose character models based on recurrent neural networks with Long Short-Term Memory (LSTM) to classify short texts, such as tweets, written in different Arabic dialects. The LSTM-based character models can handle long-term dependencies in character sequences and do not require a set of word-level linguistic rules, which is especially useful given the rich morphology of the Arabic language and the lack of strict orthographic rules for its dialects. On the Tunisian Election Twitter dataset, our system achieves a promising average accuracy of 92.2% in distinguishing Modern Standard Arabic from Tunisian dialect. On the Multidialectal Parallel Corpus of Arabic, the proposed character models distinguish six classes, Modern Standard Arabic and five Arabic dialects, with an average accuracy of 63.4%. They clearly outperform a standard word-level approach based on statistical n-grams as well as several other existing systems.
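The authors' experiments were implemented in a Torch-based environment (see the speed test below). As a purely illustrative sketch of the character-level approach summarized above, and not the authors' code, the following PyTorch snippet (layer sizes, class name, and the toy input are our own assumptions) embeds each character of a short text, runs the sequence through an LSTM, and classifies the final hidden state into one of six labels (Modern Standard Arabic plus five dialects):

```python
# Minimal sketch of a character-level LSTM dialect classifier.
# NOT the authors' Torch implementation; all hyperparameters and
# names are illustrative assumptions.
import torch
import torch.nn as nn

class CharLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=6):
        super().__init__()
        # One embedding vector per character (no word-level tokenization).
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer indices of characters
        x = self.embed(char_ids)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])         # logits over the dialect classes

# Toy usage: map characters to indices and classify one short text.
text = "تونس الخضراء"
vocab = {ch: i + 1 for i, ch in enumerate(sorted(set(text)))}  # index 0 = padding
ids = torch.tensor([[vocab[ch] for ch in text]])
model = CharLSTMClassifier(vocab_size=len(vocab) + 1)
logits = model(ids)                      # shape: (1, 6)
```

Working at the character level, the model needs no word segmentation or dialect-specific orthographic rules; the vocabulary is simply the set of characters observed in the training data.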




Author information

Correspondence to Karim Sayadi.

A Speed Test

To give an idea of how quickly the described Torch-based environment processes text records (e.g. tweets), Fig. 3 reports the processing time per tweet (in seconds) during the training phase for the hardware configurations described below.

Fig. 3. Hardware comparison concerning training time

We compared the polyGPU configuration with a macCPU configuration, a MacBook Air 4.2 with a 1.7 GHz Intel Core i5 processor and 4 GB of DDR3 RAM at 1333 MHz, and with a monoGPU configuration, a single NVIDIA GK107 (GeForce GT 740) GPU with 384 CUDA cores, 2 GB of DDR3 VRAM, and a 1.8 Gbps memory clock, deployed in a desktop PC.
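As a rough illustration of how such a per-tweet timing could be collected (not the authors' benchmarking code; the optimizer, batch shapes, and 140-character tweet length are assumptions), one can time each training step and divide by the number of tweets processed, reusing the CharLSTMClassifier sketch from above:

```python
# Hedged sketch: average training time per tweet on toy data.
# Assumes the CharLSTMClassifier defined earlier; the random batches
# below stand in for real mini-batches of character-encoded tweets.
import time
import torch
import torch.nn.functional as F
import torch.optim as optim

model = CharLSTMClassifier(vocab_size=40)
batches = [(torch.randint(1, 40, (32, 140)),   # 32 "tweets", 140 chars each
            torch.randint(0, 6, (32,)))        # 6 dialect labels
           for _ in range(10)]                 # 10 toy batches

optimizer = optim.RMSprop(model.parameters(), lr=1e-3)  # optimizer choice is an assumption
total_time, total_tweets = 0.0, 0

for char_ids, labels in batches:
    start = time.perf_counter()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(char_ids), labels)
    loss.backward()
    optimizer.step()
    total_time += time.perf_counter() - start
    total_tweets += char_ids.size(0)

print(f"{total_time / total_tweets:.4f} s per tweet "
      f"({total_tweets / total_time:.1f} tweets/s)")
```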


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Sayadi, K., Hamidi, M., Bui, M., Liwicki, M., Fischer, A. (2018). Character-Level Dialect Identification in Arabic Using Long Short-Term Memory. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_24


  • DOI: https://doi.org/10.1007/978-3-319-77116-8_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77115-1

  • Online ISBN: 978-3-319-77116-8

  • eBook Packages: Computer Science (R0)
