Character-Level Dialect Identification in Arabic Using Long Short-Term Memory

Sayadi, Karim; Hamidi, Mansour; Bui, Marc; Liwicki, Marcus; Fischer, Andreas

doi:10.1007/978-3-319-77116-8_24

Conference paper
First Online: 10 October 2018

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10762)

Abstract

In this paper, we introduce a neural network based sequence learning approach for the task of Arabic dialect classification. Character models based on recurrent neural networks with Long Short-Term Memory (LSTM) are suggested to classify short texts, such as tweets, written in different Arabic dialects. The LSTM-based character models can handle long-term dependencies in character sequences and do not require a set of linguistic rules at word-level, which is especially useful for the rich morphology of the Arabic language and the lack of strict orthographic rules for dialects. On the Tunisian Election Twitter dataset, our system achieves a promising average accuracy of 92.2% for distinguishing Modern Standard Arabic from Tunisian dialect. On the Multidialectal Parallel Corpus of Arabic, the proposed character models can distinguish six classes, Modern Standard Arabic and five Arabic dialects, with an average accuracy of 63.4%. They clearly outperform a standard word-level approach based on statistical n-grams as well as several other existing systems.

Author information

Authors and Affiliations

CHArt Laboratory EA 4004, EPHE, PSL Research University, Paris, France
Karim Sayadi & Marc Bui
Department of Informatics, University of Fribourg, 1700, Fribourg, Switzerland
Mansour Hamidi, Marcus Liwicki & Andreas Fischer
Institute for Complex Systems, University of Applied Sciences and Arts Western Switzerland, 1705, Fribourg, Switzerland
Andreas Fischer
OCTO Technology, Paris, France
Karim Sayadi

Corresponding author

Correspondence to Karim Sayadi.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

A Speed Test

To give an idea of how fast the described Torch-based environment processes text records (e.g., tweets), Fig. 3 shows the processing time in seconds per tweet during the training phase.

We compared the polyGPU configuration with two others, macCPU and monoGPU. The macCPU configuration consists of a MacBook Air 4.2 with a 1.7 GHz Intel Core i5 processor and 4 GB of DDR3 RAM at 1333 MHz. The monoGPU configuration consists of a single NVIDIA Corporation GK107 [GeForce GT 740] GPU, deployed in a PC, with 384 CUDA cores, 2 GB of DDR3 VRAM, and a memory clock of 1.8 Gbps.
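A per-tweet timing of this kind could be measured roughly as follows. This is only a minimal Python sketch, not the authors' Torch code: the function name seconds_per_record, the process_batch callback, and the injectable clock parameter are all hypothetical, and the dummy step in the demo merely stands in for one training update.

```python
import time
from typing import Callable, Iterable, Sequence


def seconds_per_record(
    process_batch: Callable[[Sequence[str]], object],
    batches: Iterable[Sequence[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the average wall-clock time, in seconds, spent per record.

    `process_batch` is a hypothetical stand-in for one training step on a
    batch of text records (e.g. tweets). Only the time spent inside it is
    counted, so data loading between batches is excluded.
    """
    total_records = 0
    total_seconds = 0.0
    for batch in batches:
        start = clock()
        process_batch(batch)
        total_seconds += clock() - start
        total_records += len(batch)
    if total_records == 0:
        raise ValueError("no records were processed")
    return total_seconds / total_records


if __name__ == "__main__":
    # Dummy "training step": just touches every character of every record.
    def dummy_step(batch: Sequence[str]) -> int:
        return sum(len(record) for record in batch)

    demo_batches = [["tweet one", "tweet two"], ["tweet three"]]
    print(f"{seconds_per_record(dummy_step, demo_batches):.9f} s per record")
```

Running the same measurement once per hardware configuration would give one number for each of macCPU, monoGPU, and polyGPU. For GPU runs, asynchronous kernels generally need to be synchronized before reading the clock, which this sketch does not attempt.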

About this paper

Cite this paper

Sayadi, K., Hamidi, M., Bui, M., Liwicki, M., Fischer, A. (2018). Character-Level Dialect Identification in Arabic Using Long Short-Term Memory. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_24

DOI: https://doi.org/10.1007/978-3-319-77116-8_24
Published: 10 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77115-1
Online ISBN: 978-3-319-77116-8
eBook Packages: Computer Science, Computer Science (R0)
