Improving Accessibility to Arabic ETDs Using Automatic Classification

Abdelrahman, Eman; Fox, Edward

doi:10.1007/978-3-031-16802-4_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13541)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Electronic Theses and Dissertations (ETDs) are documents rich in research information that benefit students and future generations of scholars in various disciplines. Research is therefore under way to extract data from ETDs and make them more accessible. However, much of the related research has involved English-language ETDs, while Arabic ETDs remain an untapped source of data, even though the number of Arabic ETDs available digitally is growing. The need to make them more browsable and accessible is growing accordingly. Ways to meet this need include data annotation, indexing, translation, and classification. As the size of the data increases, manual subject classification becomes less feasible, and automatic subject classification becomes essential for the searchability and management of data. There are two main roadblocks to performing automatic subject classification of Arabic ETDs. The first is the lack of a large public corpus of Arabic ETDs for training purposes; the second is the linguistic complexity of the Arabic language, especially in academic documents. This research aims to collect key metadata of Arabic ETDs and to apply different automatic subject classification methodologies. The first goal is aided by scraping data from the AskZad Digital Library. The second goal is achieved by exploring different machine learning and deep learning techniques. The experimental results show that deep learning using pretrained language models yielded the highest accuracy, approximately 0.83, while classical machine learning techniques yielded approximately 0.41 and 0.70 for multiclass classification and one-vs-all classification, respectively. This indicates that pretrained language models assist in language understanding, which is essential for text classification.
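
The two classical setups named in the abstract, multiclass and one-vs-all classification, can be illustrated with a minimal sketch. The abstract does not say which classical models or features were used, so a small Naive Bayes classifier over whitespace tokens is swapped in here purely for illustration. The toy titles, the subject labels, and every function name below are hypothetical and are not taken from the paper or its dataset.

```python
import math
from collections import Counter, defaultdict

# Toy stand-ins for (title, subject) pairs; hypothetical, not the paper's data.
TRAIN = [
    ("contract law court ruling", "law"),
    ("criminal law court procedure", "law"),
    ("heart disease clinical trial", "medicine"),
    ("clinical drug trial patients", "medicine"),
    ("market inflation trade policy", "economics"),
    ("trade market price inflation", "economics"),
]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return the log of prior times likelihood for every class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for word in text.split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def fit_one_vs_all(texts, labels):
    """Train one binary model per subject: that subject versus 'other'."""
    models = {}
    for subject in sorted(set(labels)):
        binary = [subject if y == subject else "other" for y in labels]
        models[subject] = NaiveBayes().fit(texts, binary)
    return models


def predict_one_vs_all(models, text):
    """Pick the subject whose binary model most strongly prefers it to 'other'."""
    def margin(subject):
        scores = models[subject].log_scores(text)
        return scores[subject] - scores["other"]
    return max(models, key=margin)


texts = [t for t, _ in TRAIN]
labels = [y for _, y in TRAIN]
multiclass = NaiveBayes().fit(texts, labels)      # one model, all subjects at once
one_vs_all = fit_one_vs_all(texts, labels)        # one binary model per subject

query = "court ruling on trade law"
print(multiclass.predict(query))
print(predict_one_vs_all(one_vs_all, query))
```

In the multiclass setup a single model chooses among all subjects at once, while in the one-vs-all setup each subject gets its own binary model and the strongest binary decision wins. How the paper scored each setup is not stated in the abstract, so this sketch says nothing about the reported accuracies.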

Notes

1. https://sdl.edu.sa

Acknowledgements

Special thanks go to Dr. Wu Jian from Old Dominion University, Dr. Bill Ingram from Virginia Tech, and the team working on the Institute of Museum and Library Services grant LG-37-19-0078-19. Thanks go to the AskZad Digital Library, which is part of the Saudi Digital Library, from which the dataset was collected. Thanks also go to Fatimah Alotaibi for her support during the project.

Author information

Authors and Affiliations

Department of Computer Science, Virginia Tech, Blacksburg, USA
Eman Abdelrahman & Edward Fox

Authors

Eman Abdelrahman
Edward Fox

Corresponding author

Correspondence to Eman Abdelrahman.

Editor information

Editors and Affiliations

University of Padua, Padua, Italy
Gianmaria Silvello
Universidad Politécnica de Madrid, Madrid, Spain
Oscar Corcho
CNR-ISTI – National Research Council, Pisa, Italy
Paolo Manghi
University of Padua, Padua, Italy
Giorgio Maria Di Nunzio
Linnaeus University, Växjö, Sweden
Koraljka Golub
University of Padua, Padua, Italy
Nicola Ferro
Sapienza University of Rome, Rome, Italy
Antonella Poggi

About this paper

Cite this paper

Abdelrahman, E., Fox, E. (2022). Improving Accessibility to Arabic ETDs Using Automatic Classification. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_18

DOI: https://doi.org/10.1007/978-3-031-16802-4_18
Published: 15 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)
