Abstract
Electronic Theses and Dissertations (ETDs) are documents rich in research information that benefit students and future generations of scholars across disciplines. Research is therefore under way to extract data from ETDs and make them more accessible. However, most of this research has involved English-language ETDs, while Arabic ETDs remain an untapped source of data even though the number available digitally is growing. The need to make them more browsable and accessible is growing accordingly. Ways to meet this need include data annotation, indexing, translation, and classification. As the volume of data increases, manual subject classification becomes less feasible, and automatic subject classification becomes essential for the searchability and management of data. There are two main roadblocks to automatic subject classification of Arabic ETDs. The first is the lack of a large public corpus of Arabic ETDs for training; the second is the linguistic complexity of Arabic, especially in academic documents. This research aims to collect key metadata of Arabic ETDs and to apply different automatic subject classification methods. The first goal is pursued by scraping data from the AskZad Digital Library. The second is pursued by exploring different machine learning and deep learning techniques. The experimental results show that deep learning with pretrained language models yielded the highest accuracy, approximately 0.83, while classical machine learning techniques yielded approximately 0.41 and 0.70 for multiclass classification and one-vs-all classification, respectively. This indicates that pretrained language models assist in language understanding, which is essential for text classification.
Acknowledgements
Special thanks go to Dr. Wu Jian from Old Dominion University, Dr. Bill Ingram from Virginia Tech, and the team working on the Institute of Museum and Library Services grant LG-37-19-0078-19. Thanks go to the AskZad Digital Library, which is part of the Saudi Digital Library, from which the dataset was collected. Thanks also go to Fatimah Alotaibi for her support during the project.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Abdelrahman, E., Fox, E. (2022). Improving Accessibility to Arabic ETDs Using Automatic Classification. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_18
DOI: https://doi.org/10.1007/978-3-031-16802-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)