Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus

Trandafili, Evis; Kote, Nelda; Biba, Marenglen

doi:10.1007/978-3-319-75928-9_48

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 17)

Included in the following conference series:

International Conference on Emerging Internetworking, Data & Web Technologies

Abstract

Text mining and natural language processing are gaining a significant role in daily life as information volumes increase steadily. Most digital information is unstructured, in the form of raw text. While mining and language processing have been researched extensively for several languages, much less work has been done for others. In this paper we evaluate the performance of some of the most important text classification algorithms on a corpus of Albanian texts. After applying natural language preprocessing steps, we apply several algorithms, including Simple Logistics, Naïve Bayes, k-Nearest Neighbor, Decision Trees, Random Forest, Support Vector Machines and Neural Networks. The experiments show that Naïve Bayes and Support Vector Machines perform best in classifying Albanian corpora. The Simple Logistics algorithm also shows good results.
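
To illustrate one of the algorithm families named in the abstract, the following is a minimal sketch of a multinomial Naïve Bayes text classifier in plain Python. It is not the authors' implementation or experimental setup: the class name, the tokenizer, the smoothing parameter, the two category labels and the four toy Albanian sentences are all assumptions invented for this example, and the tokenizer stands in for the real preprocessing (such as stemming) that the paper applies.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in for real preprocessing)."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


class MultinomialNaiveBayes:
    """Hypothetical bag-of-words classifier with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # log P(class) + sum of log P(token | class), smoothed
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only (not from the paper's corpus).
train_docs = [
    "ekipi fitoi ndeshjen",
    "lojtari shënoi gol",
    "qeveria miratoi ligjin",
    "parlamenti votoi buxhetin",
]
train_labels = ["sport", "sport", "politike", "politike"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict("ekipi shënoi gol"))
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the smoothing term keeps a token that is absent from one class from driving that class's probability to zero.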

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Information Technology, Polytechnic University of Tirana, Tirana, Albania
Evis Trandafili
Department of Fundamentals of Computer Science, Faculty of Information Technology, Polytechnic University of Tirana, Tirana, Albania
Nelda Kote
Department of Computer Science, Faculty of Information Technology, New York University of Tirana, Tirana, Albania
Marenglen Biba

Corresponding author

Correspondence to Evis Trandafili.

Editor information

Editors and Affiliations

Department of Information and Communication Engineering, Fukuoka Institute of Technology, Fukuoka-shi, Japan
Leonard Barolli
Technical University of Catalonia, Barcelona, Spain
Fatos Xhafa
Department of Computer Science, COMSATS Institute of Information Technology, Islamabad, Pakistan
Nadeem Javaid
Polytechnic University of Tirana, Tirana, Albania
Evjola Spaho
Polytechnic University of Tirana, Tirana, Albania
Vladi Kolici

About this paper

Cite this paper

Trandafili, E., Kote, N., Biba, M. (2018). Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus. In: Barolli, L., Xhafa, F., Javaid, N., Spaho, E., Kolici, V. (eds) Advances in Internet, Data & Web Technologies. EIDWT 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-75928-9_48

DOI: https://doi.org/10.1007/978-3-319-75928-9_48
Published: 24 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75927-2
Online ISBN: 978-3-319-75928-9
eBook Packages: Engineering, Engineering (R0)