Abstract
Text mining and natural language processing play a growing role in daily life as information volumes increase steadily. Most digital information is unstructured and takes the form of raw text. While text mining and language processing have been researched extensively for several languages, much less work has been done for others. In this paper we evaluate the performance of some of the most important text classification algorithms on a corpus of Albanian texts. After natural language preprocessing, we apply several algorithms: Simple Logistics, Naïve Bayes, k-Nearest Neighbor, Decision Trees, Random Forest, Support Vector Machines and Neural Networks. The experiments show that Naïve Bayes and Support Vector Machines perform best in classifying Albanian corpora. The Simple Logistics algorithm also shows good results.
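As a rough illustration of one of the algorithms named above, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with Laplace smoothing. It is not the authors' implementation or experimental setup: the function names (`tokenize`, `train_naive_bayes`, `predict`), the whitespace tokenizer standing in for the preprocessing steps, and the tiny English toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical stand-in for the preprocessing step:
    # lowercase the text and split it on whitespace.
    return text.lower().split()


def train_naive_bayes(docs):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the class with the highest log-posterior for the given text."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokenize(text):
            if tok not in model["vocab"]:
                continue  # tokens never seen in training are ignored
            # Laplace (add-one) smoothed log likelihood of the token
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data invented for illustration only; it is not the Albanian corpus.
TOY_CORPUS = [
    ("team wins football match", "sport"),
    ("player scores goal in match", "sport"),
    ("parliament votes on new law", "politics"),
    ("minister announces election law", "politics"),
]

model = train_naive_bayes(TOY_CORPUS)
```

Calling `predict(model, "football goal")` assigns a new text to the class whose word distribution explains it best. A realistic evaluation would replace the toy corpus with labelled documents, apply language-specific preprocessing such as stemming, and measure accuracy on held-out data.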
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Trandafili, E., Kote, N., Biba, M. (2018). Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus. In: Barolli, L., Xhafa, F., Javaid, N., Spaho, E., Kolici, V. (eds) Advances in Internet, Data & Web Technologies. EIDWT 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-75928-9_48
DOI: https://doi.org/10.1007/978-3-319-75928-9_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75927-2
Online ISBN: 978-3-319-75928-9
eBook Packages: Engineering, Engineering (R0)