Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus

  • Conference paper
Advances in Internet, Data & Web Technologies (EIDWT 2018)

Abstract

Text mining and natural language processing are gaining a significant role in our daily lives as information volumes increase steadily. Most digital information is unstructured, in the form of raw text. While text mining and language processing have been researched extensively for several languages, much less work has been done for others. In this paper we evaluate the performance of some of the most important text classification algorithms on a corpus of Albanian texts. After applying natural language preprocessing steps, we apply several algorithms: Simple Logistics, Naïve Bayes, k-Nearest Neighbor, Decision Trees, Random Forest, Support Vector Machines and Neural Networks. The experiments show that Naïve Bayes and Support Vector Machines perform best in classifying Albanian corpora. The Simple Logistics algorithm also shows good results.
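As a rough illustration of the kind of classifier the abstract names, the sketch below implements a multinomial Naïve Bayes text classifier with add-one smoothing in plain Python. It is not the authors' setup: the class name `NaiveBayes`, the whitespace tokenizer, and the four-sentence Albanian toy corpus with its two labels are all assumptions made up for this example, and the preprocessing steps the paper applies (not detailed in the abstract) are omitted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace. A real pipeline would add
    # further preprocessing (e.g. stop-word removal, stemming).
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior of the class.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for tok in tokenize(doc):
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (n_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus, invented for this example (not from the paper).
train_docs = [
    "ekipi fitoi ndeshjen",       # the team won the match
    "lojtari shënoi gol",         # the player scored a goal
    "qeveria miratoi ligjin",     # the government approved the law
    "parlamenti votoi buxhetin",  # the parliament voted on the budget
]
train_labels = ["sport", "sport", "politike", "politike"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("ekipi shënoi gol")
print(prediction)
```

On a corpus this small the result only shows the mechanics. Comparing classifiers in the way the abstract describes would need a labelled corpus of realistic size and a held-out evaluation split.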

Author information

Corresponding author

Correspondence to Evis Trandafili.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Trandafili, E., Kote, N., Biba, M. (2018). Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus. In: Barolli, L., Xhafa, F., Javaid, N., Spaho, E., Kolici, V. (eds) Advances in Internet, Data & Web Technologies. EIDWT 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-75928-9_48

  • DOI: https://doi.org/10.1007/978-3-319-75928-9_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75927-2

  • Online ISBN: 978-3-319-75928-9

  • eBook Packages: Engineering, Engineering (R0)
