Text Classification of Kannada Webpages Using Various Pre-processing Agents

  • Conference paper
Recent Advances in Intelligent Informatics

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 235)

Abstract

Text classification of webpages has wide applications, and many techniques have been employed for it. In this paper, an attempt is made to classify Kannada webpages into six predetermined classes or categories. Kannada is a morphologically rich Indian language. Kannada webpages are subjected to different pre-processing steps, and machine learning techniques such as Naïve Bayes and Maximum Entropy are applied to train models. Each pre-processing step before classification is implemented as an intelligent agent that performs a particular task, such as language identification, sentence boundary detection, or term frequency calculation. The highest accuracy, 0.9, is observed when both stemming and stopword removal are used.
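
As a rough illustration of the kind of pipeline the abstract describes, the sketch below chains stopword removal, stemming, and term frequency counting into a multinomial Naïve Bayes classifier with add-one smoothing. It is a minimal sketch under stated assumptions, not the authors' implementation: the stopword list, the suffix-stripping stemmer, the toy English documents, the two class labels, and all function names are hypothetical stand-ins. A real system for Kannada would need a Kannada stopword list and stemmer, and the Maximum Entropy model is not shown.

```python
import math
from collections import Counter, defaultdict

# Hypothetical resources for illustration only; they are not taken from the paper.
STOPWORDS = {"the", "a", "of", "is"}
SUFFIXES = ("ing", "s")

def strip_suffix(token):
    """Toy stemmer: remove the first matching suffix if a stem of 3+ characters remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(text, use_stopwords=True, use_stemming=True):
    """Whitespace tokenization followed by optional stopword removal and stemming."""
    tokens = text.lower().split()
    if use_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if use_stemming:
        tokens = [strip_suffix(t) for t in tokens]
    return tokens

def train(docs):
    """Collect per-class document counts and term frequencies from (text, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = preprocess(text)
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify(text, model):
    """Return the class with the highest log posterior under multinomial Naive Bayes."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in preprocess(text):
            if token not in vocab:
                continue  # ignore terms never seen in training
            # Add-one (Laplace) smoothing over the vocabulary
            score += math.log((term_counts[label][token] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data with two hypothetical classes (the paper uses six).
TOY_DOCS = [
    ("the team is winning matches", "sports"),
    ("players scoring goals", "sports"),
    ("the price of stocks is rising", "business"),
    ("markets trading shares", "business"),
]

model = train(TOY_DOCS)
print(classify("winning goals", model))
print(classify("stocks rising", model))
```

The `use_stopwords` and `use_stemming` flags in `preprocess` are there so that each pre-processing step can be switched on or off independently, which is one way to compare configurations such as the combined stemming and stopword removal setting mentioned above.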

Author information

Corresponding author

Correspondence to N. Deepamala.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Deepamala, N., Kumar, P.R. (2014). Text Classification of Kannada Webpages Using Various Pre-processing Agents. In: Thampi, S., Abraham, A., Pal, S., Rodriguez, J. (eds) Recent Advances in Intelligent Informatics. Advances in Intelligent Systems and Computing, vol 235. Springer, Cham. https://doi.org/10.1007/978-3-319-01778-5_24

  • DOI: https://doi.org/10.1007/978-3-319-01778-5_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-01777-8

  • Online ISBN: 978-3-319-01778-5

  • eBook Packages: Engineering, Engineering (R0)
