Abstract
Text classification of webpages has wide applications, and many techniques have been employed for it. This paper attempts to classify Kannada webpages into six predetermined classes. Kannada is a morphologically rich Indian language. The webpages pass through several pre-processing steps, and machine learning techniques such as Naïve Bayes and Maximum Entropy are applied to train models. Each pre-processing step before classification is implemented as an intelligent agent that performs a single task, such as language identification, sentence boundary detection, or term frequency calculation. The highest accuracy, 0.9, is achieved when both stemming and stopword removal are used.
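As a rough illustration of the term frequency calculation, stopword removal, and Naïve Bayes training mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, a multinomial event model with add-one smoothing, and toy English tokens standing in for Kannada text; the names `term_frequencies` and `NaiveBayes` are hypothetical, and stemming, language identification, and sentence boundary detection are omitted.

```python
import math
from collections import Counter, defaultdict


def term_frequencies(text, stopwords=frozenset()):
    """Count whitespace-separated terms, skipping any listed stopwords."""
    return Counter(tok for tok in text.split() if tok not in stopwords)


class NaiveBayes:
    """Multinomial Naive Bayes over term frequencies (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant (assumed)
        self.class_docs = Counter()               # documents seen per class
        self.term_counts = defaultdict(Counter)   # term counts per class
        self.vocab = set()                        # all terms seen in training

    def train(self, docs, stopwords=frozenset()):
        for text, label in docs:
            tf = term_frequencies(text, stopwords)
            self.class_docs[label] += 1
            self.term_counts[label].update(tf)
            self.vocab.update(tf)

    def predict(self, text, stopwords=frozenset()):
        tf = term_frequencies(text, stopwords)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for term, freq in tf.items():
                if term in self.vocab:             # ignore unseen terms
                    score += freq * math.log((counts[term] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy usage with placeholder tokens and two made-up classes.
training = [
    ("cricket match score", "sports"),
    ("cricket team win", "sports"),
    ("film actor song", "cinema"),
    ("film music actor", "cinema"),
]
model = NaiveBayes()
model.train(training)
print(model.predict("cricket score"))
```

Working in log space avoids numeric underflow when many term probabilities are multiplied, and the smoothing constant keeps a term that never appeared in a class from driving that class's probability to zero.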
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Deepamala, N., Kumar, P.R. (2014). Text Classification of Kannada Webpages Using Various Pre-processing Agents. In: Thampi, S., Abraham, A., Pal, S., Rodriguez, J. (eds) Recent Advances in Intelligent Informatics. Advances in Intelligent Systems and Computing, vol 235. Springer, Cham. https://doi.org/10.1007/978-3-319-01778-5_24
DOI: https://doi.org/10.1007/978-3-319-01778-5_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01777-8
Online ISBN: 978-3-319-01778-5
eBook Packages: Engineering, Engineering (R0)