Text Classification of Kannada Webpages Using Various Pre-processing Agents

Deepamala, N.; Kumar, P. Ramakanth

doi:10.1007/978-3-319-01778-5_24


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 235)


Abstract

Text classification of webpages has wide applications, and many techniques have been employed for it. This paper attempts to classify Kannada webpages into six predetermined classes (categories). Kannada is a morphologically rich Indian language. The webpages are subjected to different pre-processing steps, and machine learning techniques such as Naïve Bayes and Maximum Entropy are applied to train models. Each pre-processing step before classification is implemented as an intelligent agent that performs one task, such as language identification, sentence boundary detection, or term frequency calculation. The highest accuracy, 0.9, is achieved when both stemming and stopword removal are applied.
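
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stopword list, the suffix-stripping stemmer, the English toy documents (standing in for Kannada text), and all function and class names are hypothetical. Only a multinomial Naïve Bayes classifier with add-one smoothing is shown; Maximum Entropy, language identification, and sentence boundary detection are omitted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical placeholders: the paper's stopword list and stemmer are not
# given in the abstract, so tiny English stand-ins are used here.
STOPWORDS = {"the", "is", "a", "of"}
SUFFIXES = ("ing", "s")


def remove_stopwords(tokens):
    """Pre-processing agent: drop tokens found in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(tokens):
    """Pre-processing agent: strip at most one known suffix per token."""
    stemmed = []
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def term_frequencies(text, agents):
    """Tokenize on whitespace, run each agent in order, count the terms."""
    tokens = text.lower().split()
    for agent in agents:
        tokens = agent(tokens)
    return Counter(tokens)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tf, label in zip(docs, labels):
            self.term_counts[label].update(tf)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, tf):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for term, freq in tf.items():
                if term in self.vocab:  # terms unseen in training are skipped
                    score += freq * math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical; two classes instead of the paper's six).
train = [
    ("the team is winning the match", "sports"),
    ("players scoring goals", "sports"),
    ("the market is rising", "business"),
    ("banks raising rates", "business"),
]
agents = [remove_stopwords, stem]
docs = [term_frequencies(text, agents) for text, _ in train]
labels = [label for _, label in train]
model = NaiveBayes().fit(docs, labels)
prediction = model.predict(term_frequencies("a match of goals", agents))
```

Passing a different `agents` list (for example `[]`, `[stem]`, or `[remove_stopwords]`) to `term_frequencies` mirrors the kind of comparison the abstract reports, where the pre-processing steps applied before training are varied.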



Author information

Authors and Affiliations

R.V. College of Engineering, Bangalore, India
N. Deepamala & P. Ramakanth Kumar


Corresponding author

Correspondence to N. Deepamala.

Editor information

Editors and Affiliations

Technopark Campus Trivandrum, Indian Inst. of Information Technology and Management – Kerala (IIITM-K), Kerala, India
Sabu M. Thampi
Machine Intelligence Research Labs (MIR Labs), Auburn, USA
Ajith Abraham
Indian Statistical Institute, Kolkata, India
Sankar Kumar Pal
Department of Computer Science, School of Science, University of Salamanca, Salamanca, Spain
Juan Manuel Corchado Rodriguez


About this paper

Cite this paper

Deepamala, N., Kumar, P.R. (2014). Text Classification of Kannada Webpages Using Various Pre-processing Agents. In: Thampi, S., Abraham, A., Pal, S., Rodriguez, J. (eds) Recent Advances in Intelligent Informatics. Advances in Intelligent Systems and Computing, vol 235. Springer, Cham. https://doi.org/10.1007/978-3-319-01778-5_24


DOI: https://doi.org/10.1007/978-3-319-01778-5_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01777-8
Online ISBN: 978-3-319-01778-5
eBook Packages: Engineering, Engineering (R0)
