Abstract
A web directory hierarchy is critical for serving users' search requests. Creating and maintaining such directories without the involvement of human experts requires accurate classification of web documents. In this paper, we explore web page classification using keywords from documents as attributes and random forests as the learning method. Our initial results are promising: random forests performed better than several other well-known learning methods. When the number of topics increased from five to seven, random forests still outperformed the other methods, even though absolute classification rates decreased.
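To make the approach concrete, the following is a minimal sketch of keyword-based classification with a small random forest, written in pure Python. It is an illustration only, not the authors' implementation: the vocabulary, the toy documents, the topic labels, and all function names are hypothetical, and the forest is deliberately simplified (binary keyword attributes, bootstrap sampling, a random subset of attributes at each split, and majority voting).

```python
import random
from collections import Counter

def keyword_vector(text, vocabulary):
    """Binary attributes: 1 if the keyword occurs in the document, else 0."""
    words = set(text.lower().split())
    return [1 if kw in words else 0 for kw in vocabulary]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_try, rng, depth=0, max_depth=4):
    # Leaf: the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority(labels)
    best = None
    # Only a random subset of attributes is considered at each node.
    for f in rng.sample(range(len(rows[0])), n_try):
        absent = [i for i, r in enumerate(rows) if r[f] == 0]
        present = [i for i, r in enumerate(rows) if r[f] == 1]
        if not absent or not present:
            continue  # this keyword does not split the node
        score = sum(len(part) * gini([labels[i] for i in part])
                    for part in (absent, present)) / len(rows)
        if best is None or score < best[0]:
            best = (score, f, absent, present)
    if best is None:
        return majority(labels)
    _, f, absent, present = best

    def grow(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          n_try, rng, depth + 1, max_depth)

    # Internal node: (attribute index, subtree if absent, subtree if present).
    return (f, grow(absent), grow(present))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, absent_branch, present_branch = tree
        tree = present_branch if row[f] else absent_branch
    return tree

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))  # common heuristic: sqrt(#attributes)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) examples with replacement.
        sample = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in sample],
                                 [labels[i] for i in sample], n_try, rng))
    return forest

def predict_forest(forest, row):
    # Each tree votes; the most common label wins.
    return majority([predict_tree(tree, row) for tree in forest])

# Hypothetical toy data: two topics and six keywords.
VOCABULARY = ["goal", "match", "team", "stock", "market", "bank"]
DOCS = [
    ("the team scored a late goal in the match", "sports"),
    ("a home match for the team", "sports"),
    ("stock market falls as bank shares drop", "finance"),
    ("the bank raised its market forecast", "finance"),
]
rows = [keyword_vector(text, VOCABULARY) for text, _ in DOCS]
labels = [label for _, label in DOCS]
forest = train_forest(rows, labels)
prediction = predict_forest(forest, keyword_vector("the team won the match", VOCABULARY))
print(prediction)
```

In practice one would more likely rely on an established machine learning toolkit than on hand-written trees; the sketch only shows how keyword attributes feed an ensemble of randomized decision trees.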
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Klassen, M., Paturi, N. (2010). Web Document Classification by Keywords Using Random Forests. In: Zavoral, F., Yaghob, J., Pichappan, P., El-Qawasmeh, E. (eds) Networked Digital Technologies. NDT 2010. Communications in Computer and Information Science, vol 88. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14306-9_26
DOI: https://doi.org/10.1007/978-3-642-14306-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14305-2
Online ISBN: 978-3-642-14306-9
eBook Packages: Computer Science, Computer Science (R0)