Abstract
This work aims to approach web pages categorization by means of semantic graphs and machine learning techniques. We propose to use a semantic graph that can provide a compact and structured representation of the concepts present in a document in order to take into account the semantic information. The semantic graph allows determining a map of the semantic areas contained in the document and their relationships w.r.t. a particular concept or term. The semantic measure between the terms is calculated by using the lexical database (i.e., WordNet). The document categorization is accomplished by a machine learning technique. We compare the performance of both supervised and unsupervised techniques (i.e., Support Vector Machine and Self Organizing Maps, respectively). The proposed methodology has been applied for classification and agglomeration of benchmark and real data. From the analysis of the results it can be shown that the model trained with semantic features obtains satisfactory results, in particular by using the unsupervised machine learning technique.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Bloehdorn, S., Hotho, A.: Boosting for text classification with semantic features. In: Mobasher, B., Nasraoui, O., Liu, B., Masand, B. (eds.) WebKDD 2004. LNCS (LNAI), vol. 3932, pp. 149–166. Springer, Heidelberg (2006)
Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
Divya, C.: Mining Contents in Web Pages and Ranking of Web Pages Using Cosine Similarity. International Journal of Science and Research (IJSR) 3(4) (2014)
Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78(9), 1464–1480 (1990)
Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning, San Francisco, vol. 1, pp. 296–304 (1998)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to wordnet: An on-line lexical database. International Journal of Lexicography 3(4), 235–244 (1990)
Placitelli, A.P.: Categorizzazione di pagine web mediante grafo semantico e tecniche di machine learning, MSc dissertion, University of Naples “Parthenope” (2013)
Qi, X., Davison, B.D.: Web Page classification: Features and algorithms. ACM Computing Surveys (CSUR) 41(2), 12 (2009)
http://www.daviddlewis.com/resources/testcollections/reuters21578/
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. In: Information Processing and Management, pp. 513–523 (1988)
Trstenjaka, B., Mikacb, S., Donkoc, D.: KNN with TF-IDF based Framework for Text Categorization. Procedia Engineering 69, 1356–1364 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Camastra, F., Ciaramella, A., Placitelli, A., Staiano, A. (2015). Machine Learning-Based Web Documents Categorization by Semantic Graphs. In: Bassis, S., Esposito, A., Morabito, F. (eds) Advances in Neural Networks: Computational and Theoretical Issues. Smart Innovation, Systems and Technologies, vol 37. Springer, Cham. https://doi.org/10.1007/978-3-319-18164-6_8
Download citation
DOI: https://doi.org/10.1007/978-3-319-18164-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18163-9
Online ISBN: 978-3-319-18164-6
eBook Packages: EngineeringEngineering (R0)