Abstract
In this paper, we address the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present an ontology-based web content mining methodology with four main stages: creating an ontology for the specified domain, collecting a training set of labeled documents, building a classification model for the domain using the constructed ontology and a classification algorithm, and classifying new documents by information agents via the induced model. We evaluated the proposed methodology in two specific domains: the chemical domain (web pages containing information about the production of certain chemicals) and a Yahoo! collection of web news documents divided into several categories. Our system receives as input the domain-specific ontology and a set of categorized web documents, and then performs concept generalization on these documents. We use a key-phrase extractor with an integrated ontology parser to create a database from the input documents, which serves as the training set for the classification algorithm. The classification accuracy of the system is estimated using various levels of the ontology.
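The concept generalization step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy ontology, the names `PARENT`, `generalize`, and `concept_features`, and the use of simple concept counts as features are all assumptions made for illustration. The idea shown is that each extracted key phrase is replaced by its ancestor concept at a chosen ontology level, so that documents are described by more general concepts before being passed to a classifier.

```python
from collections import Counter

# Hypothetical toy ontology (an assumption, not taken from the paper):
# each concept maps to its parent; the root maps to None.
PARENT = {
    "chemical": None,
    "acid": "chemical",
    "base": "chemical",
    "sulfuric acid": "acid",
    "nitric acid": "acid",
    "sodium hydroxide": "base",
}


def path_from_root(concept):
    """Return the list of concepts from the ontology root down to `concept`."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path[::-1]


def generalize(concept, level):
    """Map `concept` to its ancestor at depth `level` (root is level 0).

    A concept that already sits at or above `level` is returned unchanged.
    """
    path = path_from_root(concept)
    return path[min(level, len(path) - 1)]


def concept_features(key_phrases, level):
    """Count generalized concepts for one document.

    Key phrases that do not appear in the ontology are ignored.
    """
    return Counter(generalize(p, level) for p in key_phrases if p in PARENT)


# Key phrases as they might come out of a key-phrase extractor (made-up input).
phrases = ["sulfuric acid", "nitric acid", "sodium hydroxide", "unknown term"]
features_level_1 = concept_features(phrases, level=1)
features_level_0 = concept_features(phrases, level=0)
```

Varying `level` in such a sketch corresponds to evaluating the classifier at different levels of the ontology: a lower level yields fewer, more general features, while a higher level keeps more specific concepts.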
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Litvak, M., Last, M., Kisilevich, S. (2007). Classification of Web Documents Using Concept Extraction from Ontologies. In: Gorodetsky, V., Zhang, C., Skormin, V.A., Cao, L. (eds) Autonomous Intelligent Systems: Multi-Agents and Data Mining. AIS-ADM 2007. Lecture Notes in Computer Science, vol 4476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72839-9_24
DOI: https://doi.org/10.1007/978-3-540-72839-9_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72838-2
Online ISBN: 978-3-540-72839-9
eBook Packages: Computer Science, Computer Science (R0)