Classification of Web Documents Using Concept Extraction from Ontologies

Litvak, Marina; Last, Mark; Kisilevich, Slava

doi:10.1007/978-3-540-72839-9_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4476)

Included in the following conference series:

International Workshop on Autonomous Intelligent Systems: Multi-Agents and Data Mining

Abstract

In this paper, we address the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present an ontology-based web content mining methodology with four main stages: creating an ontology for the specified domain, collecting a training set of labeled documents, building a classification model for the domain using the constructed ontology and a classification algorithm, and classifying new documents by information agents via the induced model. We evaluated the proposed methodology in two domains: the chemical domain (web pages containing information about the production of certain chemicals) and a Yahoo! collection of web news documents divided into several categories. Our system receives as input a domain-specific ontology and a set of categorized web documents, and then performs concept generalization on these documents. A key-phrase extractor with an integrated ontology parser creates a database from the input documents, which serves as the training set for the classification algorithm. The classification accuracy of the system is estimated at various levels of the ontology.
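The concept generalization step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy ontology `PARENT`, the function names, and the plain phrase matching that stands in for the key-phrase extractor are all assumptions made for illustration. Each ontology concept found in a document is replaced by its ancestor at a chosen depth, so the same labeled documents yield a different feature set for each ontology level.

```python
import re
from collections import Counter

# Hypothetical toy ontology (an assumption, not taken from the paper):
# each concept maps to its parent, and None marks the root.
PARENT = {
    "chemical": None,
    "acid": "chemical",
    "solvent": "chemical",
    "sulfuric acid": "acid",
    "nitric acid": "acid",
    "acetone": "solvent",
    "ethanol": "solvent",
}

# Longest phrases come first so "sulfuric acid" is matched before "acid".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(PARENT, key=len, reverse=True))
    + r")\b"
)


def path_from_root(concept):
    """Return the concepts on the path from the root down to `concept`."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path[::-1]


def generalize(concept, level):
    """Map `concept` to its ancestor at depth `level` (the root is depth 0).

    A concept that already sits at or above `level` is returned unchanged.
    """
    path = path_from_root(concept)
    return path[min(level, len(path) - 1)]


def concept_features(text, level):
    """Count ontology concepts mentioned in `text`, generalized to `level`."""
    counts = Counter()
    for match in _PATTERN.finditer(text.lower()):
        counts[generalize(match.group(1), level)] += 1
    return counts


def build_training_set(labeled_docs, level):
    """Turn (text, label) pairs into (concept counts, label) pairs."""
    return [(concept_features(text, level), label) for text, label in labeled_docs]


if __name__ == "__main__":
    doc = "Sulfuric acid and nitric acid are mixed with acetone."
    for level in (0, 1, 2):
        print(level, dict(concept_features(doc, level)))
```

Rebuilding the training set at each level and training the same classifier on each version is one way to compare accuracy across ontology levels, as the last sentence of the abstract describes.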

Author information

Authors and Affiliations

Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
Marina Litvak, Mark Last & Slava Kisilevich

Editor information

Vladimir Gorodetsky, Chengqi Zhang, Victor A. Skormin, Longbing Cao

About this paper

Cite this paper

Litvak, M., Last, M., Kisilevich, S. (2007). Classification of Web Documents Using Concept Extraction from Ontologies. In: Gorodetsky, V., Zhang, C., Skormin, V.A., Cao, L. (eds) Autonomous Intelligent Systems: Multi-Agents and Data Mining. AIS-ADM 2007. Lecture Notes in Computer Science, vol 4476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72839-9_24

DOI: https://doi.org/10.1007/978-3-540-72839-9_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72838-2
Online ISBN: 978-3-540-72839-9
eBook Packages: Computer Science, Computer Science (R0)
