Classification of Web Documents Using Concept Extraction from Ontologies

  • Conference paper
Autonomous Intelligent Systems: Multi-Agents and Data Mining (AIS-ADM 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4476)

Abstract

This paper addresses the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present an ontology-based web content mining methodology with four main stages: creating an ontology for the specified domain, collecting a training set of labeled documents, building a classification model for the domain from the constructed ontology and a classification algorithm, and classifying new documents by information agents using the induced model. We evaluated the proposed methodology in two domains: the chemical domain (web pages containing information about the production of certain chemicals) and a Yahoo! collection of web news documents divided into several categories. Our system receives as input a domain-specific ontology and a set of categorized web documents, and then performs concept generalization on these documents. We use a key-phrase extractor with an integrated ontology parser to create a database from the input documents, which serves as the training set for the classification algorithm. The classification accuracy of the system is estimated at various levels of the ontology.
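
The concept generalization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy ontology, the `PARENT` map, and the function names `generalize` and `concept_features` are hypothetical and chosen only for illustration. The sketch assumes the ontology is a tree, maps each extracted key phrase to its ancestor at a chosen ontology level, and counts the resulting concepts as features for a classifier.

```python
from collections import Counter

# Hypothetical toy ontology (not from the paper): each concept maps to
# its parent, and None marks the root.
PARENT = {
    "chemical": None,
    "acid": "chemical",
    "solvent": "chemical",
    "sulfuric acid": "acid",
    "nitric acid": "acid",
    "acetone": "solvent",
}


def path_to_root(concept):
    """Return the chain [root, ..., concept] for a concept in the ontology."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain[::-1]


def generalize(concept, level):
    """Map a concept to its ancestor at the given depth (root is level 0).

    A concept that already sits at or above the requested level is
    returned unchanged.
    """
    chain = path_to_root(concept)
    return chain[min(level, len(chain) - 1)]


def concept_features(key_phrases, level):
    """Count ontology concepts among key phrases, generalized to `level`.

    Key phrases that do not appear in the ontology are ignored.
    """
    return Counter(generalize(p, level) for p in key_phrases if p in PARENT)


# Key phrases as a key-phrase extractor might return them for one document.
phrases = ["sulfuric acid", "nitric acid", "acetone", "price list"]
features_l1 = concept_features(phrases, level=1)
features_l2 = concept_features(phrases, level=2)
```

Computing one such feature vector per labeled document, at a fixed level, would yield a training table for a classification algorithm. Repeating the process at different levels is one way to compare accuracy across ontology levels, as the abstract describes.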

Editor information

Vladimir Gorodetsky, Chengqi Zhang, Victor A. Skormin, Longbing Cao

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Litvak, M., Last, M., Kisilevich, S. (2007). Classification of Web Documents Using Concept Extraction from Ontologies. In: Gorodetsky, V., Zhang, C., Skormin, V.A., Cao, L. (eds) Autonomous Intelligent Systems: Multi-Agents and Data Mining. AIS-ADM 2007. Lecture Notes in Computer Science, vol 4476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72839-9_24

  • DOI: https://doi.org/10.1007/978-3-540-72839-9_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72838-2

  • Online ISBN: 978-3-540-72839-9

  • eBook Packages: Computer Science, Computer Science (R0)
