Content annotation for the semantic web: an automatic web-based approach

  • Regular Paper
Knowledge and Information Systems

Abstract

Semantic annotation is required to add machine-readable content to natural language text. A global initiative such as the Semantic Web directly depends on the annotation of massive amounts of textual Web resources. However, given the volume of those resources, manual semantic annotation of their contents is neither feasible nor scalable. In this paper we introduce a methodology to partially annotate the textual content of Web resources in an automatic and unsupervised way. It uses several well-established learning techniques and heuristics to discover relevant entities in text and to associate them with classes of an input ontology by means of linguistic patterns. It also relies on the distribution of information on the Web to assess the degree of semantic correlation between entities and classes of the input domain ontology. Special effort has been put into minimizing the number of Web accesses required to evaluate entities, in order to ensure the scalability of the approach. A manual evaluation has been carried out to test the methodology in several domains, showing promising results.
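To make the idea in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that relatedness between an entity and an ontology class can be approximated by counting Web hits for pattern-based queries and normalising by the hits for the entity alone. The pattern templates, the `CachedHits` wrapper, the `relatedness` and `annotate` functions, the threshold, and the toy hit counts are all hypothetical; the abstract does not specify the actual patterns, scoring function, or search engine. The cache illustrates one simple way to limit the number of Web accesses.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical Hearst-style templates (class labels are given in plural form).
# The pattern set actually used in the paper is not stated in the abstract.
PATTERNS = (
    "{cls} such as {ent}",
    "{ent} and other {cls}",
    "{cls} including {ent}",
)


class CachedHits:
    """Wrap a hit-count function so each distinct query is issued only once."""

    def __init__(self, hit_count: Callable[[str], int]) -> None:
        self._hit_count = hit_count
        self._cache: Dict[str, int] = {}
        self.lookups = 0  # number of real (uncached) accesses

    def __call__(self, query: str) -> int:
        if query not in self._cache:
            self.lookups += 1
            self._cache[query] = self._hit_count(query)
        return self._cache[query]


def relatedness(entity: str, cls: str, hits: Callable[[str], int]) -> float:
    """Pattern co-occurrence hits divided by the entity's own hits (assumed score)."""
    entity_hits = hits(f'"{entity}"')
    if entity_hits == 0:
        return 0.0
    joint = sum(hits('"' + p.format(cls=cls, ent=entity) + '"') for p in PATTERNS)
    return joint / entity_hits


def annotate(entity: str, classes: Iterable[str], hits: Callable[[str], int],
             threshold: float = 0.0) -> Optional[str]:
    """Return the best-scoring class, or None if no score exceeds the threshold."""
    scored = {c: relatedness(entity, c, hits) for c in classes}
    best = max(scored, key=scored.get, default=None)
    if best is None or scored[best] <= threshold:
        return None
    return best


# Toy hit counts standing in for a Web search engine (invented for illustration).
TOY_COUNTS = {
    '"Barcelona"': 1000,
    '"cities such as Barcelona"': 40,
    '"Barcelona and other cities"': 10,
    '"rivers such as Barcelona"': 1,
}

if __name__ == "__main__":
    hits = CachedHits(lambda q: TOY_COUNTS.get(q, 0))
    print(annotate("Barcelona", ["cities", "rivers"], hits))
    print("uncached lookups:", hits.lookups)
```

In a real setting the lambda would be replaced by a call to a search service. The cache means that the query for the entity alone is shared across all candidate classes, so evaluating more classes adds only the pattern queries.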

Author information

Corresponding author

Correspondence to David Sánchez.

About this article

Cite this article

Sánchez, D., Isern, D. & Millan, M. Content annotation for the semantic web: an automatic web-based approach. Knowl Inf Syst 27, 393–418 (2011). https://doi.org/10.1007/s10115-010-0302-3

  • DOI: https://doi.org/10.1007/s10115-010-0302-3
