Content annotation for the semantic web: an automatic web-based approach

Sánchez, David; Isern, David; Millan, Miquel

doi:10.1007/s10115-010-0302-3

Regular Paper
Published: 21 May 2010

Knowledge and Information Systems, volume 27, pages 393–418 (2011)

Abstract

Semantic annotation is required to add machine-readable content to natural language text. A global initiative such as the Semantic Web directly depends on the annotation of massive amounts of textual Web resources. However, given the volume of those resources, manual semantic annotation of their contents is neither feasible nor scalable. In this paper we introduce a methodology to partially annotate the textual content of Web resources in an automatic and unsupervised way. It uses several well-established learning techniques and heuristics to discover relevant entities in text and to associate them with classes of an input ontology by means of linguistic patterns. It also relies on the distribution of information on the Web to assess the degree of semantic correlation between entities and classes of the input domain ontology. Special effort has been put into minimizing the number of Web accesses required to evaluate entities, in order to ensure the scalability of the approach. A manual evaluation has been carried out to test the methodology in several domains, showing promising results.
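
As an illustration only, the idea of scoring entity-class pairs from Web statistics can be sketched as follows. The abstract does not specify the patterns or the scoring function, so everything here is an assumption: the two pattern templates, the pointwise-mutual-information-style ratio of hit counts, the memoizing wrapper (one simple way to avoid repeated Web accesses), and the names `pattern_queries`, `cached`, `web_score`, `best_class` and `toy_hits`. The hit counts are invented toy values, not results from the paper.

```python
import math
from typing import Callable, Dict, Iterable, List

# Hypothetical pattern templates; the abstract does not list the real ones.
PATTERNS: List[str] = ["{ent} is a {cls}", "{cls} of {ent}"]


def pattern_queries(entity: str, cls: str) -> List[str]:
    """Build one search query per linguistic pattern."""
    return [p.format(ent=entity, cls=cls) for p in PATTERNS]


def cached(hits: Callable[[str], int]) -> Callable[[str], int]:
    """Memoize a hit-count function so each distinct query costs one access."""
    memo: Dict[str, int] = {}

    def lookup(query: str) -> int:
        if query not in memo:
            memo[query] = hits(query)
        return memo[query]

    return lookup


def web_score(entity: str, cls: str,
              hits: Callable[[str], int], total_pages: int) -> float:
    """PMI-style score (an assumed measure): log2 of joint / (p(ent) * p(cls))."""
    joint = sum(hits(q) for q in pattern_queries(entity, cls))
    h_ent, h_cls = hits(entity), hits(cls)
    if joint == 0 or h_ent == 0 or h_cls == 0:
        return float("-inf")  # no evidence for this pairing
    return math.log2(joint * total_pages / (h_ent * h_cls))


def best_class(entity: str, classes: Iterable[str],
               hits: Callable[[str], int], total_pages: int) -> str:
    """Pick the ontology class with the highest web-based score."""
    return max(classes, key=lambda c: web_score(entity, c, hits, total_pages))


# Invented toy counts standing in for a real search engine.
TOY_COUNTS: Dict[str, int] = {
    "Barcelona": 1000,
    "city": 5000,
    "river": 2000,
    "Barcelona is a city": 30,
    "city of Barcelona": 70,
    "Barcelona is a river": 1,
}
TOTAL_PAGES = 1_000_000  # assumed size of the indexed collection


def toy_hits(query: str) -> int:
    """Return the toy hit count for a query, or 0 if it is unknown."""
    return TOY_COUNTS.get(query, 0)
```

With these toy counts, `best_class("Barcelona", ["city", "river"], cached(toy_hits), TOTAL_PAGES)` selects `"city"`. In a real setting `toy_hits` would be replaced by a call to a search engine, which is where the memoizing wrapper would save accesses.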

Author information

Authors and Affiliations

Departament d’Enginyeria Informàtica i Matemàtiques, Intelligent Technologies for Advanced Knowledge Acquisition Research Group (ITAKA), Universitat Rovira i Virgili, Av Països Catalans, 26, 43007, Tarragona, Catalonia, Spain
David Sánchez, David Isern & Miquel Millan

Corresponding author

Correspondence to David Sánchez.

About this article

Cite this article

Sánchez, D., Isern, D. & Millan, M. Content annotation for the semantic web: an automatic web-based approach. Knowl Inf Syst 27, 393–418 (2011). https://doi.org/10.1007/s10115-010-0302-3

Received: 07 July 2008
Revised: 26 October 2009
Accepted: 04 May 2010
Published: 21 May 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10115-010-0302-3
