Abstract
Semantic annotation is required to add machine-readable content to natural language text. Global initiatives such as the Semantic Web depend directly on the annotation of massive amounts of textual Web resources. However, given the volume of those resources, manual semantic annotation of their contents is neither feasible nor scalable. In this paper we introduce a methodology to partially annotate the textual content of Web resources in an automatic and unsupervised way. It uses several well-established learning techniques and heuristics to discover relevant entities in text and to associate them with classes of an input ontology by means of linguistic patterns. It also relies on the distribution of information on the Web to assess the degree of semantic correlation between entities and classes of the input domain ontology. Special effort has been put into minimizing the number of Web accesses required to evaluate entities, in order to ensure the scalability of the approach. A manual evaluation has been carried out to test the methodology on several domains, showing promising results.
Cite this article
Sánchez, D., Isern, D. & Millan, M. Content annotation for the semantic web: an automatic web-based approach. Knowl Inf Syst 27, 393–418 (2011). https://doi.org/10.1007/s10115-010-0302-3
DOI: https://doi.org/10.1007/s10115-010-0302-3