Abstract
With the advent of the Semantic Web, there is a great need to upgrade existing web content to semantic web content. This can be accomplished through semantic annotations. Unfortunately, manual annotation is tedious, time-consuming and error-prone. In this paper, we propose a tool, called iASA, that learns to automatically annotate web documents according to an ontology. iASA combines information extraction (specifically, the Similarity-based Rule Learner, SRL) with machine learning techniques. Using linguistic knowledge and an optimal dynamic window size, SRL produces annotation rules of better quality than comparable semantic annotation systems. Similarity-based learning efficiently reduces the search space by avoiding pseudo rule generalization. In the annotation phase, iASA exploits ontology knowledge to refine the annotations it proposes. Moreover, our annotation algorithm exploits machine learning methods to correctly select instances and to predict missing instances. Finally, iASA provides an explanation component that explains the nature of the learner and annotator to the user. Explanations can greatly help users understand the rule induction and annotation process, so that they can focus on correcting rules and annotations quickly. Experimental results show that iASA can reach high accuracy quickly.
Supported by the National Natural Science Foundation of China under Grant No. 60443002.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tang, J., Li, J., Lu, H., Liang, B., Huang, X., Wang, K. (2005). iASA: Learning to Annotate the Semantic Web. In: Spaccapietra, S. (eds) Journal on Data Semantics IV. Lecture Notes in Computer Science, vol 3730. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11603412_4
DOI: https://doi.org/10.1007/11603412_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31001-3
Online ISBN: 978-3-540-31447-9
eBook Packages: Computer Science (R0)