Abstract
Extraction ontologies represent a novel paradigm in web information extraction (one of the ‘deductive’ species of web mining) that makes it possible to proceed swiftly from initial domain modelling to a running functional prototype, without the need to collect and label large amounts of training examples. The bottlenecks of this approach, however, are the tedium of developing an extraction ontology that adequately covers the semantic scope of the web data to be processed, and the difficulty of combining the ontology-based approach with inductive or wrapper-based approaches. We report on an ongoing project that aims to develop a web information extraction tool based on richly structured extraction ontologies, with the additional possibility of (1) semi-automatically constructing these ontologies from third-party domain ontologies, (2) absorbing the results of inductive learning for subtasks where pre-labelled data abound, and (3) actively exploiting formatting regularities in the wrapper style.
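To make the general idea concrete, the following minimal Python sketch shows how hand-written extraction knowledge can be applied to text without any training examples. It is an illustration under stated assumptions, not the format or algorithm of the tool described in the chapter: the names `Attribute`, `ExtractionOntology`, `ONTOLOGY` and `demo`, the "Contact" class, and both regular expressions are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One attribute of an extracted class (hypothetical structure)."""
    name: str
    value_pattern: str         # regex describing the attribute value itself
    context_pattern: str = ""  # optional regex expected just before the value


@dataclass
class ExtractionOntology:
    """A class to extract, described by hand-written attribute patterns."""
    class_name: str
    attributes: list = field(default_factory=list)

    def extract(self, text, window=30):
        """Return (attribute name, value) pairs found in text."""
        found = []
        for attr in self.attributes:
            for m in re.finditer(attr.value_pattern, text):
                # Look at a short window of text to the left of the candidate.
                left = text[max(0, m.start() - window):m.start()]
                if attr.context_pattern and not re.search(
                    attr.context_pattern, left, re.IGNORECASE
                ):
                    continue  # required context is missing, reject candidate
                found.append((attr.name, m.group(0)))
        return found


# Illustrative ontology; both patterns are assumptions made for this sketch.
ONTOLOGY = ExtractionOntology("Contact", [
    Attribute("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    Attribute("phone", r"\+?\d[\d ]{7,}\d", context_pattern=r"(phone|tel)\W*$"),
])


def demo():
    return ONTOLOGY.extract("Phone: +420 224 095 462, e-mail: info@example.org")
```

In this sketch the phone pattern only fires when a cue word such as "Phone" directly precedes the candidate, which hints at how contextual evidence can compensate for the absence of labelled data. A richly structured extraction ontology would carry far more knowledge than two regular expressions.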
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Labský, M., Svátek, V., Nekvasil, M., Rak, D. (2009). The Ex Project: Web Information Extraction Using Extraction Ontologies. In: Berendt, B., et al. Knowledge Discovery Enhanced with Semantic and Social Information. Studies in Computational Intelligence, vol 220. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01891-6_5
DOI: https://doi.org/10.1007/978-3-642-01891-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01890-9
Online ISBN: 978-3-642-01891-6
eBook Packages: Engineering (R0)