Summary
Recently it was shown that Inductive Logic Programming can be successfully applied to data extraction from HTML. However, the approach suffers from two problems: high computational complexity with respect to the number of nodes in the target document, and high computational complexity with respect to the arity of the extracted tuples. In this note we address the first problem by proposing an efficient path generalization algorithm for learning rules that extract single information items. The presentation is supplemented with a description of a sample experiment.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bădică, C., Bădică, A., Popescu, E. (2006). A New Path Generalization Algorithm for HTML Wrapper Induction. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_2
DOI: https://doi.org/10.1007/3-540-33880-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)