A New Path Generalization Algorithm for HTML Wrapper Induction

Bădică, Costin; Bădică, Amelia; Popescu, Elvira

doi:10.1007/3-540-33880-2_2

Part of the book series: Studies in Computational Intelligence (SCI, volume 23)

Summary

Recently it was shown that Inductive Logic Programming can be successfully applied to data extraction from HTML. However, the approach suffers from two problems: its computational complexity is high with respect to both the number of nodes in the target document and the arity of the extracted tuples. In this note we address the first problem by proposing an efficient path generalization algorithm for learning rules that extract single information items. The presentation is supplemented with a description of a sample experiment.

Author information

Authors and Affiliations

Software Engineering Department, University of Craiova, Bvd. Decebal 107, Craiova, RO-200440, Romania
Costin Bădică & Elvira Popescu
Business Information Systems Department, University of Craiova, A.I. Cuza 13, Craiova, RO-200585, Romania
Amelia Bădică

Editor information

Editors and Affiliations

Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Mark Last
Institute of Computer Sciences, Technical University of Lodz, ul. Wolczanska 215, 93-1005, Lodz, Poland
Piotr S. Szczepaniak
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warsaw, Poland
Piotr S. Szczepaniak
Department of Software Engineering, ORT Braude College, POB. 78, 21982, Karmiel, Israel
Zeev Volkovich
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB 118, Tampa, FL, 33620, USA
Abraham Kandel

About this chapter

Cite this chapter

Bădică, C., Bădică, A., Popescu, E. (2006). A New Path Generalization Algorithm for HTML Wrapper Induction. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_2

DOI: https://doi.org/10.1007/3-540-33880-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)
