Tuples Extraction from HTML Using Logic Wrappers and Inductive Logic Programming

Bădică, Costin; Bădică, Amelia; Popescu, Elvira

doi:10.1007/11495772_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3528)

Included in the following conference series:

International Atlantic Web Intelligence Conference

Abstract

This paper presents an approach for applying inductive logic programming to information extraction from HTML documents structured as unranked ordered trees. We consider information extraction from Web resources that are abstracted as providing sets of tuples. Our approach is based on defining a new class of wrappers as a special class of logic programs – logic wrappers. The approach is demonstrated with examples and experimental results in the area of collecting product information, highlighting the advantages and the limitations of the method.
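To make the setting concrete, the following is a minimal sketch, not the authors' wrapper language or learning procedure. It assumes a well-formed HTML fragment, parses it into an unranked ordered tree (each node keeps an ordered list of any number of children), and applies one hand-written rule that returns a set of tuples. The names `TreeBuilder`, `parse_html`, `descendants`, and `extract_tuples`, the two-column product table, and the rule itself are all hypothetical illustrations; in the paper's approach such rules would be expressed as logic programs and induced from examples rather than written by hand.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds an unranked ordered tree from well-formed HTML (assumption)."""

    def __init__(self):
        super().__init__()
        # Artificial root so that the document always has a single top node.
        self.root = {"tag": "root", "text": "", "children": [], "parent": None}
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "text": "", "children": [], "parent": self.current}
        self.current["children"].append(node)  # child order is preserved
        self.current = node

    def handle_endtag(self, tag):
        # Assumes every start tag has a matching end tag.
        if self.current["parent"] is not None:
            self.current = self.current["parent"]

    def handle_data(self, data):
        self.current["text"] += data.strip()


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def descendants(node):
    """Yields all nodes below `node` in document order."""
    for child in node["children"]:
        yield child
        yield from descendants(child)


def extract_tuples(root):
    """Hand-written illustrative rule: every <tr> with exactly two <td>
    children yields the tuple (text of first cell, text of second cell)."""
    tuples = []
    for node in descendants(root):
        if node["tag"] == "tr":
            cells = [c for c in node["children"] if c["tag"] == "td"]
            if len(cells) == 2:
                tuples.append((cells[0]["text"], cells[1]["text"]))
    return tuples


if __name__ == "__main__":
    page = (
        "<table>"
        "<tr><td>Printer X</td><td>199</td></tr>"
        "<tr><td>Printer Y</td><td>249</td></tr>"
        "</table>"
    )
    print(extract_tuples(parse_html(page)))
```

The rule here is fixed in code; the point of the sketch is only to show what "a Web resource abstracted as providing a set of tuples" looks like when the document is treated as a tree.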

Author information

Authors and Affiliations

Software Engineering Department, University of Craiova, Bvd.Decebal 107, Craiova, RO-200440, Romania
Costin Bădică & Elvira Popescu
Business Information Systems Department, University of Craiova, A.I.Cuza 13, Craiova, RO-200585, Romania
Amelia Bădică

Editor information

Editors and Affiliations

Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447, Warsaw, Poland
Piotr S. Szczepaniak
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01–447, Warsaw, Poland
Janusz Kacprzyk
Institute of Computer Science, Technical University of Łódź, Poland
Adam Niewiadomski

Cite this paper

Bădică, C., Bădică, A., Popescu, E. (2005). Tuples Extraction from HTML Using Logic Wrappers and Inductive Logic Programming. In: Szczepaniak, P.S., Kacprzyk, J., Niewiadomski, A. (eds) Advances in Web Intelligence. AWIC 2005. Lecture Notes in Computer Science, vol 3528. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11495772_8

DOI: https://doi.org/10.1007/11495772_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26219-0
Online ISBN: 978-3-540-31900-9
eBook Packages: Computer Science, Computer Science (R0)
