Abstract
This paper presents an approach for applying inductive logic programming to information extraction from HTML documents structured as unranked ordered trees. We consider information extraction from Web resources that are abstracted as providing sets of tuples. Our approach is based on defining a new class of wrappers as a special class of logic programs – logic wrappers. The approach is demonstrated with examples and experimental results in the area of collecting product information, highlighting the advantages and the limitations of the method.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bădică, C., Bădică, A., Popescu, E. (2005). Tuples Extraction from HTML Using Logic Wrappers and Inductive Logic Programming. In: Szczepaniak, P.S., Kacprzyk, J., Niewiadomski, A. (eds) Advances in Web Intelligence. AWIC 2005. Lecture Notes in Computer Science, vol 3528. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11495772_8
DOI: https://doi.org/10.1007/11495772_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26219-0
Online ISBN: 978-3-540-31900-9
eBook Packages: Computer Science (R0)