Abstract
In recent years, much work has been invested in automatically learning wrappers for information extraction from HTML tables and lists. Our research has focused on a system that can learn a wrapper from a single unlabelled page. An essential step is to locate the tabular data within the page, which is not trivial when the structures of the data tuples are similar but not identical. In this paper we describe an algorithm that automatically detects approximately repetitive structures within a single sequence. The algorithm does not rely on any domain knowledge or HTML heuristics, and it can be used to detect repetitive patterns and hence to learn wrappers from a single unlabelled tabular page.
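To make the idea of approximate repetition within one sequence concrete, the following is a minimal sketch that assumes a Smith-Waterman-style local alignment of a token sequence against itself. It is an illustration only and is not the algorithm described in the paper; the function name `best_approximate_repeat` and the scoring parameters `match`, `mismatch`, and `gap` are hypothetical choices made for this example.

```python
def best_approximate_repeat(tokens, match=2, mismatch=-1, gap=-1):
    """Find the best-scoring pair of similar segments inside one sequence.

    Returns (score, (start1, end1), (start2, end2)) with half-open ranges.
    Illustrative sketch: local alignment of `tokens` against itself,
    restricted to i < j so that no token is aligned with itself.
    """
    n = len(tokens)
    # H[i][j]: best score of a local alignment ending at tokens[i-1], tokens[j-1].
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)  # (score, i, j) of the highest-scoring cell seen so far

    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):  # upper triangle only
            s = match if tokens[i - 1] == tokens[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new alignment here
                H[i - 1][j - 1] + s,  # align the two tokens
                H[i - 1][j] + gap,    # skip a token in the first segment
                H[i][j - 1] + gap,    # skip a token in the second segment
            )
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)

    # Trace back from the best cell to recover where both segments start.
    score, i, j = best
    end1, end2 = i, j
    while i > 0 and j > i and H[i][j] > 0:
        s = match if tokens[i - 1] == tokens[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score, (i, end1), (j, end2)


if __name__ == "__main__":
    # Two "rows" that are similar but not identical (third token differs).
    print(best_approximate_repeat(list("abcdeabXde")))
```

In a wrapper-induction setting the tokens would more plausibly be HTML tag names than single characters. This sketch reports only one pair of similar segments and runs in quadratic time and space, so it shows the notion of approximate matching rather than a full detector of repeated data tuples.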
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gao, X., Andreae, P., Collins, R. (2004). Approximately Repetitive Structure Detection for Wrapper Induction. In: Zhang, C., Guesgen, H.W., Yeap, W.K. (eds) PRICAI 2004: Trends in Artificial Intelligence. PRICAI 2004. Lecture Notes in Computer Science, vol 3157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28633-2_62
DOI: https://doi.org/10.1007/978-3-540-28633-2_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22817-2
Online ISBN: 978-3-540-28633-2
eBook Packages: Springer Book Archive