Abstract
We present SCOOP, a record extraction system. We assume that the semi-structured documents given to SCOOP share a similar format and that each contains exactly one record consisting of several distinct fields. SCOOP treats a document as a plain string and uses no knowledge about the input except that each field is surrounded by delimiters, that a left delimiter ends with “>”, and that the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides each document into two parts: the contents of the fields and everything else. It then counts substrings near the boundaries between the two parts and extracts the most frequent ones as delimiters. We report experimental results on news articles written in English or Japanese, where a record consists of the headline and the body text. In these experiments, SCOOP extracts records at a high rate.
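The delimiter-counting idea in the abstract can be illustrated with a minimal sketch. This is not SCOOP's actual algorithm: it skips the step that separates field contents from the rest of the document, it uses a fixed candidate width, and the function names (`candidate_delimiters`, `count_candidates`, `extract_field`) and the sample documents are all hypothetical. It only shows the two stated constraints at work: left-delimiter candidates end with ">", right-delimiter candidates begin with "<", and candidates shared by many documents are more plausible delimiters than ones that overlap field contents.

```python
from collections import Counter


def candidate_delimiters(doc, width):
    """Collect fixed-width substrings that end with '>' (left candidates)
    or begin with '<' (right candidates). The fixed width is an assumption
    made for this sketch only."""
    lefts, rights = set(), set()
    for i, ch in enumerate(doc):
        if ch == ">" and i + 1 >= width:
            lefts.add(doc[i + 1 - width:i + 1])
        if ch == "<" and i + width <= len(doc):
            rights.add(doc[i:i + width])
    return lefts, rights


def count_candidates(docs, width=4):
    """Count, for each candidate, the number of documents it occurs in."""
    left_counts, right_counts = Counter(), Counter()
    for doc in docs:
        lefts, rights = candidate_delimiters(doc, width)
        left_counts.update(lefts)
        right_counts.update(rights)
    return left_counts, right_counts


def extract_field(doc, left, right):
    """Return the text between the first `left` and the next `right`,
    or None if either delimiter is missing."""
    start = doc.find(left)
    if start < 0:
        return None
    start += len(left)
    end = doc.find(right, start)
    if end < 0:
        return None
    return doc[start:end]


# Hypothetical input: two documents sharing one template.
docs = [
    "<html><h1>Rain expected</h1><p>Showers tonight.</p></html>",
    "<html><h1>Markets rise</h1><p>Trading was brisk.</p></html>",
]
left_counts, right_counts = count_candidates(docs)
headline = extract_field(docs[0], "<h1>", "</h1")
```

In this toy input, template substrings such as `<h1>` and `</h1` occur in both documents, while a candidate that overlaps the field contents, such as `<p>S`, occurs in only one. Several template candidates tie on frequency here, which is one reason a real system needs more than raw counts, for example the boundary information the abstract describes.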
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yamada, Y., Ikeda, D., Hirokawa, S. (2001). SCOOP: A Record Extractor without Knowledge on Input. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science, vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_45
DOI: https://doi.org/10.1007/3-540-45650-3_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42956-2
Online ISBN: 978-3-540-45650-6
eBook Packages: Springer Book Archive