SCOOP: A Record Extractor without Knowledge on Input

Yamada, Yasuhiro; Ikeda, Daisuke; Hirokawa, Sachio

doi:10.1007/3-540-45650-3_45

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2226)

Included in the following conference series:

International Conference on Discovery Science

Abstract

We present SCOOP, a record extraction system. We assume that the semi-structured documents given to SCOOP share a similar format and that each contains exactly one record consisting of several distinct fields. SCOOP treats a document as a plain string and uses no knowledge about the input except that each field is surrounded by delimiters, that a left delimiter ends with “>”, and that the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides a document into two parts: the contents of the fields and everything else. SCOOP then counts substrings near the boundaries between the two parts and extracts the most frequent ones as delimiters. We show experimental results on news articles written in English or Japanese. In this experiment, a record consists of the headline and the body text. SCOOP extracts records at a high rate.
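
The delimiter-selection step described above can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not say how the rough division is computed, how long candidate substrings are, or how far from a boundary SCOOP looks. The sketch therefore assumes the rough field positions are already given (the hypothetical `rough_spans` argument), and the names `pick_delimiters`, `extract_field`, `left_width`, `right_width`, and `slack` are all invented for illustration. Only the two constraints stated in the abstract are taken from the source: a left delimiter ends with ">" and a right delimiter begins with "<".

```python
from collections import Counter


def pick_delimiters(docs, rough_spans, left_width=4, right_width=5, slack=3):
    """Pick the most frequent delimiter candidates near rough field boundaries.

    docs        -- list of document strings with a similar format
    rough_spans -- one (start, end) pair per document: an approximate position
                   of a field's content (assumed given; hypothetical input)
    Returns (left, right); either may be None if no candidate was found.
    """
    left_counts, right_counts = Counter(), Counter()
    for doc, (start, end) in zip(docs, rough_spans):
        # Left candidates: fixed-width substrings that END within `slack`
        # characters of the rough start and whose last character is ">".
        for p in range(max(left_width, start - slack),
                       min(len(doc), start + slack) + 1):
            candidate = doc[p - left_width:p]
            if candidate.endswith(">"):
                left_counts[candidate] += 1
        # Right candidates: fixed-width substrings that BEGIN within `slack`
        # characters of the rough end and whose first character is "<".
        for q in range(max(0, end - slack),
                       min(len(doc) - right_width, end + slack) + 1):
            candidate = doc[q:q + right_width]
            if candidate.startswith("<"):
                right_counts[candidate] += 1
    left = left_counts.most_common(1)[0][0] if left_counts else None
    right = right_counts.most_common(1)[0][0] if right_counts else None
    return left, right


def extract_field(doc, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    i = doc.find(left)
    if i < 0:
        return None
    i += len(left)
    j = doc.find(right, i)
    return doc[i:j] if j >= 0 else None


if __name__ == "__main__":
    docs = [
        "<html><h1>Storm hits coast</h1><p>Rain fell.</p></html>",
        "<html><h1>Markets rally</h1><p>Stocks rose.</p></html>",
    ]
    # Deliberately imprecise guesses for where each headline lies.
    rough_spans = [(12, 24), (11, 21)]
    left, right = pick_delimiters(docs, rough_spans)
    for doc in docs:
        print(extract_field(doc, left, right))
```

The fixed candidate widths keep the sketch short. A fuller version would need to consider candidates of varying length and to compute the rough division itself, neither of which the abstract specifies.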

Author information

Authors and Affiliations

Graduate School of Information Science and Electrical Engineering, Kyushu University, 812-8581, Fukuoka, Japan
Yasuhiro Yamada
Computing and Communications Center, Kyushu University, 812-8581, Fukuoka, Japan
Daisuke Ikeda & Sachio Hirokawa

Editor information

Editors and Affiliations

DFKI GmbH Saarbrücken, 66123, Saarbrücken, Germany
Klaus P. Jantke
Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Ayumi Shinohara

About this paper

Cite this paper

Yamada, Y., Ikeda, D., Hirokawa, S. (2001). SCOOP: A Record Extractor without Knowledge on Input. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science, vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_45

DOI: https://doi.org/10.1007/3-540-45650-3_45
Published: 20 December 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42956-2
Online ISBN: 978-3-540-45650-6
eBook Packages: Springer Book Archive
