Automatic Web Information Extraction in the RoadRunner System

Crescenzi, Valter; Mecca, Giansalvatore; Merialdo, Paolo

doi:10.1007/3-540-46140-X_21

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2465)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

This paper presents RoadRunner, a research project that aims to develop solutions for automatically extracting data from large HTML data sources. Our research targets data-intensive Web sites, i.e., HTML-based sites with a fairly complex structure that publish large amounts of data. The paper describes the top-level software architecture of the RoadRunner system and the novel research challenges posed by the attempt to automate the information extraction process.

Author information

Authors and Affiliations

D.I.A., Università di Roma Tre, Italy
Valter Crescenzi & Paolo Merialdo
D.I.F.A., Università della Basilicata, Italy
Giansalvatore Mecca

Editor information

Editors and Affiliations

Graduate School of Environment and Information Sciences, Yokohama National University, 79-7, Tokiwadai, Hodogaya-ku, Yokohama, 240-8501, Japan
Hiroshi Arisawa
Department of Social Informatics, Graduate School of Informatics, Kyoto University, Yoshida, Sakyo, Kyoto, 606-8501, Japan
Yahiko Kambayashi
SICE Computer Networking, University of Missouri-Kansas City, 5100 Rockhill Road, Kansas City, MO, 64110, USA
Vijay Kumar
University of Klagenfurt, Universitätsstraße 65-67, 9020, Klagenfurt, Austria
Heinrich C. Mayr
VP Industry Services DAMA International, PO Box 5786, Bellevue, WA, 98006-5786, USA
Ingrid Hunt

About this paper

Cite this paper

Crescenzi, V., Mecca, G., Merialdo, P. (2002). Automatic Web Information Extraction in the RoadRunner System. In: Arisawa, H., Kambayashi, Y., Kumar, V., Mayr, H.C., Hunt, I. (eds) Conceptual Modeling for New Information Systems Technologies. ER 2001. Lecture Notes in Computer Science, vol 2465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46140-X_21

DOI: https://doi.org/10.1007/3-540-46140-X_21
Published: 13 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44122-9
Online ISBN: 978-3-540-46140-1
eBook Packages: Springer Book Archive
