Abstract
This paper presents RoadRunner, a research project that aims to develop solutions for automatically extracting data from large HTML data sources. Our research targets data-intensive Web sites, i.e., HTML-based sites with a fairly complex structure that publish large amounts of data. The paper describes the top-level software architecture of the RoadRunner system and the novel research challenges posed by the attempt to automate the information extraction process.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Crescenzi, V., Mecca, G., Merialdo, P. (2002). Automatic Web Information Extraction in the RoadRunner System. In: Arisawa, H., Kambayashi, Y., Kumar, V., Mayr, H.C., Hunt, I. (eds) Conceptual Modeling for New Information Systems Technologies. ER 2001. Lecture Notes in Computer Science, vol 2465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46140-X_21
DOI: https://doi.org/10.1007/3-540-46140-X_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44122-9
Online ISBN: 978-3-540-46140-1
eBook Packages: Springer Book Archive