Automatic Web Information Extraction in the RoadRunner System

  • Conference paper
Conceptual Modeling for New Information Systems Technologies (ER 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2465)

Abstract

This paper presents RoadRunner, a research project that aims to develop solutions for automatically extracting data from large HTML data sources. The research targets data-intensive Web sites, i.e., HTML-based sites with a fairly complex structure that publish large amounts of data. The paper describes the top-level software architecture of the RoadRunner system and the novel research challenges posed by the attempt to automate the information extraction process.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Crescenzi, V., Mecca, G., Merialdo, P. (2002). Automatic Web Information Extraction in the RoadRunner System. In: Arisawa, H., Kambayashi, Y., Kumar, V., Mayr, H.C., Hunt, I. (eds) Conceptual Modeling for New Information Systems Technologies. ER 2001. Lecture Notes in Computer Science, vol 2465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46140-X_21

  • DOI: https://doi.org/10.1007/3-540-46140-X_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44122-9

  • Online ISBN: 978-3-540-46140-1

  • eBook Packages: Springer Book Archive
