Toolkits for Generating Wrappers

A Survey of Software Toolkits for Automated Data Extraction from Web Sites

  • Conference paper
Objects, Components, Architectures, Services, and Applications for a Networked World (NODe 2002)

Abstract

Various web applications in e-business, such as online price comparisons, competition monitoring and personalised newsletters, require the retrieval of distributed information from the Internet. This paper examines the suitability of software toolkits for the extraction of data from web sites. The term wrapper is defined and an overview of presently available toolkits for generating wrappers is provided. In order to give a better insight into the workings of such toolkits, a detailed analysis of the non-commercial software program LAPIS is presented. An example application using this toolkit demonstrates how acceptable results can be achieved with relative ease. The functionality of the program is compared with that of the commercial toolkit RoboMaker and the differences are highlighted. With improved ease of use and faster wrapper generation in mind, possible areas for further development of toolkits for automated web data extraction are discussed.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kuhlins, S., Tredwell, R. (2003). Toolkits for Generating Wrappers. In: Aksit, M., Mezini, M., Unland, R. (eds) Objects, Components, Architectures, Services, and Applications for a Networked World. NODe 2002. Lecture Notes in Computer Science, vol 2591. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36557-5_15

  • DOI: https://doi.org/10.1007/3-540-36557-5_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00737-1

  • Online ISBN: 978-3-540-36557-0

  • eBook Packages: Springer Book Archive
