Abstract
Various web applications in e-business, such as online price comparisons, competition monitoring and personalised newsletters, require the retrieval of distributed information from the Internet. This paper examines the suitability of software toolkits for extracting data from web sites. The term wrapper is defined and an overview of currently available toolkits for generating wrappers is provided. To give better insight into how such toolkits work, a detailed analysis of the non-commercial program LAPIS is presented. An example application built with this toolkit demonstrates how acceptable results can be achieved with relative ease. The functionality of LAPIS is compared with that of the commercial toolkit RoboMaker, and the differences are highlighted. Finally, possible areas for further development of toolkits for automated web data extraction are discussed, with the aim of improving ease of use and speeding up wrapper generation.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kuhlins, S., Tredwell, R. (2003). Toolkits for Generating Wrappers. In: Aksit, M., Mezini, M., Unland, R. (eds) Objects, Components, Architectures, Services, and Applications for a Networked World. NODe 2002. Lecture Notes in Computer Science, vol 2591. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36557-5_15
DOI: https://doi.org/10.1007/3-540-36557-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00737-1
Online ISBN: 978-3-540-36557-0
eBook Packages: Springer Book Archive