Paper: A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping

Authors: Stefan Huber; Fabio Knoll and Mario Döller

Affiliation: University of Applied Sciences Kufstein, Andreas Hofer-Straße 7, 6330 Kufstein, Austria

Keyword(s): Web Scraping, Data Pipelines, Fault-tolerant Execution.

Abstract: Web scraping is a widely used technique to extract unstructured data from different websites and transform it into a unified and structured form. Due to the nature of the WWW, long-term and continuous web scraping is a volatile and error-prone endeavor. Setting up a reliable extraction procedure comes with various challenges. In this paper, a system design and implementation for a pipeline-oriented approach to web scraping is proposed. The main goal of the proposal is to establish fault-tolerant execution of web scraping tasks with proper error handling strategies in place. As errors are prevalent in web scraping, logging and error replication procedures are part of the processing pipeline. These mechanisms allow web scraper implementations to be adapted effectively to evolving website targets. An implementation of the system was evaluated in a real-world case study, where thousands of web pages were scraped and processed on a daily basis. The results indicated that the system allows for effectively operating reliable and long-term web scraping endeavors.

CC BY-NC-ND 4.0

Paper citation in several formats:
Huber, S.; Knoll, F. and Döller, M. (2022). A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping. In Proceedings of the 17th International Conference on Software Technologies - ICSOFT; ISBN 978-989-758-588-3; ISSN 2184-2833, SciTePress, pages 441-448. DOI: 10.5220/0011275100003266

@conference{icsoft22,
author={Stefan Huber and Fabio Knoll and Mario Döller},
title={A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping},
booktitle={Proceedings of the 17th International Conference on Software Technologies - ICSOFT},
year={2022},
pages={441-448},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011275100003266},
isbn={978-989-758-588-3},
issn={2184-2833},
}

TY - CONF
JO - Proceedings of the 17th International Conference on Software Technologies - ICSOFT
TI - A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping
SN - 978-989-758-588-3
IS - 2184-2833
AU - Huber, S.
AU - Knoll, F.
AU - Döller, M.
PY - 2022
SP - 441
EP - 448
DO - 10.5220/0011275100003266
PB - SciTePress