Paper

Authors: Thomas Osterland 1,2 and Thomas Rose 1,2

Affiliations: 1 Fraunhofer FIT, Schloss Birlinghoven 1, Sankt Augustin, Germany; 2 RWTH Aachen University, Ahornstr. 55, Aachen, Germany

Keyword(s): Web Scraping, Periodicity Analysis, Content Extraction.

Abstract: The comprehensive analysis of large data volumes is shaping the future. It enables decision-making based on empirical evidence instead of expert experience, and its use for training machine learning models enables new use cases in image recognition, speech analysis, or regression and classification. One problem with data is that it is often not readily available in aggregated form. Instead, it is necessary to search the web for information and laboriously mine websites for specific data. This is known as web scraping. In this paper we present an interactive, scoring-based approach for scraping specific information from websites. We propose a scoring function that enables the adaptation of threshold values to select specific sets of data. We combine the scoring of paths in a web page's DOM with periodicity analysis to enable the selection of complex patterns in structured data. This allows non-expert users to train content selection models and to label classification data for supervised learning.

CC BY-NC-ND 4.0

Paper citation in several formats:
Osterland, T. and Rose, T. (2022). Scoring-based DOM Content Selection with Discrete Periodicity Analysis. In Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 2: ICEIS; ISBN 978-989-758-569-2; ISSN 2184-4992, SciTePress, pages 280-289. DOI: 10.5220/0011116300003179

@conference{iceis22,
author={Thomas Osterland and Thomas Rose},
title={Scoring-based DOM Content Selection with Discrete Periodicity Analysis},
booktitle={Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2022},
pages={280-289},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011116300003179},
isbn={978-989-758-569-2},
issn={2184-4992},
}

TY - CONF

JO - Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - Scoring-based DOM Content Selection with Discrete Periodicity Analysis
SN - 978-989-758-569-2
IS - 2184-4992
AU - Osterland, T.
AU - Rose, T.
PY - 2022
SP - 280
EP - 289
DO - 10.5220/0011116300003179
PB - SciTePress