A Dynamic Approach for Template and Content Extraction in Websites

Cristian-Catalin, Nicolae; Dragan, Mihaita

doi:10.1007/978-3-030-45688-7_2

Nicolae Cristian-Catalin & Mihaita Dragan

Conference paper
First Online: 18 May 2020

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1159)

Abstract

Web scraping is a technique for extracting data from websites, and it is a pillar of information retrieval on an ever-growing World Wide Web. There are two main ways of extracting data from a website: static and dynamic scraping. Static scraping requires input beyond the target website, because the user must inspect the target's HTML content and find patterns in its templates that are then used to extract data. Static scraping is also very vulnerable to changes in the template of the web page. Dynamic scraping is a very broad topic that has been tackled from many different angles: tree-based, natural language processing (NLP), computer vision, and machine learning techniques. For most websites, the problem can be broken into two main steps: finding the template of the pages we want to extract data from, and then removing irrelevant text such as ads, text from controls, or JavaScript code.

This paper proposes a solution for dynamic scraping that uses AngleSharp for HTML retrieval and a slightly modified version of a previously published graph technique for template finding. Once a number of pages have been found, several heuristics can be applied for content extraction and noise filtering. These heuristics can include text and hyperlink density, removal of content common to multiple pages (usually text from controls or static JavaScript), and a final layer of NLP techniques (breaking the content into sentences, tokenization, and part-of-speech tagging).
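As a rough illustration of two of the heuristics named above (hyperlink density and removal of content shared between pages), the following Python sketch may help. It is not the authors' implementation: the paper uses AngleSharp, whereas this sketch uses only Python's standard `html.parser`, and the names `BlockCollector`, `extract_blocks`, `main_content`, as well as the thresholds `max_link_density` and `min_chars`, are assumptions made for the example.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and tracks how much of each block is link text."""

    BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "section", "article"}
    SKIP_TAGS = {"script", "style"}  # their content is never treated as text

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (normalized_text, link_char_count)
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush_block(self):
        # Close the current block: normalize whitespace and store it if non-empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in self.BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in self.BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_blocks(html):
    """Returns the (text, link_char_count) blocks of one HTML page."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    return parser.blocks


def main_content(pages, max_link_density=0.5, min_chars=20):
    """Keeps blocks that are not link-heavy, not too short, and not shared by all pages."""
    per_page = [extract_blocks(page) for page in pages]
    common = set()
    if len(per_page) > 1:
        # Text that appears on every page is assumed to be template noise.
        common = set.intersection(*({text for text, _ in blocks} for blocks in per_page))
    result = []
    for blocks in per_page:
        kept = []
        for text, link_chars in blocks:
            density = min(1.0, link_chars / len(text))
            if text in common or density > max_link_density or len(text) < min_chars:
                continue
            kept.append(text)
        result.append(kept)
    return result
```

Calling `main_content` with the HTML of several pages that share one template returns, for each page, the blocks that survive all three filters. The thresholds are illustrative and would need tuning per site; the NLP layer (sentence splitting, tokenization, part-of-speech tagging) is not covered here.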

References

Alarte, J., Insa, D., Silva, J., Tamarit, S.: Web template extraction based on hyperlink analysis. In: Escobar, S. (ed.): XIV Jornadas sobre Programación Y Lenguajes, PROLE 2014, Revised Selected Papers EPTCS, vol. 173, pp. 16–26 (2015)
Liu, Q., Shao, M., Wu, L., Zhao, G., Fan, G.: Main content extraction from web pages based on node characteristics. J. Comput. Sci. Eng. 11(2), 39–48 (2017)
Ferrara, E., De Meo, P., Fiumara, G., Baumgartner, R.F.: Web Data Extraction, Applications and Techniques: A Survey. arXiv:1207.0246v4 [cs.IR], 10 June 2014
OpenNLP Documentation. https://opennlp.apache.org/docs/. Accessed 2 Nov 2019
Uzun, E., Doruk, A., Nusret Buluş, H., Özhan, E.: Evaluation of HAP, AngleSharp and HTML document in web content extraction. In: International Scientific Conference, Gabrovo, 18 November 2017 (2017)
Soderland, S.: Learning information extraction rules for semi-structured and free text. Mach. Learn. 34(1–3), 233–272 (1999)
Muslea, I., Minton, S., Knoblock, C.: Hierarchical wrapper induction for semistructured information sources. Auton. Agents Multi-Agent Syst. 93–114 (2001). https://doi.org/10.1023/A%3A1010022931168

Author information

Authors and Affiliations

Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
Nicolae Cristian-Catalin & Mihaita Dragan

Corresponding author

Correspondence to Nicolae Cristian-Catalin.

Editor information

Editors and Affiliations

Departamento de Engenharia Informática, Universidade de Coimbra, Coimbra, Portugal
Álvaro Rocha
College of Engineering, The Ohio State University, Columbus, OH, USA
Hojjat Adeli
FEUP, Universidade do Porto, Porto, Portugal
Luís Paulo Reis
DIMES, Università della Calabria, Arcavacata, Italy
Sandra Costanzo
Faculty of Electrical Engineering, University of Montenegro, Podgorica, Montenegro
Irena Orovic
Universidade Portucalense, Porto, Portugal
Fernando Moreira

About this paper

Cite this paper

Cristian-Catalin, N., Dragan, M. (2020). A Dynamic Approach for Template and Content Extraction in Websites. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S., Orovic, I., Moreira, F. (eds) Trends and Innovations in Information Systems and Technologies. WorldCIST 2020. Advances in Intelligent Systems and Computing, vol 1159. Springer, Cham. https://doi.org/10.1007/978-3-030-45688-7_2

DOI: https://doi.org/10.1007/978-3-030-45688-7_2
Published: 18 May 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45687-0
Online ISBN: 978-3-030-45688-7
eBook Packages: Intelligent Technologies and Robotics (R0)
